JDK-8212618: A low-cost, always on statistical value history

    • Type: Enhancement
    • Resolution: Won't Fix
    • Priority: P3
    • Fix Version/s: 12
    • Component/s: hotspot
    • Subcomponent: svc

      SAP developed a supportability feature called "Statistics History", which is a low-cost history of statistical values of interest.

      These values cover parameters of the JVM (e.g. heap size, metaspace size, number of loaded classes) and of the underlying platform (e.g. RSS, swapping state, run queue length, etc.). At intervals of (by default) 60 seconds these values are measured and stored in a FIFO buffer.

      The FIFO buffer has three parts: a short-, medium-, and long-term FIFO. A fraction of the samples falling out of the short-term FIFO is transferred to the mid-term FIFO; again, a fraction of the samples falling out of the mid-term FIFO is transferred to the long-term FIFO. So, the short-term FIFO covers a short recent timespan (usually an hour) at comparatively short sample intervals (usually 60 seconds), whereas the long-term FIFO covers a very long time span (~10 days) with interval times of hours.
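
      The mechanism can be sketched roughly as follows (a hypothetical Java sketch for illustration only; the actual patch is C++ inside HotSpot, and all names and tier ratios below are invented; the real feature also preallocates its buffers upfront, which this sketch simplifies away):

      import java.util.ArrayDeque;

      // Hypothetical sketch of a three-tier sample history with downsampling.
      final class TieredHistory {
          static final class Tier {
              final ArrayDeque<long[]> buf = new ArrayDeque<>();
              final int capacity;      // max samples kept in this tier
              final int promoteEvery;  // promote every Nth evicted sample
              final Tier next;         // tier that receives promoted samples
              int evicted;
              Tier(int capacity, int promoteEvery, Tier next) {
                  this.capacity = capacity;
                  this.promoteEvery = promoteEvery;
                  this.next = next;
              }
              void add(long[] sample) {
                  buf.addLast(sample);
                  if (buf.size() > capacity) {
                      long[] old = buf.removeFirst();
                      // Promote a fraction of the evicted samples; drop the rest.
                      if (next != null && ++evicted % promoteEvery == 0) {
                          next.add(old);
                      }
                  }
              }
          }

          // Illustrative ratios: ~1h at 60s resolution in the short tier,
          // ~1d at 30min in the mid tier, ~10d at 1h in the long tier.
          private final Tier longTerm  = new Tier(240, 0, null);
          private final Tier midTerm   = new Tier(48, 2, longTerm);
          private final Tier shortTerm = new Tier(60, 30, midTerm);

          // Called by a single sampler thread at the base interval (e.g. 60s).
          void addSample(long[] values) {
              shortTerm.add(values);
          }
      }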

      This feature has been very popular with our support folks, so we would like to contribute it. It enables us to easily analyze slowly developing situations like memory leaks, memory or CPU spikes, resource starvation, etc.

      The aim of this feature is not to replace "real" profilers like JMC; rather, to be a cheap, always-on first stop for getting a rough idea of what is going on.


          Thomas Stuefe added a comment -
          After discussion on serviceability-dev it has been decided not to bring this patch upstream.

          For details, see: http://mail.openjdk.java.net/pipermail/serviceability-dev/2018-November/025909.html and https://github.com/tstuefe/ojdk-stathist-patch.


          Erik Gahlin added a comment -
          > The intent of this feature is to be very simple, very cheap and very robust, to the point that it can be left on by default and basically forgotten;

          That was the intent with JFR as well.

          > it is designed specifically to continue working in crash situations when nothing else would (the history is, in our port, printed as part of error reporting - which was once its main purpose - and now also available via jcmd).

          JFR can handle crash situations. It will write the data held in its buffers to an emergency dump next to the hs_err file. There are also jcmd commands for JFR that you can use to start, stop, configure and dump recordings. That said, there are probably improvements that can be made.
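
          For illustration, the jcmd workflow is roughly as follows (<pid> and the recording name are placeholders; JFR.start, JFR.check, JFR.dump and JFR.stop are the relevant commands):

          jcmd <pid> JFR.start name=rec maxage=1h
          jcmd <pid> JFR.check
          jcmd <pid> JFR.dump name=rec filename=rec.jfr
          jcmd <pid> JFR.stop name=rec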

          > The current version of our statistics takes a few dozen kB of memory footprint (allocated upfront, so it can never run out of memory); the CPU footprint is not measurable.

          JFR requires more than a few dozen kB. The amount of buffer memory used can be configured. I am not sure what the minimum would be in the current implementation, but perhaps 1-2 MB.
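
          For reference, the buffer budget can be set with the memorysize option of -XX:FlightRecorderOptions, for example (whether a value this small is accepted may vary by JDK version):

          java -XX:FlightRecorderOptions=memorysize=2m -XX:StartFlightRecording:maxage=1h ...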

          > I'm not sure how that compares to JFR. The latter strikes me as way more expensive, more fragile (since it uses more infrastructure, which may be corrupted in crash situations, and allocates its resources dynamically) and hence less suited for crash analysis.

          JFR has been used in production for a decade to solve JVM issues. The overhead depends on which events are enabled. For instance, GC and compiler events incur very little overhead, probably not measurable.

          > Workflow is also different - our statistic is strictly terminal-oriented. JFR seems to be geared more towards full-blown Java performance analysis, less towards VM introspection.

          JFR was built for troubleshooting the JVM. Other features have been added over the years, but it is not necessary to enable them.


          Thomas Stuefe added a comment -
          Hi Erik,

          I think our statistical history plays in a completely different league from JFR. The intent of this feature is to be very simple, very cheap and very robust, to the point that it can be left on by default and basically forgotten; it is designed specifically to continue working in crash situations when nothing else would (the history is, in our port, printed as part of error reporting - which was once its main purpose - and now also available via jcmd). The current version of our statistics takes a few dozen kB of memory footprint (allocated upfront, so it can never run out of memory); the CPU footprint is not measurable.

          I'm not sure how that compares to JFR. The latter strikes me as way more expensive, more fragile (since it uses more infrastructure, which may be corrupted in crash situations, and allocates its resources dynamically) and hence less suited for crash analysis. Workflow is also different - our statistic is strictly terminal-oriented. JFR seems to be geared more towards full-blown Java performance analysis, less towards VM introspection.

          What I can see is that we may be able to share code at lower levels, e.g. parsing proc file systems. There are a lot of similarities.

          My patch is basically done and has just run successfully through jdk-submit. I will post it as an RFC shortly; then we can discuss this better.

          Thanks, Thomas


          Erik Gahlin added a comment - - edited
          Flight Recorder can almost be configured to do this. You would create three recordings, one for each time interval:

          java
          -XX:StartFlightRecording:maxage=10d,settings=long.jfc
          -XX:StartFlightRecording:maxage=1d,settings=medium.jfc
          -XX:StartFlightRecording:maxage=60m,settings=short.jfc ...

          In the long recording you would set the period for the CPULoad event to 1 hour, and in the short one you would set it to 1 second. What it would not be able to do is downsample the data. The long recording would retain data for ten days, including data from the short recording, but maybe an option could be added so it could remove data, even though the maxage is 10 days.

          -XX:StartFlightRecording:maxage=10d,downsample=true
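
          For reference, the per-event periods mentioned above would live in the .jfc settings files; a minimal fragment for the short recording could look like this (structure as in the stock default.jfc; jdk.CPULoad is the relevant event):

          <?xml version="1.0" encoding="UTF-8"?>
          <configuration version="2.0" label="short">
            <event name="jdk.CPULoad">
              <setting name="enabled">true</setting>
              <setting name="period">1 s</setting>
            </event>
          </configuration>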
           


          Thomas Stuefe added a comment -
          Please note that this issue is to track work on a prototype, and to get an issue number so the patch can be run through jdk-submit.


            Assignee: Thomas Stuefe (stuefe)
            Reporter: Thomas Stuefe (stuefe)
            Votes: 0
            Watchers: 2