Proposal: Always-on Statistical History

Wed Nov 14 14:57:47 UTC 2018

Hi all,

We have a feature in our port which we would like to contribute,
and I would like to gauge opinions.

First off, I am not sure which list is correct. This is more of a
serviceability issue, but implementation-wise it fits hs-runtime
better. I'll start with serviceability, but feel free to crosspost if
needed.

Second, I am aware that this may require a JEP. If necessary and the
feedback is positive, I will draft one.

---

In our port we have something called "Statistics History". Basically
this is a rolling history, spanning up to 10 days, of a number of key
values. Key values range from JVM specifics like heap size, metaspace
size, number of threads etc., to platform specifics like memory
footprint, CPU load, I/O and swapping activity etc.
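
To make "key values" a bit more concrete, here is a rough Java-level
sketch of what one sample could contain. It is not taken from the
patch (which collects its values inside the VM); it only uses the
standard java.lang.management beans, and the class name StatSample and
the choice of fields are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Hypothetical sample record; the field selection is an assumption,
// not the set of values the actual feature records.
final class StatSample {
    final long timestampMillis;
    final long heapCommittedBytes;
    final long heapUsedBytes;
    final long nonHeapUsedBytes;
    final int liveThreads;
    final double systemLoadAverage; // negative if the platform cannot report it

    private StatSample(long timestampMillis, long heapCommittedBytes,
                       long heapUsedBytes, long nonHeapUsedBytes,
                       int liveThreads, double systemLoadAverage) {
        this.timestampMillis = timestampMillis;
        this.heapCommittedBytes = heapCommittedBytes;
        this.heapUsedBytes = heapUsedBytes;
        this.nonHeapUsedBytes = nonHeapUsedBytes;
        this.liveThreads = liveThreads;
        this.systemLoadAverage = systemLoadAverage;
    }

    // Takes one snapshot of a few JVM and platform values.
    static StatSample takeNow() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return new StatSample(
                System.currentTimeMillis(),
                heap.getCommitted(),
                heap.getUsed(),
                memory.getNonHeapMemoryUsage().getUsed(),
                threads.getThreadCount(),
                os.getSystemLoadAverage());
    }
}
```

A sample like this is only a handful of numbers, which is what makes
keeping many of them around affordable.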

A periodic task collects those values at (by default) 15 second
intervals. They are then fed into a FIFO which spans 10 days. To save
memory, that FIFO is downsampled in two steps, so we have the last n
hours in high resolution and the last n days in low resolution (of
course all these parameters are configurable).
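
To illustrate why the downsampling matters: at one sample every 15
seconds, 10 days at full resolution would be 10 * 24 * 3600 / 15 =
57,600 samples. The following is a minimal Java sketch of the idea,
not the code from the patch. It tracks a single numeric value in two
tiers; the class name DownsampledHistory, the averaging strategy and
all parameters are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-tier history for one numeric value.
// Samples that fall out of the high-resolution window are averaged
// in groups of `ratio` and kept in a smaller low-resolution window.
final class DownsampledHistory {
    private final ArrayDeque<Long> recent = new ArrayDeque<>(); // high resolution
    private final ArrayDeque<Long> older = new ArrayDeque<>();  // low resolution
    private final int recentCapacity;
    private final int olderCapacity;
    private final int ratio;
    private long pendingSum = 0;  // sum of evicted samples not yet folded
    private int pendingCount = 0; // number of evicted samples not yet folded

    DownsampledHistory(int recentCapacity, int olderCapacity, int ratio) {
        if (recentCapacity < 1 || olderCapacity < 1 || ratio < 1) {
            throw new IllegalArgumentException("capacities and ratio must be >= 1");
        }
        this.recentCapacity = recentCapacity;
        this.olderCapacity = olderCapacity;
        this.ratio = ratio;
    }

    // Adds one raw sample; the oldest data is dropped once both tiers are full.
    void add(long value) {
        recent.addLast(value);
        if (recent.size() > recentCapacity) {
            pendingSum += recent.removeFirst();
            pendingCount++;
            if (pendingCount == ratio) {
                older.addLast(pendingSum / ratio); // integer average of the group
                if (older.size() > olderCapacity) {
                    older.removeFirst();
                }
                pendingSum = 0;
                pendingCount = 0;
            }
        }
    }

    // Oldest-first copies of each tier, e.g. for printing a report.
    List<Long> recentSamples() { return new ArrayList<>(recent); }
    List<Long> olderSamples() { return new ArrayList<>(older); }
}
```

As an example of sizing (these numbers are not the port's defaults):
with a 15 second interval, a ratio of 40 would yield one low-resolution
entry per 10 minutes. Averaging is only one possible way to fold
samples; keeping the last value or the maximum of each group would be
equally plausible, depending on the value in question.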

The history report can be triggered via jcmd, and could also be
printed in the hs_err file (open for debate).

---

Here are some examples of what the whole thing looks like:

http://cr.openjdk.java.net/~stuefe/webrevs/stathist/examples/stathist-volker.txt

http://cr.openjdk.java.net/~stuefe/webrevs/stathist/examples/stathist-s390x.txt

---

This feature has been really popular with our support folk over the
years. Be it that the VM is starved for resources by the OS, or that
we have some slowly or quickly developing leak: these values are a
cheap and easy way to get a first impression of a situation before we
start more expensive analysis.

The explicit design goal of this history was to be very cheap - cheap
enough to be *always on* and then forgotten about. It is, in our port,
enabled by default. That way, if a problem occurs at a customer site,
we immediately see developments spanning the last 10 days, without
having to reproduce the issue.

It is also robust enough to be usable during error reporting without
endangering the error reporting process or falsifying the picture.

I am aware that this crosses over into JFR territory. But this feature
does not attempt to replace JFR; it is intended instead as a cheap,
always-on first stop for a historical overview.

---

I have a patch which can be applied on top of jdk12:

http://cr.openjdk.java.net/~stuefe/webrevs/stathist/stathist.patch

It works, passes our nightly tests, and shows no regressions in the
DaCapo benchmarks.

Please tell me what you think. Given enough interest, I will attempt
to contribute (drafting a JEP if necessary).

Thanks and Kind Regards,

Thomas