[RFR]: Per thread IO statistics in JFR

Haug, Gunter gunter.haug at sap.com
Wed Jan 23 15:59:45 UTC 2019


Hi Alan, David

Here are the results of the measurements. As mentioned before, I had measured some of the standard Java benchmarks and there was no impact. This is probably not surprising to anybody, as the overhead of the up-calls is minuscule in comparison to the calculations that take place and is far below the noise level.

A benchmark most prone to showing the impact of the up-calls would perform fast (short) IO operations. The fastest operations I could think of were writing to /dev/null and reading from /dev/zero on Linux or macOS. I wrote benchmarks which write/read just one byte one million times in a loop using the jmh framework (https://openjdk.java.net/projects/code-tools/jmh/).
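The measurement loop can be sketched roughly as follows. This is a simplified, self-contained version without the jmh harness; it writes to OutputStream.nullOutputStream() instead of /dev/null so it runs anywhere, and the class name and loop count are illustrative, not the actual benchmark code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Simplified sketch of the write benchmark: one single-byte write,
// one million times. The real benchmark uses the jmh framework and
// writes to /dev/null; OutputStream.nullOutputStream() (JDK 11+) is
// used here for portability.
public class SingleByteWriteBench {
    static final int ITERATIONS = 1_000_000;

    // Returns throughput in jmh's terms: one "op" is the whole loop,
    // so the score corresponds to loop executions per second.
    public static double run() {
        try (OutputStream sink = OutputStream.nullOutputStream()) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                sink.write(0);              // one single-byte write per iteration
            }
            long elapsedNanos = System.nanoTime() - start;
            return 1e9 / elapsedNanos;      // ops/s
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ops/s: " + run());
    }
}
```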

The results measured on my mac were all over the place, so I moved to an otherwise empty Linux host, where the results were more stable. jmh runs the benchmark multiple times for each invocation and calculates an error within one invocation. It turned out that invocations with a large error showed poorer performance, obviously because something happened on the host in parallel which degraded performance. I invoked the benchmarks multiple times and decided to discard measurements with an error of more than 0.01. This value is rather arbitrary, but it was achieved by about one out of four invocations and I think we can get rid of the outliers this way.
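The discard rule itself is trivial; expressed in Java terms (the Result record and class name are hypothetical, the 0.01 threshold is the one from the text):

```java
import java.util.List;

// Hypothetical representation of one jmh invocation: score and error in ops/s.
record Result(double score, double error) {}

public class InvocationFilter {
    static final double MAX_ERROR = 0.01;

    // Keep only invocations whose reported error is at most 0.01,
    // as described above; everything else is treated as an outlier.
    public static List<Result> stable(List<Result> all) {
        return all.stream().filter(r -> r.error() <= MAX_ERROR).toList();
    }
}
```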

It turned out that even with this setup the impact of the up-calls is below the noise level. In fact, the write benchmark was faster with the up-calls, which is not plausible, of course. The read benchmark showed both the best and the poorest performance without the up-calls, while the results for a JDK with the up-calls fell in between (see below for the actual results).

As Volker writes, we still think that the overhead of the up-calls is acceptable, but it would be possible to introduce a switch based on a system property.
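On the Java side, such a switch could look roughly like the sketch below. The property name "jdk.io.statistics" and all class/method names are made up for illustration; this is not the proposed implementation:

```java
// Sketch of a system-property based switch for the statistics up-calls.
// The property name "jdk.io.statistics" is purely illustrative.
public class IoStatistics {
    // Read once at startup: -Djdk.io.statistics=true enables the counters.
    private static final boolean ENABLED = Boolean.getBoolean("jdk.io.statistics");

    // Per-thread byte counter, stored in a one-element array so it can
    // be incremented in place.
    private static final ThreadLocal<long[]> BYTES_WRITTEN =
            ThreadLocal.withInitial(() -> new long[1]);

    public static void recordWrite(long bytes) {
        if (ENABLED) {                  // constant branch after JIT compilation
            BYTES_WRITTEN.get()[0] += bytes;
        }
    }

    public static long bytesWritten() {
        return ENABLED ? BYTES_WRITTEN.get()[0] : 0L;
    }
}
```

Because ENABLED is a static final boolean, the JIT can fold the disabled branch away entirely, which is why the overhead can be dropped to ~zero when the switch is off.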

@Thomas, thanks for your input! I think all your suggestions are well founded. I'll reply in more detail if/when we can alleviate the more general concerns.

Best regards,
Gunter

Write without up-call
Mode  Cnt  Score   Error  Units
thrpt   25  1.867   0.005  ops/s
thrpt   25  1.866   0.006  ops/s
thrpt   25  1.865   0.004  ops/s
thrpt   25  1.860   0.010  ops/s

Write with up-call
Mode  Cnt  Score   Error  Units
thrpt   25  1.880   0.010  ops/s
thrpt   25  1.884   0.007  ops/s
thrpt   25  1.887   0.004  ops/s
thrpt   25  1.879   0.010  ops/s


Read without up-calls
Mode  Cnt  Score   Error  Units
thrpt   25  2.145   0.008  ops/s
thrpt   25  2.145   0.010  ops/s
thrpt   25  2.129   0.010  ops/s
thrpt   25  2.122   0.009  ops/s

Read with up-calls
Mode  Cnt  Score   Error  Units
thrpt   25  2.136   0.007  ops/s
thrpt   25  2.132   0.003  ops/s
thrpt   25  2.130   0.006  ops/s
thrpt   25  2.129   0.009  ops/s


On 17.01.19, 18:31, "Volker Simonis" <volker.simonis at gmail.com> wrote:

    On Thu, Jan 17, 2019 at 1:31 PM Alan Bateman <Alan.Bateman at oracle.com> wrote:
    >
    > On 17/01/2019 07:23, Thomas Stüfe wrote:
    > > :
    > >
    > > Do you object against keeping these counters (which basically boils
    > > down to Thread::current->stat_structure->counter++)? Or do you even
    > > object against making upcalls into the jvm? Note that, if deemed
    > > necessary, we could omit updating the counters unless JFR or our
    > > extended thread dumps are activated (which are the consumers of the
    > > counters).
    > >
    > > In any case, I would have assumed the costs for upcall + counter
    > > update to be insignificant compared to the IO calls. We should of
    > > course measure that.
    > >
    > > If you generally object upcalls into the libjvm for
    > > statistical/monitoring reasons, this would make matters on a number of
    > > fronts more complicated. For instance, it was discussed extending NMT
    > > coverage to the JDK - which is already in part reality at
    > > Unsafe.AllocateMemory - and this would have to be done with upcalls too.
    > >
    > There are many issues here that will need write-up and discussion, maybe
    > a JEP if discussions converge on a proposal to bring into the main line
    > as this is a significant change with implications for many areas of the
    > platform.
    
    We could certainly create a new JEP for this feature, but in the end
    it is rather trivial. In fact it can be expressed in a single
    sentence: instrument all native IO functions to collect the number of
    read and written bytes. As stated before, the overhead is minimal
    (Gunter will provide some concrete benchmarks soon) and could even be
    dropped to ~zero if you really insist (e.g. by wrapping all calls in a
    switch based on a system property). But honestly speaking, we don't
    think this is really necessary.
    
    I want to stress the fact that this is not a wacky idea Gunter came
    up with but a feature which we have been using for almost 10 years
    in our SAP JVM. It has been running in all our enterprise scenarios
    without ever having caused any problems.
    
    > It also potentially conflicts in direction with some of the
    > other projects in progress (particularly with Loom trying to re-imagine
    > threads, do you really want to collect I/O stats on a per thread basis
    > in the future???).
    >
    
    Who knows about the future, our trace is quite useful today :) You're
    right that this may change when Loom becomes available and I think
    we've faced similar problems when people started using ThreadPools
    more heavily. We've mitigated this with another SAP JVM specific
    feature (which we haven't contributed yet) that we've called "Thread
    Annotations". They could be set from user code and our profiler could
    then split the collected statistics based on these annotations even if
    they were all collected for the same native thread. I imagine we can
    come up with similar solutions for project Loom in the future, once it
    becomes available.
    
    > As regards the points to instrument then I think we have to assume that
    > much of the native code that is targeted by the current webrev will go
    > away or change significantly in the future.
    
    But that's actually nice! That will eventually shrink the code
    proposed by this change to a few places and make the cost of
    supporting it even smaller. And needless to say, SAP is committed
    to supporting and maintaining this feature once it gets integrated.
    
    > We've been on that path for
    > some time, e.g. the zip area or the prototype to replace the SocketImpl
    > used for classic networking that eliminates a lot of the native code
    > touched in that area by the webrev. Once Panama is further along then I
    > assume we will want to make use of it in the core libraries and at least
    > initially replace the JNI methods that just wrap syscalls today, and
    > longer term more significant refactoring.
    
    Again, I think this will make our implementation simpler because we
    would have to instrument fewer places.
    
    > My point is that instrumenting
    > native methods may not be the right approach, instead maybe we should be
    > look at instrumenting the I/O paths at the java level as that will
    > likely play better with the VM.
    
    We gathered some experience with Java-level instrumentation during
    the last years (e.g. Dynatrace, Wily, etc.) and it has not always
    been positive :)  The overhead is certainly higher compared to
    native instrumentation at the C level. And it can influence JIT
    optimizations in particular in an unpredictable way. Finally, the
    Java instrumentation is version dependent and puts the burden of
    instrumenting the "right" places (and especially catching all the
    right places) on the developer / support engineer. You could of
    course offload this overhead into another, external project like
    JFR, but supporting that doesn't come for free either.
    
    > There is some support for collecting I/O
    > stats in JFR today and maybe someone working in that area can explain
    > that a bit more and what the issues are.
    >
    
    As far as I understand, the corresponding JFR events only sample the
    time spent in the instrumented calls, while we collect exact IO
    statistics with regard to the bytes read and written.
    
    > It's impossible to tell from the mail with the webrev what has been
    > explored and not explored. It feels like the early stages of a much larger
    > project that will need a write up of prototypes before a direction can
    > be proposed.
    
    As I wrote before, this is a solution we have been using
    successfully for many years. We're convinced that it is superior to
    Java-level instrumentation with regard to both overhead and
    maintainability.
    
    >
    > -Alan
    