[RFR]: Per thread IO statistics in JFR

Volker Simonis volker.simonis at gmail.com
Thu Jan 17 17:31:21 UTC 2019


On Thu, Jan 17, 2019 at 1:31 PM Alan Bateman <Alan.Bateman at oracle.com> wrote:
>
> On 17/01/2019 07:23, Thomas Stüfe wrote:
> > :
> >
> > Do you object to keeping these counters (which basically boils
> > down to Thread::current->stat_structure->counter++)? Or do you even
> > object to making upcalls into the JVM? Note that, if deemed
> > necessary, we could omit updating the counters unless JFR or our
> > extended thread dumps are activated (which are the consumers of the
> > counters).
> >
> > In any case, I would have assumed the costs for upcall + counter
> > update to be insignificant compared to the IO calls. We should of
> > course measure that.
> >
> > If you generally object to upcalls into the libjvm for
> > statistical/monitoring reasons, this would make matters on a number of
> > fronts more complicated. For instance, extending NMT coverage to the
> > JDK was discussed - which is already partly a reality for
> > Unsafe.AllocateMemory - and this would have to be done with upcalls too.
> >
> There are many issues here that will need write-up and discussion, maybe
> a JEP if discussions converge on a proposal to bring into the main line
> as this is a significant change with implications for many areas of the
> platform.

We could certainly create a new JEP for this feature, but in the end
it is rather trivial. In fact, it can be expressed in a single
sentence: instrument all native IO functions to collect the number of
bytes read and written. As stated before, the overhead is minimal
(Gunter will provide some concrete benchmarks soon) and could even be
dropped to ~zero if you really insist on it (e.g. by wrapping all
calls in a switch based on a system property). But honestly speaking,
we don't think this is really necessary.
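
To make the scheme concrete, here is a rough Java sketch of per-thread
counters behind such a property-based switch. Note this is purely an
illustration: the actual webrev does the counting in the native IO
functions via an upcall into the VM, and the class and property names
below are made up.

    // Hypothetical illustration only: the proposed change increments
    // equivalent counters in native code (roughly
    // Thread::current->stat_structure->counter++), not in Java.
    final class ThreadIOCounters {
        // Hypothetical switch; when disabled, every instrumentation
        // point collapses to a single flag check.
        private static final boolean ENABLED =
            Boolean.getBoolean("demo.threadIOStatistics");

        // One slot for bytes read, one for bytes written, per thread.
        private static final ThreadLocal<long[]> BYTES =
            ThreadLocal.withInitial(() -> new long[2]);

        static void addBytesRead(long n)    { if (ENABLED) BYTES.get()[0] += n; }
        static void addBytesWritten(long n) { if (ENABLED) BYTES.get()[1] += n; }

        static long bytesRead()    { return BYTES.get()[0]; }
        static long bytesWritten() { return BYTES.get()[1]; }
    }

When the property is not set, each call is reduced to a single branch,
which is the "~zero" overhead case mentioned above.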

I want to stress that this is not some wacky idea Gunter came up
with, but a feature we have been using in our SAP JVM for almost 10
years. It runs in all our enterprise scenarios and has never caused
any problems.

> It also potentially conflicts in direction with some of the
> other projects in progress (particularly with Loom trying to re-imagine
> threads, do you really want to collect I/O stats on a per thread basis
> in the future???).
>

Who knows about the future; our trace is quite useful today :) You're
right that this may change once Loom becomes available, and I think
we faced similar problems when people started using thread pools
more heavily. We mitigated this with another SAP JVM-specific
feature (which we haven't contributed yet) that we call "Thread
Annotations". They can be set from user code, and our profiler can
then split the collected statistics based on these annotations even
if they were all collected for the same native thread. I imagine we
can come up with similar solutions for Project Loom once it becomes
available.
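
Just to give an idea of the concept (this is not the actual SAP JVM
API, which we haven't contributed; everything below is a hypothetical
sketch), such an annotation is essentially a per-thread tag that the
collected statistics are keyed by:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: user code tags the current thread, and IO
    // bytes are accounted per tag, so work from different tasks
    // multiplexed onto the same native thread can be told apart later.
    final class AnnotatedIOStats {
        private static final ThreadLocal<String> TAG =
            ThreadLocal.withInitial(() -> "<untagged>");
        private static final Map<String, LongAdder> BYTES_READ =
            new ConcurrentHashMap<>();

        static void annotate(String tag) { TAG.set(tag); }   // set from user code

        static void addBytesRead(long n) {                    // called from the IO path
            BYTES_READ.computeIfAbsent(TAG.get(), t -> new LongAdder()).add(n);
        }
    }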

> As regards the points to instrument then I think we have to assume that
> much of the native code that is targeted by the current webrev will go
> away or change significantly in the future.

But that's actually nice! It will eventually shrink the code
proposed by this change to a few places and make the cost of
supporting it even smaller. And needless to say, SAP is committed to
supporting and maintaining this feature once it gets integrated.

> We've been on that path for
> some time, e.g. the zip area or the prototype to replace the SocketImpl
> used for classic networking that eliminates a lot of the native code
> touched in that area by the webrev. Once Panama is further along then I
> assume we will want to make use of it in the core libraries and at least
> initially replace the JNI methods that just wrap syscalls today, and
> longer term more significant refactoring.

Again, I think this will make our implementation simpler because we
would have to instrument fewer places.

> My point is that instrumenting
> native methods may not be the right approach; instead maybe we should
> look at instrumenting the I/O paths at the Java level as that will
> likely play better with the VM.

We have gathered some experience with Java-level instrumentation over
the last years (e.g. Dynatrace, Wiley, etc.) and it has not always
been positive :)  The overhead is certainly higher compared to native
instrumentation at the C level. And, in particular, it can influence
JIT optimizations in unpredictable ways. Finally, Java
instrumentation is version dependent and puts the burden of
instrumenting the "right" places (and especially of catching all the
right places) on the developer / support engineer. You could of
course offload this burden onto another, external project like JFR,
but supporting that doesn't come for free either.
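
For comparison, Java-level instrumentation typically means wrapping
the streams themselves, along the lines of the following sketch (again
hypothetical; it reuses the ThreadIOCounters class from the sketch
above). Such a wrapper has to be installed around every relevant
stream by an agent or framework and adds an extra virtual call on the
hot path:

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Minimal sketch of Java-level instrumentation: a counting wrapper
    // around an InputStream.
    class CountingInputStream extends FilterInputStream {
        CountingInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) ThreadIOCounters.addBytesRead(1);   // hypothetical counter class
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) ThreadIOCounters.addBytesRead(n);
            return n;
        }
    }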

> There is some support for collecting I/O
> stats in JFR today and maybe someone working in that area can explain
> that a bit more and what the issues are.
>

As far as I understand, the corresponding JFR events only sample the
time spent in the instrumented calls, while we collect exact IO
statistics for the bytes read and written.
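
If exact per-thread byte counts were to be exposed through JFR, one
could emit an event along the following lines, e.g. periodically or at
thread end. This is a hedged sketch using the public jdk.jfr API; the
event and field names are made up, and it reuses the hypothetical
ThreadIOCounters class from above (the emitting thread is recorded
with every JFR event anyway):

    import jdk.jfr.Event;
    import jdk.jfr.Label;
    import jdk.jfr.Name;

    // Hypothetical sketch of an exact per-thread IO statistics event.
    @Name("demo.ThreadIOStatistics")
    @Label("Thread IO Statistics")
    class ThreadIOStatisticsEvent extends Event {
        @Label("Bytes Read")
        long bytesRead;

        @Label("Bytes Written")
        long bytesWritten;

        // Fill the fields from the per-thread counters and commit.
        static void emit() {
            ThreadIOStatisticsEvent e = new ThreadIOStatisticsEvent();
            e.bytesRead = ThreadIOCounters.bytesRead();
            e.bytesWritten = ThreadIOCounters.bytesWritten();
            e.commit();
        }
    }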

> It's impossible to tell from the mail with the webrev what has been
> explored and not explored. It feels like the early stages of a much
> larger project that will need a write-up of prototypes before a direction can
> be proposed.

As I wrote before, this is a solution we have been using successfully
for many years. We're convinced that it is superior to Java-level
instrumentation with regard to both overhead and maintainability.

>
> -Alan

