Improvement for lowered throughput when allocation tracking is enabled

Gabriel Reid gabriel.reid at gmail.com
Fri Jul 31 09:14:05 UTC 2020


Hi Erik,

Thanks a lot for the background info -- it does indeed sound like both
reducing the cost of stack comparisons and downsampling would largely
resolve the performance issues with allocation tracking.

I've prepared a webrev with the improved hash code implementation
here: https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/webrev/

There's no related JIRA entry, as I don't have permission to log a ticket.
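
To give a rough idea of the kind of change involved (this is only an
illustrative sketch with simplified frame fields and names, not the
actual patch -- see the webrev for the real implementation), the main
point is that every frame field, including the full method id, gets
run through a proper mixing step:

    #include <cstddef>
    #include <cstdint>

    // Simplified stand-in for a JFR stack frame; field names are
    // illustrative only.
    struct Frame {
      uint64_t method_id;
      uint32_t bci;
      uint8_t  type;
    };

    unsigned int stacktrace_hash(const Frame* frames, size_t nr_of_frames) {
      uint64_t h = 0;
      for (size_t i = 0; i < nr_of_frames; ++i) {
        // Fold all frame fields together, then apply a splitmix64-style
        // finalizer so that nearby method ids end up far apart in hash
        // space instead of colliding.
        uint64_t x = frames[i].method_id
                     ^ ((uint64_t)frames[i].bci << 40)
                     ^ ((uint64_t)frames[i].type << 56);
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        h = h * 31 + x;
      }
      return (unsigned int)(h ^ (h >> 32));
    }

The exact mixing function isn't important; what matters is that every
bit of every frame influences the final hash value.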

For reference, I'll also give a bit more background on the
benchmarking that I did. I was testing this change with the tip of the
JDK 14 development branch, running on a Linux virtual machine
configured with 6 cores. The benchmark program (which I unfortunately
can't share) was a somewhat simplified version of a production
workload, with 6 "hot" worker threads, a large number of distinct
stack states, and a high allocation rate. The production workload
where this issue was first encountered was using 15 cores and 15
worker threads, which I think makes the situation quite a bit worse.

I monitored the throughput of this test workload in 3 scenarios: (1)
with JFR disabled, (2) with JFR enabled, and (3) with JFR enabled and
the improved hash. I've got a chart showing the throughput rates here
(sorry for the unlabeled axes):
https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/m1_operation_rates.png

It's quite curious that the version with JFR enabled but using the
improved hash has a higher throughput rate than the version with JFR
disabled -- I can't explain that, but I _assume_ it's not directly
related to this change, and is instead due to a changed memory layout
brought on by the patch (unless there is some JFR code that is always
running in the background?). I saw the same behavior over multiple
runs, and also with an alternate (but similar) improved hash
implementation.

In any case, the profiles that I get using async-profiler tell a
clearer story. I've focused on the JFR TLAB allocation tracking in the
worker threads in the screenshots below:

itimer profile with JFR enabled:
https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/itimer_profile.png
itimer profile with JFR enabled with improved hash:
https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/itimer_profile_improved_hash.png

wallclock profile with JFR enabled:
https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/wallclock_profile.png
wallclock profile with JFR enabled with improved hash:
https://gabrielreid.github.io/dev/openjdk/jfr_improved_hash/wallclock_profile_improved_hash.png

In the itimer flamegraph screenshots the additional time spent in
JfrStackTrace::equals stands out clearly, and in the wallclock
screenshots the lock contention is quite visible.

I was a bit concerned that this behavior could be largely
coincidental, due to the specifics of the workload being tested, but
writing some other test programs with a large number of distinct deep
stacks showed the same behavior, including about 10x the number of
hash collisions for non-equal stack traces (determined by simply
logging hash collisions where the equality test returned false).
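
For what it's worth, the collision counting was nothing
sophisticated -- conceptually it was just instrumentation of the shape
below around the repository lookup (hypothetical names, not the actual
JFR code):

    #include <atomic>

    // Hypothetical sketch: count lookups where the hash matches but
    // the frame-by-frame comparison says the traces differ.
    static std::atomic<long> false_hash_matches{0};

    template <typename Trace>
    bool same_trace(const Trace& existing, const Trace& candidate) {
      if (existing.hash() != candidate.hash()) {
        return false;                      // fast path: hashes differ
      }
      if (!existing.equals(candidate)) {
        false_hash_matches.fetch_add(1);   // collision on non-equal traces
        return false;
      }
      return true;
    }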

Another interesting tidbit: I noticed that, due to the current hash
implementation (where the lower-order bits of the method id are
discarded), it's quite easy to create a test case of (almost)
identical deep stacks originating from two adjacent methods in the
same class -- in this case the hash values will be the same, but it's
a worst-case scenario in terms of determining that the two stacks are
not equal, since nearly every frame has to be compared before a
difference is found.
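
As a simplified illustration (this mirrors the shape of the problem
rather than quoting the actual JFR code, and the method ids are made
up), if a frame's contribution to the hash drops the low-order bits of
the method id, then method ids that differ only in those bits
contribute exactly the same value:

    #include <cstdint>
    #include <cstdio>

    // Illustrative only: a per-frame hash contribution that discards
    // the low-order bits of the method id.
    unsigned int frame_contribution(uint64_t method_id) {
      return (unsigned int)(method_id >> 2);
    }

    int main() {
      // Two hypothetical "adjacent" method ids, differing only in the
      // bits that get shifted away.
      uint64_t method_a = 0x100080;
      uint64_t method_b = method_a + 1;
      printf("%u %u\n",
             frame_contribution(method_a),
             frame_contribution(method_b));
      // Prints the same value twice: deep stacks rooted in these two
      // methods get identical hashes, and every lookup then falls
      // through to a full frame-by-frame comparison.
      return 0;
    }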

- Gabriel

On Thu, Jul 30, 2020 at 4:38 PM Erik Gahlin <erik.gahlin at oracle.com> wrote:
>
> Hi Gabriel,
>
> We had problems with the hash code in the past and it could probably be improved further. I wasn’t aware of significant lock contention.
>
> It would be great if you could share some code!
>
> That said, there is work being planned/discussed that will improve the situation when it comes to allocation profiling and stack traces
>
> 1. JEP 376: ZGC: Concurrent Thread-Stack Processing [1] will add a capability so we won’t need to walk the whole stack when comparing a stack trace, only up to a watermark. When this is implemented, we will probably have to redo the hash algorithm
>
> 2. We are planning/discussing a new ObjectAllocation event that will try to downsample the number of events, i.e. 50 per second. This will allow allocation sampling to be enabled by default. Today we only recommend the TLAB events for shorter periods of time, due to the number of events and the associated overhead.
>
> There are more ideas, but the above will probably (no guarantees) be integrated in the next year (if it reduces the overhead)
>
> Thanks
> Erik
>
> [1] https://openjdk.java.net/jeps/376
>
> > On 30 Jul 2020, at 12:47, Gabriel Reid <gabriel.reid at gmail.com> wrote:
> >
> > Hi,
> >
> > We recently ran into an issue of significantly reduced throughput in a
> > workload when running with JFR enabled, specifically related to the
> > enabling of allocation tracking (i.e. ObjectAllocationInNewTLAB and
> > ObjectAllocationOutsideTLAB). The workload in question is CPU-bound,
> > parallel (running on 15 cores), with a high allocation rate, and has
> > many distinct call stack states. This situation has been encountered
> > in both OpenJDK 13 and OpenJDK 14 (14.0.1+7), and can also be
> > reproduced with the current HEAD of the OpenJDK source tree.
> >
> > The symptom that we’ve encountered is that both throughput and CPU
> > utilization are lowered when allocation tracking is enabled.
> > Profiling with async-profiler, it appears that a significant amount
> > of time is spent in JfrStackTrace::equals [1], and the calls to
> > this method are performed within the context of a mutex, which
> > explains the lowered CPU utilization. Profiling in wall clock mode
> > confirms that a significant amount of time is spent waiting on the
> > mutex in JfrStackTraceRepository::add_trace [2].
> >
> > I dug into this a bit deeper, and it appears that this is largely
> > related to the hash code calculation performed for JfrStackTrace [3],
> > which appears to be very susceptible to collisions. Based on testing
> > with a reduced test workload on 6 cores, I’ve seen that replacing this
> > hash calculation with a more thorough implementation can nearly
> > eliminate the impact of allocation tracking (which otherwise results
> > in a roughly 15% drop in throughput for this test workload).
> >
> > It’s worth noting that the issue doesn’t seem to be with the
> > distribution of the hash codes over buckets, but rather with the
> > reduced cardinality of the actual hash values. Because the hash code
> > is used as a short-circuit for equality comparison [1], hash
> > collisions on stack traces of equal height result in a potentially
> > costly comparison of all frames until a non-matching frame is found.
> >
> > I’d be very happy to submit a patch for this (as well as provide
> > more background on my benchmarking if desired), but I wanted to
> > first check whether I was missing something obvious (i.e. whether
> > there’s a specific reason for the hash calculation to be done as it
> > currently is). Could you let me know if I’m missing some important
> > details, or should I go ahead with submitting a patch?
> >
> > - Gabriel
> >
> > 1. https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#l106
> > 2. https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTraceRepository.cpp#l177
> > 3. https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#l204
>

