Improvement for lowered throughput when allocation tracking is enabled

Erik Gahlin erik.gahlin at oracle.com
Thu Jul 30 14:38:17 UTC 2020


Hi Gabriel,

We have had problems with the hash code in the past, and it could probably be improved further. I wasn’t aware of significant lock contention.

It would be great if you could share some code!

That said, there is work being planned/discussed that will improve the situation when it comes to allocation profiling and stack traces:

1. JEP 376: ZGC: Concurrent Thread-Stack Processing [1] will add a capability so that we won’t need to walk the whole stack when comparing a stack trace, only the frames up to a watermark. When this is implemented, we will probably have to redo the hash algorithm (a rough sketch of the idea follows after this list).

2. We are planning/discussing a new ObjectAllocation event that will try to downsample the number of events, e.g. to 50 per second. This would allow allocation sampling to be enabled by default. Today we only recommend enabling the TLAB events for shorter periods of time, due to the number of events and the associated overhead (a sketch of the sampling idea also follows below).
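
For item 1, a rough sketch of the idea (hypothetical; the frame layout, caching scheme and names are my assumptions for illustration, not the planned implementation): frames below the watermark are known not to have changed since the previous walk, so a cached hash of that prefix can be reused and only the frames above the watermark rehashed.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Frame {
        uint64_t method_id;
        int      bci;
    };

    // Cached state for one thread's stack.
    struct StackCache {
        std::vector<Frame> frames;  // full stack, outermost frame first
        size_t   watermark;         // frames[0..watermark) are unchanged
        uint64_t hash_below;        // cached hash of the unchanged prefix
    };

    static uint64_t mix(uint64_t h, const Frame& f) {
        // boost-style hash combine of both frame fields
        h ^= f.method_id + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
        h ^= (uint64_t)f.bci + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
        return h;
    }

    // Only frames at or above the watermark are rehashed; the hash of
    // the unchanged prefix is carried over from the previous walk.
    uint64_t stack_hash(const StackCache& c) {
        uint64_t h = c.hash_below;
        for (size_t i = c.watermark; i < c.frames.size(); i++) {
            h = mix(h, c.frames[i]);
        }
        return h;
    }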

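For item 2, a minimal sketch of the kind of rate limiting being discussed (the class name, the fixed one-second window and the per-second budget are assumptions, not the eventual design):

    #include <atomic>
    #include <chrono>

    // Emit at most kEventsPerSecond allocation events per one-second window.
    class AllocationSampler {
        static const int kEventsPerSecond = 50;
        std::atomic<long long> _window_start{0};
        std::atomic<int>       _emitted{0};

    public:
        // Returns true if this allocation should produce an event.
        bool should_emit() {
            using namespace std::chrono;
            long long now = duration_cast<seconds>(
                steady_clock::now().time_since_epoch()).count();
            if (now != _window_start.load(std::memory_order_relaxed)) {
                // New window: reset the budget (benign race; fine for a sketch).
                _window_start.store(now, std::memory_order_relaxed);
                _emitted.store(0, std::memory_order_relaxed);
            }
            return _emitted.fetch_add(1, std::memory_order_relaxed) < kEventsPerSecond;
        }
    };
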
There are more ideas, but the above will probably (no guarantees) be integrated within the next year (if it reduces the overhead).

Thanks
Erik

[1] https://openjdk.java.net/jeps/376

> On 30 Jul 2020, at 12:47, Gabriel Reid <gabriel.reid at gmail.com> wrote:
> 
> Hi,
> 
> We recently ran into an issue of significantly reduced throughput in a
> workload when running with JFR enabled, specifically related to the
> enabling of allocation tracking (i.e. ObjectAllocationInNewTLAB and
> ObjectAllocationOutsideTLAB). The workload in question is CPU-bound,
> parallel (running on 15 cores), with a high allocation rate, and has
> many distinct call stack states. This situation has been encountered
> both in OpenJDK 13 and OpenJDK 14 (14.0.1+7), and can also be
> reproduced at the current HEAD of the OpenJDK source tree.
> 
> The symptom we’ve encountered is that both throughput and CPU
> utilization are lowered when allocation tracking is enabled.
> Profiling with java-async-profiler, it appears that a significant
> amount of time is spent in JfrStackTrace::equals [1], and that the
> calls to this method are performed while holding a mutex, which
> explains the lowered CPU utilization. Profiling in wall-clock mode
> confirms that a significant amount of time is spent waiting on the
> mutex in JfrStackTraceRepository::add_trace [2].
> 
> I dug into this a bit deeper, and it appears to be largely caused
> by the hash code calculation performed for JfrStackTrace [3], which
> is very susceptible to collisions. Based on testing with a reduced
> test workload on 6 cores, I’ve seen that replacing this hash
> calculation with a more thorough implementation can nearly eliminate
> the impact of allocation tracking (which otherwise causes a 15%
> reduction in throughput for this test workload).
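> 
> To give a concrete idea of what a more thorough implementation could
> look like (a simplified, hypothetical sketch, not my actual patch;
> the frame fields are stand-ins for JfrStackFrame): an FNV-1a-style
> mix over every field of every frame, so that two traces differing in
> any single frame are unlikely to share a hash value.
> 
>     #include <cstdint>
> 
>     struct Frame {              // illustrative stand-in for JfrStackFrame
>         uint64_t methodid;
>         int      bci;
>         uint8_t  type;
>     };
> 
>     // FNV-1a over all frame fields, folded down to 32 bits.
>     unsigned int hash_frames(const Frame* frames, unsigned int count) {
>         uint64_t h = 1469598103934665603ULL;      // FNV-1a offset basis
>         const uint64_t prime = 1099511628211ULL;  // FNV-1a prime
>         for (unsigned int i = 0; i < count; i++) {
>             h = (h ^ frames[i].methodid) * prime;
>             h = (h ^ (uint64_t)(uint32_t)frames[i].bci) * prime;
>             h = (h ^ frames[i].type) * prime;
>         }
>         return (unsigned int)(h ^ (h >> 32));     // fold to 32 bits
>     }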
> 
> It’s worth noting that the issue doesn’t seem to be with the
> distribution of the hash codes over buckets, but rather with the
> reduced cardinality of the actual hash values. Because the hash code
> is used as a short-circuit for equality comparison [1], a hash
> collision between stack traces of equal height results in a
> potentially costly comparison of all frames until a non-matching
> frame is found.
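> 
> Concretely, the lookup is shaped roughly like this (a simplified
> sketch with stand-in types, not the actual repository code), which
> shows why a collision between equal-height traces is expensive:
> 
>     #include <cstdint>
> 
>     struct Frame { uint64_t methodid; int bci; };
>     struct Trace {
>         unsigned int hash;
>         unsigned int nr_of_frames;
>         const Frame* frames;
>     };
> 
>     // The hash acts as a cheap short-circuit; only on a true match
>     // (or a collision) do we fall through to the per-frame walk,
>     // which runs while the repository mutex is held.
>     bool traces_equal(const Trace& a, const Trace& b) {
>         if (a.hash != b.hash) return false;
>         if (a.nr_of_frames != b.nr_of_frames) return false;
>         for (unsigned int i = 0; i < a.nr_of_frames; i++) {
>             if (a.frames[i].methodid != b.frames[i].methodid ||
>                 a.frames[i].bci != b.frames[i].bci) {
>                 return false;   // first mismatch ends the walk
>             }
>         }
>         return true;
>     }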
> 
> I’d be very happy to submit a patch for this (as well as provide more
> background on my benchmarking if desired), but I wanted to first check
> if I was missing something obvious here (i.e. if there’s a specific
> reason for the hash calculation to be done as it currently is). Could
> you let me know if I’m missing some important details here, or should
> I go ahead with submitting a patch?
> 
> - Gabriel
> 
> [1] https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#l106
> [2] https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTraceRepository.cpp#l177
> [3] https://hg.openjdk.java.net/jdk/jdk14/file/6c954123ee8d/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#l204


