RFR: 8257621: JFR StringPool misses cached items across consecutive recordings
Markus Grönlund
mgronlun at openjdk.java.net
Fri Dec 4 21:36:12 UTC 2020
On Fri, 4 Dec 2020 17:47:54 GMT, Erik Gahlin <egahlin at openjdk.org> wrote:
>> The problem scenario in more detail is:
>>
>> StringPool has a Java side cache for Strings that meet the caching requirements. jfrStringPool.hpp/cpp is the native accompaniment that is called to write strings into the constant pool if a string is submitted to the Java layer cache that does not already exist, or is submitted for a *new* epoch.
>>
>> Start recording A, at epoch 0. Emit 2 events with a String "s". This does not exist in the Java cache, so it is cached and pushed into the native layer to be emitted as a constant pool entry. Stop recording A. The event will have message=1, and the constant pool will have a java.lang.String entry: `1="s"`.
>>
>> Start recording B, still at epoch 0. Emit 2 events with a String "s". The Java layer cache still contains "s" for epoch 0, so no request is made to push the string to the native layer to be emitted as a constant pool entry. Stop recording B. The event will have message=1, but there will be no constant pool entry for `1`.
>>
>> My fix in more detail is:
>>
>> Reset the Java layer cache across the recording boundary, specifically when a recording is stopped and there are no other running recordings. I was also thinking about rotating the epoch at this boundary, but I'm not sure what impact that would have on the other JFR sub-systems.
>>
>> I have also tested this fix in the simple "concurrent recording" case where I start recording A, start recording B, stop recording A and stop recording B.
>
> I'm not sure either; Markus knows the epoch system best, but I wouldn't change it.
>
> We like to clear the cache (to reduce heap usage) when a physical recording ends, so I think your fix is fine for that reason alone. Possibly there should be a call to reset in PlatformRecorder::destroy() as well. It is, however, only called in the shutdown hook, so in practice it will probably not matter.
Hi Jie,
I think what you have run into is something that I feel a bit guilty about...
So, it used to be the case that almost every subsystem in the VM maintained its own local state. After the epoch mechanism was introduced as the general "state progression", it was possible to consolidate almost all of those local states onto the general one. This works well because all VM subsystems are notified of, or track, epoch changes...
So this leads to the StringPool located in Java. It used to be the case that the corresponding native StringPool maintained a local state, a monotonically increasing integer that the Java side read to detect when the system had progressed. When I modified it to use the epoch, I did not consider that the Java side does not get notified correctly, and a boolean wraps around far more quickly than the original integer did...bahh
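
To make the wrap-around concrete, here is a minimal sketch (field and method names are made up, not the actual StringPool code): a boolean epoch flips back to its previous value after two shifts, so a stale Java-side cache can look current again, whereas an incrementing counter never does.

    final class EpochCheckSketch {
        private static boolean cachedEpoch;      // last epoch value observed by the Java side
        private static long cachedGeneration;    // the old scheme: a monotonically increasing counter

        // A boolean comparison cannot distinguish "same epoch" from "two epoch shifts later":
        // after the epoch flips twice, the cached value matches again and stale entries survive.
        static boolean looksCurrent(boolean nativeEpoch) {
            return cachedEpoch == nativeEpoch;
        }

        // An integer comparison does not wrap in practice, so any progression is detected.
        static boolean looksCurrent(long nativeGeneration) {
            return cachedGeneration == nativeGeneration;
        }
    }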
I think the correct way to fix this is to reinstate the original integer that increments on epoch shifts, and have the Java side read it again; then there is no wrap-around that tricks the system into thinking it is still in the same epoch. This avoids more cumbersome notification logic that would otherwise have to extend into Java, and it is something we had working before.
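
Roughly, the shape I have in mind is something like the sketch below (all names are invented and the native accessor does not exist yet): the Java-side pool remembers the last generation it observed and drops its cache whenever the native counter has moved on.

    final class StringPoolGenerationSketch {
        private static long lastObservedGeneration = -1;   // synchronization omitted for brevity
        private static final java.util.Map<String, Long> cache =
            new java.util.concurrent.ConcurrentHashMap<>();

        // nativeGeneration would be read from a native counter that increments on every epoch shift
        static void ensureCurrent(long nativeGeneration) {
            if (nativeGeneration != lastObservedGeneration) {
                cache.clear();                             // stale strings must be re-emitted as constants
                lastObservedGeneration = nativeGeneration;
            }
        }
    }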
With Loom, there will be a general counter tracking the epoch / generation state, so we can easily re-route the local StringPool state to that global one when Loom comes in...
Sorry for the inconvenience, and thank you for finding this - I will suggest the code changes we could make to solve this.
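
For reference, a rough reproducer of the consecutive-recording scenario you describe could look like the sketch below (the event class and field names are made up). On a broken build, the string field of the events in the second recording cannot be resolved, because its constant pool entry was never written again.

    import java.nio.file.Path;
    import jdk.jfr.Event;
    import jdk.jfr.Recording;
    import jdk.jfr.consumer.RecordedEvent;
    import jdk.jfr.consumer.RecordingFile;

    public class ConsecutiveRecordings {
        static class MessageEvent extends Event {
            String message;
        }

        public static void main(String... args) throws Exception {
            for (int i = 0; i < 2; i++) {                  // recording A, then recording B, same epoch
                try (Recording r = new Recording()) {
                    r.start();
                    for (int j = 0; j < 2; j++) {          // emit 2 events carrying the same string
                        MessageEvent e = new MessageEvent();
                        e.message = "s";
                        e.commit();
                    }
                    r.stop();
                    Path file = Path.of("recording-" + i + ".jfr");
                    r.dump(file);
                    for (RecordedEvent re : RecordingFile.readAllEvents(file)) {
                        if (re.hasField("message")) {      // skip unrelated events in the file
                            System.out.println(i + ": " + re.getString("message"));
                        }
                    }
                }
            }
        }
    }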
Markus
-------------
PR: https://git.openjdk.java.net/jdk/pull/1576