RFR: 8279016: JFR Leak Profiler is broken with Shenandoah

Fri Jul 26 08:39:34 UTC 2024

On Thu, 25 Jul 2024 21:06:30 GMT, Erik Gahlin <egahlin at openjdk.org> wrote:

>> While testing unrelated Shenandoah patch, I caught a GC assert when Leak Profiler was running ([JDK-8337194](https://bugs.openjdk.org/browse/JDK-8337194)). 
>> 
>> Leak Profiler is notorious in using the mark words for its own needs. We have been trying to mitigate its impact on GCs by moving to separate bitsets for tracking marked objects, or by treating "marked without fwdptr" as "JFR marked" and handling it. But this is not reliable, since things like [putting indexes in mark word](https://github.com/openjdk/jdk/blob/3baff2af6a30cc6cb2e0d4391db1cf7e61c33f64/src/hotspot/share/jfr/leakprofiler/chains/edgeStore.cpp#L275-L280) sneak in. This is okay for Leak Profiler alone, since it restores the mark words after the operation completes, but that is still not enough when GC is already running.
>> 
>> I say we side-step this whack-a-mole by cleanly bailing from JFR op, when we know it is unsafe to do. I thought to use `VM_Operation::doit_prologue`, but I think GC start may sneak in between checking in prologue and op start.
>> 
>> This realistically only affects Shenandoah. All other STW collectors would never see what Leak Profiler did with mark words. ZGC would not see it, since it does not care about mark words for its own operation.
>> 
>> Additional testing:
>>  - [x] `jdk/jfr/event/oldobject/` pass by default (100x times)
>>  - [x] `jdk/jfr/event/oldobject/` pass with `-XX:+UseShenandoah` (1000x)
>
> test/jdk/jdk/jfr/event/oldobject/TestFieldInformation.java line 60:
> 
>> 58:             addToTestField();
>> 59: 
>> 60:             // Let GC catch up before we stop the recording and do the old object sample
> 
> We have worked hard to remove Thread.sleep, in the JFR tests, so we can run them in parallel, i.e. -conc:64, without introducing failures. I think the test was written before event streaming existed. The test could perhaps be rewritten to use an event stream, i.e:
> 
>     try (RecordingStream rs = new RecordingStream()) {
>       rs.enable(EventNames.OldObjectSample).withoutStackTrace().with("cutoff", "infinity");
>       rs.onEvent(e -> {
>         if (hasValidField(e)) {
>           rs.close();
>         }
>       });
>       rs.startAsync();
>       addToTestField();
>       rs.awaitTermination();
>     }

I agree using `sleep` here is not great. But I could not find a better way to do it.

When I was looking at this test, I noticed that `OldObjectSample` event is only emitted when recording is stopped: https://github.com/openjdk/jdk/blob/7f11935461a6ff1ef0fac800fb4264a519e21061/src/jdk.jfr/share/classes/jdk/jfr/internal/OldObjectSample.java#L41-L42 -- this is why I thought to add `sleep` right before we stop the recording.

So going async does not really help, because of this circularity: we only stop the recording when event is received, which we would not receive until recording is stopped. Sure enough, this times out:

    public static void main(String[] args) throws Exception {
        WhiteBox.setWriteAllObjectSamples(true);

        try (RecordingStream rs = new RecordingStream()) {
            rs.enable(EventNames.OldObjectSample).withoutStackTrace().with("cutoff", "infinity");
            rs.onEvent(e -> {
                try {
                    if (hasValidField(e)) {
                        rs.close();
                    }
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            });

            rs.startAsync();
            addToTestField();
            rs.awaitTermination();
        }
    }

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20328#discussion_r1692709095