RFR: 8289688: jfr command hangs when it processes invalid file
Erik Gahlin
egahlin at openjdk.org
Sun Jul 24 09:05:02 UTC 2022
On Sat, 23 Jul 2022 15:51:27 GMT, Erik Gahlin <egahlin at openjdk.org> wrote:
>> jfr command was in infinite loop when I passed invalid file which are in JFR repository of GraalVM Native Image (I attached the recording file to JBS).
>>
>> CPU usage was 100% when jfr hanged, and I got following thread stacks:
>>
>>
>> "main" #1 prio=5 os_prio=0 cpu=10333.35ms elapsed=10.67s tid=0x00007f4994013810 nid=10079 runnable [0x00007f499adfe000]
>> java.lang.Thread.State: RUNNABLE
>> at java.io.RandomAccessFile.seek0([java.base at 18.0.1](mailto:java.base at 18.0.1)/Native Method)
>> at java.io.RandomAccessFile.seek([java.base at 18.0.1](mailto:java.base at 18.0.1)/RandomAccessFile.java:591)
>> at jdk.jfr.internal.consumer.RecordingInput.positionPhysical([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/RecordingInput.java:102)
>> at jdk.jfr.internal.consumer.ChunkHeader.refresh([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/ChunkHeader.java:124)
>> at jdk.jfr.internal.consumer.ChunkHeader.<init>([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/ChunkHeader.java:108)
>> at jdk.jfr.internal.consumer.ChunkHeader.<init>([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/ChunkHeader.java:66)
>> at jdk.jfr.internal.consumer.ChunkParser.<init>([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/ChunkParser.java:117)
>> at jdk.jfr.internal.consumer.ChunkParser.<init>([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/ChunkParser.java:113)
>> at jdk.jfr.consumer.RecordingFile.findNext([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/RecordingFile.java:256)
>> at jdk.jfr.consumer.RecordingFile.<init>([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/RecordingFile.java:89)
>> at jdk.jfr.internal.tool.EventPrintWriter.print([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/EventPrintWriter.java:72)
>> at jdk.jfr.internal.tool.Print.execute([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/Print.java:164)
>> at jdk.jfr.internal.tool.Main.main([jdk.jfr at 18.0.1](mailto:jdk.jfr at 18.0.1)/Main.java:87)
>>
>>
>> And also I got following log with `-J-Xlog:"jfr*=info"` - chunk size was 0, and the parser hanged in processing first chunk (id = 0):
>>
>>
>> $ jfr -J-Xlog:"jfr*=info" print ~/2022_07_04_18_05_11.jfr
>> [0.353s][info][jfr,system,parser] Chunk: 0
>> [0.353s][info][jfr,system,parser] Chunk: file=/home/ysuenaga/2022_07_04_18_05_11.jfr
>> [0.353s][info][jfr,system,parser] Chunk: startPosition=0
>> [0.353s][info][jfr,system,parser] Chunk: major=2
>> [0.353s][info][jfr,system,parser] Chunk: minor=0
>> [0.353s][info][jfr,system,parser] Chunk: chunkSize=0
>> [0.353s][info][jfr,system,parser] Chunk: constantPoolPosition=0
>> [0.353s][info][jfr,system,parser] Chunk: metadataPosition=0
>> [0.353s][info][jfr,system,parser] Chunk: startNanos=0
>> [0.353s][info][jfr,system,parser] Chunk: durationNanos=0
>> [0.354s][info][jfr,system,parser] Chunk: startTicks=9410187
>> [0.354s][info][jfr,system,parser] Chunk: ticksPerSecond=1000000000
>>
>>
>> We should abort the process when invalid chunk was detected.
>
> There is a limit of 1000 iterations, unless the recordings is opened using the EventStream or RecordingStream API, which is not the case with the 'jfr' command.
>
> It was a compromise, allowing unfinished recording to be parsed using the RecordingFile API (used by the 'jfr' command) in the disk repository, but it might fail if the JVM is not able to write to the chunk header within 1-2 s.
>
> https://github.com/openjdk/jdk/blob/0599a05f8c7e26d4acae0b2cc805a65bdd6c6f67/src/jdk.jfr/share/classes/jdk/jfr/internal/consumer/RecordingInput.java#L451
> @egahlin I pushed new commit to this PR. How about this? It adds loop limit like `RecordingInput::pollWait`.
>
> I confirmed the recording file from Native Image has `0` in `flagByte` (it is called as `feature` in HotSpot) - so root cause of the hang is `metadataPosition` is zero (we can see it in UL). This change can avoid infinite loop in that case.
This will break event streaming where we might need to spin indefinitely, or at least much longer than a few seconds. If the metadata offset has not been written by the JVM, we need to wait for it. Every recording chunk must have a metadata section for it to be parsable.
In the bug you write "I passed invalid file which are in JFR repository of GraalVM Native Image".
Did you do this on a live process, or did you copy the recording file out?
If it is live, the metadata position should be written eventually, or there is a bug when it is written, and should be fixed there.
If you copied the recording file out yourself, while it is being written, and passed it to 'jfr print', that is not supported.
-------------
PR: https://git.openjdk.org/jdk/pull/9363
More information about the hotspot-jfr-dev
mailing list