generational zgc issues

Erik Osterlund erik.osterlund at oracle.com
Fri Dec 1 15:35:10 UTC 2023


Hi Alen,

I would strongly discourage trying to make sense of HotSpot internal pointer representations. They are a very private implementation detail that is very much subject to change and rather likely to change quite a bit between releases.

The offset of the address in generational ZGC pointers depends on a number of things. On some CPU architectures, there is a constant offset to the address bits, but not always. You are presumably running on x86_64, where the offset of the address bits depends on which one of the four remap bits is set, due to implementation details of the barriers, that I don’t think user code should reason about at all. On AArch64 the remap bits are typically flipped except for null, so that one is cleared and the other 3 are set.

Trying to make sense of our pointer layout is probably the closest thing to a guaranteed way of shooting yourself in the foot with a shotgun. The question isn't whether it will break, but rather how long it will take before it breaks, and on what CPU architecture it breaks. In this particular case, I don’t think it would take very long at all before any reasoning about this becomes invalid and causes the next breakage.

This kind of demonstrates why we are trying to push for encapsulating the JDK: to avoid the kind of hacks that occur when libraries silently break past abstraction layers and try to reason about how the JVM works using Unsafe. How the JVM works is fundamentally an ever-changing story, where all assumptions are ticking bombs waiting to blow up.

/Erik

On 1 Dec 2023, at 16:01, Alen Vrečko <alen.vrecko at gmail.com> wrote:

Thanks Stefan for looking into it as well.

For the address, hm, I don't get it. For the non-generational case, the method works without shaving off the colors in the address, but it doesn't work with generational.
I printed out some of the pointers.

non-generational pointers:

0000000000000000000100000000000000000111110001100000111001011000
0000000000000000000100000000000000000111110001100001001000001000
0000000000000000000100000000000000000111110001100001001000011000
0000000000000000000100000000000000000111110001100001010111001000

generational pointers:

0000000010000000000000001111001000101100000000000001010100010000
0000000010000000000000001111001000101100000000100001010100010000
0000000010000000000000001111001000101100011101100001010100010000
0000000010000000000000001111001000101100011110000001010100010000

So the 2 low-order bytes are the ZGC metadata (the barriers and colors), right?
And the high order bytes are the heap base bit (part) + offset?

I tried just shaving off the 2 low-order bytes but I am not seeing the correct results. Is there more to extracting the address than just shaving off the 2 low-order bytes?
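
A minimal sketch (not part of the original thread) for inspecting dumps like the ones above: it parses the binary strings, XORs each sample against the first one, and reports which bit positions differ within a set. The class and method names are hypothetical. It only describes these eight samples; it does not decode anything, and as Erik's reply explains, the layout is a private implementation detail that should not be relied on.

```java
import java.util.List;

final class PointerBitDiff {
    // Samples copied verbatim from the message above.
    static final List<String> NON_GENERATIONAL = List.of(
            "0000000000000000000100000000000000000111110001100000111001011000",
            "0000000000000000000100000000000000000111110001100001001000001000",
            "0000000000000000000100000000000000000111110001100001001000011000",
            "0000000000000000000100000000000000000111110001100001010111001000");

    static final List<String> GENERATIONAL = List.of(
            "0000000010000000000000001111001000101100000000000001010100010000",
            "0000000010000000000000001111001000101100000000100001010100010000",
            "0000000010000000000000001111001000101100011101100001010100010000",
            "0000000010000000000000001111001000101100011110000001010100010000");

    /** OR of (sample XOR first sample): a 1 marks a bit that differs somewhere in the set. */
    static long varyingBits(List<String> binaryDumps) {
        long first = Long.parseUnsignedLong(binaryDumps.get(0), 2);
        long mask = 0L;
        for (String dump : binaryDumps) {
            mask |= first ^ Long.parseUnsignedLong(dump, 2);
        }
        return mask;
    }

    public static void main(String[] args) {
        for (List<String> set : List.of(NON_GENERATIONAL, GENERATIONAL)) {
            long mask = varyingBits(set);
            System.out.printf("varying bits: 0x%016x, lowest varying bit: %d%n",
                    mask, Long.numberOfTrailingZeros(mask));
        }
    }
}
```

On these samples the lowest varying bit is bit 4 for the non-generational set and bit 17 for the generational set, and the low 16 bits of the four generational values are identical.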

Thanks
Alen


On Fri, 1 Dec 2023 at 13:36, Stefan Karlsson <stefan.karlsson at oracle.com> wrote:
Hi Alen,

I'm glad that you figured out what was happening. FWIW, I ran a whole bunch of tests on Alma 9.2 and couldn't reproduce any issues.

Cheers,
StefanK

On 2023-11-29 19:37, Alen Vrečko wrote:
Hi Stefan,

all good. Finally got around to it. My bad in both cases.

o) adding System.gc() solved the problem. Indeed, it is not a good idea to have expectations when working with java.lang.ref.Cleaner. Preferably don't use it at all.
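
One common way to avoid depending on when the GC gets around to processing references is to release the resource explicitly and keep the Cleaner only as a fallback. A minimal sketch, not from the thread, assuming a hypothetical NativeResource class whose cleanup action just bumps a counter in place of freeing real native memory:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

final class NativeResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Counts how many times the cleanup action has run (for demonstration only).
    static final AtomicInteger RELEASED = new AtomicInteger();

    // The cleanup action must not reference the NativeResource instance,
    // otherwise the instance could never become unreachable.
    private static final class Release implements Runnable {
        @Override
        public void run() {
            RELEASED.incrementAndGet(); // stand-in for freeing native memory
        }
    }

    private final Cleaner.Cleanable cleanable;

    NativeResource() {
        this.cleanable = CLEANER.register(this, new Release());
    }

    @Override
    public void close() {
        // Runs the action on the calling thread; repeated calls are no-ops.
        cleanable.clean();
    }

    public static void main(String[] args) {
        try (NativeResource resource = new NativeResource()) {
            // use the resource here
        }
        System.out.println("released: " + RELEASED.get());
    }
}
```

With try-with-resources the release happens at a known point, regardless of which collector is in use or when it next processes references.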

o) for the corrupted byte[], I got a chance to look into it properly rather than just speculating on log output. The issue was in the Java Object Layout (JOL) library (I used v0.10). With generational ZGC enabled, it returned something like 500K for the size of an object that should be under 100 B. This caused a failure while processing the byte[], which is why I assumed that the byte[] was corrupted. I updated the JOL library to 0.17 and it works fine now. Interestingly, JOL v0.10 appears to work fine on CentOS 7 with generational ZGC but not on Alma 9.2 with generational ZGC, on the same JDK 21.

Time to fix some bad first impressions.

Thanks
Alen

On Mon, 13 Nov 2023 at 22:21, Alen Vrečko <alen.vrecko at gmail.com> wrote:
Thanks for the fast reply Stefan.

On the reference issue: it looks like I misunderstood. It is most probably a timing issue with major collections in the toy program. For both G1 and non-generational ZGC, the counters for new Foo() and Cleaner(foo)#clean match after a short while, but not for generational ZGC. I'll add a System.gc() call in there and see what happens. Most probably a non-issue then, and a misunderstanding on my part.

On the corrupted byte[]: I will see how much time I have on my hands to look into it. As mentioned, vanilla ZGC works fine; with generational ZGC I am seeing funny stuff with byte[].

Alen

On Mon, 13 Nov 2023 at 20:28, Stefan Karlsson <stefan.karlsson at oracle.com> wrote:
Hi Alen,

On 2023-11-13 19:05, Alen Vrečko wrote:
Hello everyone,

o) young gen reference processor

A bit puzzled by reading in a thread on the list:

> mentioning that we decided to not ship a young generation reference processor for 21
Unless you made changes to ByteBuffer#allocateDirect, it uses the reference processor to free native memory. If I am not mistaken, just using a standard library API such as Files.readAllBytes will in some cases call ByteBuffer#allocateDirect internally.
Or maybe I am misunderstanding something? I made a toy program and indeed I could easily get into a situation where 20% of the reference handlers are never called.
This will cause issues for code that relies on reference handlers.

The reference processing will happen when the GC performs a major collection, which collects both the young and old generation. If you add a System.gc() you should see that the reference processor is kicking in for your program. Could you share your toy program?
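
The toy program itself is not included in the thread. A minimal sketch of that kind of test, with hypothetical names (CleanerCountDemo, Foo), might count allocations and Cleaner callbacks, request a collection with System.gc(), and then wait a bounded time for the counters to converge:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

final class CleanerCountDemo {
    static final class Foo {
        final byte[] payload = new byte[64];
    }

    /** Returns {created, cleaned} after one System.gc() and a bounded wait. */
    static long[] run(int count, long timeoutMillis) {
        Cleaner cleaner = Cleaner.create();
        AtomicLong created = new AtomicLong();
        AtomicLong cleaned = new AtomicLong();
        for (int i = 0; i < count; i++) {
            Foo foo = new Foo();
            created.incrementAndGet();
            // The action captures only the counter, never foo itself.
            cleaner.register(foo, cleaned::incrementAndGet);
        }
        System.gc(); // a request to the JVM, not a guarantee
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (cleaned.get() < created.get() && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new long[] {created.get(), cleaned.get()};
    }

    public static void main(String[] args) {
        long[] counts = run(100_000, 5_000);
        System.out.println("created=" + counts[0] + " cleaned=" + counts[1]);
    }
}
```

Whether the two counters match when the wait ends depends on the collector and on timing, which is the point of the experiment; removing the System.gc() call gives the variant in which cleanup depends on when the collector decides to run on its own.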

o) seeing weird byte[] corruption in production
On CentOS 7 Generational works fine. No issues observed. But on Alma Linux 9.2 either reading byte[] from file or sending byte[] over the network corrupts the byte[]. Didn't investigate at all. Just observed corruption in some cases for some byte[] arrays - not all - just some. On the same Alma Linux 9.2 without generational zgc no byte[] corruption is observed and everything works fine as before.

It's hard to say if this is a ZGC bug, compiler bug, OS bug, etc. Here are some suggestions for how to help pin-point the problem:
1) Could you provide the output from 'java -version'?
2) Is it possible to reproduce this with a small reproducer?
3) What CPU is this running on?
4) Does it happen with -XX:UseAVX=0?
5) Do you know the sizes of the corrupted byte[]s? Do you know the offset to where it is corrupted?
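
A small reproducer along the lines of points 2 and 5 could look like the sketch below. It is an assumption about what such a test might contain, not code from the thread: the class name, the fill pattern, and the sizes are made up. It writes a byte[] with a known pattern to a temporary file, reads it back with Files.readAllBytes, and reports the first differing offset. It can be run with and without generational ZGC (on JDK 21, -XX:+UseZGC -XX:+ZGenerational) and with -XX:UseAVX=0 to compare.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class ByteArrayRoundTrip {
    /** Deterministic fill pattern, so a corrupted region is easy to recognise. */
    static byte[] pattern(int size) {
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++) {
            data[i] = (byte) (i * 31 + 7);
        }
        return data;
    }

    /** Writes data to a temp file, reads it back, returns the first differing offset or -1. */
    static int roundTripMismatch(byte[] data) {
        try {
            Path file = Files.createTempFile("roundtrip", ".bin");
            try {
                Files.write(file, data);
                return Arrays.mismatch(data, Files.readAllBytes(file));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {100, 64 * 1024, 1024 * 1024}) {
            for (int iteration = 0; iteration < 20; iteration++) {
                int offset = roundTripMismatch(pattern(size));
                if (offset >= 0) {
                    System.out.println("size=" + size + " first mismatch at offset " + offset);
                }
            }
        }
        System.out.println("done");
    }
}
```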

StefanK

To me Generational ZGC looks more like an experimental feature for now. I am a bit surprised it doesn't require the extra flag to unlock experimental features.
Thanks
Alen





More information about the zgc-dev mailing list