Memory Corruption in JDK-21 on x86_64 with AVX-512

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Thu Oct 5 03:24:15 UTC 2023


> I am curious for folks thoughts on how we could improve confidence in 
> the implementations of vector instructions for various architectures, as 
> I would expect their use to grow both with the number intrinsics added 
> for optimizations (e.g. https://bugs.openjdk.org/browse/JDK-8292158 
> <https://bugs.openjdk.org/browse/JDK-8292158>) and as the Vector API 
> (https://openjdk.org/jeps/8315945 <https://openjdk.org/jeps/8315945>) 
> matures and gains more usage. The permutations to test here are large -- 
> OS, architecture, processor instruction support, vector size, operation 
> and intrinsic enable/disabled flags, etc. that make this expensive in 
> terms of time, power, hardware, and opportunity costs. I would like to 
> understand if there are some ways we could help provide support from the 
> community and continue to push OpenJDK forward.

That's a good question, David.

Overall, Vector API tremendously improved testability of vector support 
in C2. It became straightforward to write unit tests on individual 
operations and Vector API implementation brought a rich set of unit 
tests with it.

But JDK-8317121 is of a different flavor: it's a problem in interaction 
between different parts of the compiler. Enumerating such interactions 
with unit tests is impractical due to huge number of possible combinations.

Test generators are a much better fit for such type of testing. Teaching 
existing fuzzers about vector operations (through Vector API) looks like 
the most promising way to find similar issues in the future.

Best regards,
Vladimir Ivanov

>> On Oct 4, 2023, at 11:50, Pengfei Li <Pengfei.Li at arm.com 
>> <mailto:Pengfei.Li at arm.com>> wrote:
>>
>> 
>>
>> Hi Vladimir,____
>>
>> __ __
>>
>> I just verified this issue on AArch64 with SVE and had below findings.____
>>
>> __ __
>>
>> a) This issue is also reproducible on AArch64 with “UseSVE=1”. By 
>> setting “-XX:UseSVE=0” it does **not** appear.____
>>
>> b) This issue does **not** appear with 
>> “-XX:ArrayOperationPartialInlineSize=0”, which turns off the 
>> generation of masked load/store for some array intrinsics.____
>>
>> c) This issue does **not** appear with “-XX:-DoEscapeAnalysis” or 
>> “-XX:-EliminateAllocations”, which disable stack allocation for new 
>> arrays.____
>>
>> __ __
>>
>> According to my initial analysis, the issue seems to be caused by some 
>> incorrect Ideal() transformation of masked load/store generated for 
>> vector masked array copy. So, I have another 2 suggestions besides 
>> Sandhya’s suggestion.____
>>
>> __ __
>>
>> 1) If you have decided to set UseAVX=2 by default in the release, 
>> UseSVE should be set to 0 by default for AArch64 as well.____
>>
>> 2) An alternative workaround for the issue is setting 
>> ArrayOperationPartialInlineSize to 0 by default on x86 and AArch64. In 
>> current JDK 21, array operation partial inlining is the only place to 
>> generate masked load/store besides VectorAPI. This kind of workaround 
>> brings much less performance lost than limiting the AVX/SVE level.____
>>
>> __ __
>>
>> --____
>>
>> Thanks,____
>>
>> Pengfei____
>>
>> __ __
>>
>> *From: *hotspot-compiler-dev <hotspot-compiler-dev-retn at openjdk.org 
>> <mailto:hotspot-compiler-dev-retn at openjdk.org>> on behalf of 
>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com 
>> <mailto:sandhya.viswanathan at intel.com>>
>> *Date: *Friday, September 29, 2023 at 20:17
>> *To: *Vladimir Kozlov <vladimir.kozlov at oracle.com 
>> <mailto:vladimir.kozlov at oracle.com>>, Carter Kozak <ckozak at ckozak.net 
>> <mailto:ckozak at ckozak.net>>, Xiaohong Gong <Xiaohong.Gong at arm.com 
>> <mailto:Xiaohong.Gong at arm.com>>, hotspot-compiler-dev at openjdk.org 
>> <mailto:hotspot-compiler-dev at openjdk.org> 
>> <hotspot-compiler-dev at openjdk.org 
>> <mailto:hotspot-compiler-dev at openjdk.org>>
>> *Cc: *Tobias Hartmann <tobias.hartmann at oracle.com 
>> <mailto:tobias.hartmann at oracle.com>>, Bhateja, Jatin 
>> <jatin.bhateja at intel.com <mailto:jatin.bhateja at intel.com>>
>> *Subject: *RE: Memory Corruption in JDK-21 on x86_64 with AVX-512____
>>
>> Hi Vladimir,
>>
>> In my humble opinion we have to go with the small fix which returns 
>> nullptr in both LoadVectorMaskedNode::Ideal and 
>> StoreVectorMaskedNode::Ideal.
>> This is the only near-term solution.
>> This is for 4 reasons:
>> 1) This is how it was before the 9037 PR. Majority of other changes in 
>> the 9037 PR are for SVE and not at all applicable to x86.
>> 2) This is what is suggested by the author of the 9037 PR as the fix.
>> 3) We have a regression test from the bug reproducer so the fix can be 
>> verified. The only inconvenience is that it is not in jtreg form.
>> 4) And the most important: Setting AVX to 2 doesn’t solve the problem 
>> for original bug reporter.  The bug reproducer has been using UseAVX=3 
>> on command line and will continue to see the issue. So, we don’t fix 
>> the problem at all by setting AVX to 2, not even hide it for this 
>> reporter and who knows how many others will have this silent issue on 
>> this platform and SVE platform.
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: Vladimir Kozlov <vladimir.kozlov at oracle.com 
>> <mailto:vladimir.kozlov at oracle.com>>
>> Sent: Thursday, September 28, 2023 7:29 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com 
>> <mailto:sandhya.viswanathan at intel.com>>; Carter Kozak 
>> <ckozak at ckozak.net <mailto:ckozak at ckozak.net>>; Xiaohong Gong 
>> <Xiaohong.Gong at arm.com <mailto:Xiaohong.Gong at arm.com>>; 
>> hotspot-compiler-dev at openjdk.org <mailto:hotspot-compiler-dev at openjdk.org>
>> Cc: Tobias Hartmann <tobias.hartmann at oracle.com 
>> <mailto:tobias.hartmann at oracle.com>>; Bhateja, Jatin 
>> <jatin.bhateja at intel.com <mailto:jatin.bhateja at intel.com>>
>> Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>
>> Thank you, Sandhya, for your suggestion.
>>
>> But even backing off such not small changes may introduce instability 
>> into release which is 2 weeks away.
>>
>> Disabling avx512 is only temporary. It will be restored together with 
>> fix in next update in 3 months.
>>
>> We will have release note about that so users will not be surprised by 
>> temporary performance lost.
>>
>> And finally, I don't think a lot of people will immediately switch to 
>> JDK 21 and be affected by this temporary change.
>>
>> This will give us time to work on fix and regression test without rush.
>>
>> Regards,
>> Vladimir K
>>
>> On 9/28/23 2:47 PM, Viswanathan, Sandhya wrote:
>> > I meant 9037 (https://github.com/openjdk/jdk/pull/9037 
>> <https://github.com/openjdk/jdk/pull/9037>). The PR was for SVE and 
>> not at all related to x86 architecture.
>> > 
>> > Best Regards,
>> > Sandhya
>> > 
>> > -----Original Message-----
>> > From: Viswanathan, Sandhya
>> > Sent: Thursday, September 28, 2023 2:25 PM
>> > To: Vladimir Kozlov <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>; 
>> Carter Kozak
>> > <ckozak at ckozak.net <mailto:ckozak at ckozak.net>>; Xiaohong Gong 
>> <Xiaohong.Gong at arm.com <mailto:Xiaohong.Gong at arm.com>>;
>> > hotspot-compiler-dev at openjdk.org <mailto:hotspot-compiler-dev at openjdk.org>
>> > Cc: Tobias Hartmann <tobias.hartmann at oracle.com <mailto:tobias.hartmann at oracle.com>>; 
>> Bhateja, Jatin
>> > <Jatin.Bhateja at intel.com <mailto:Jatin.Bhateja at intel.com>>
>> > Subject: RE: Memory Corruption in JDK-21 on x86_64 with AVX-512
>> > 
>> > Could we not back out 9307 instead, the offending PR causing the issue? Even the alternate fix discussed by Xiaohong is small. JDK 21 performance for many real-world applications on x86_64 will be much worse if we set AVX=2. This is not same as JDK 9.
>> > 
>> > Best Regards,
>> > Sandhya
>> > 
>> > -----Original Message-----
>> > From: Vladimir Kozlov <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>>
>> > Sent: Thursday, September 28, 2023 1:43 PM
>> > To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com <mailto:sandhya.viswanathan at intel.com>>; 
>> Carter Kozak
>> > <ckozak at ckozak.net <mailto:ckozak at ckozak.net>>; Xiaohong Gong 
>> <Xiaohong.Gong at arm.com <mailto:Xiaohong.Gong at arm.com>>;
>> > hotspot-compiler-dev at openjdk.org <mailto:hotspot-compiler-dev at openjdk.org>
>> > Cc: Tobias Hartmann <tobias.hartmann at oracle.com <mailto:tobias.hartmann at oracle.com>>; 
>> Bhateja, Jatin
>> > <jatin.bhateja at intel.com <mailto:jatin.bhateja at intel.com>>
>> > Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>> > 
>> > Hi,
>> > 
>> > After discussion we decided to postpone fix to 21.0.2.
>> > There is no much time left before 21.0.1 is released and we still don't have regression test and did not review or fully tested the fix.
>> > 
>> > But because of severity of the issue and to be safe we decided to set UseAVX == 2 by default even on AVX512 CPUs in 21.0.1. And add release note about this. It is temporary solution until next update release 21.0.2. We did similar thing (disabled avx512 instructions) in JDK 9.
>> > 
>> > I am also thinking about setting UseSVE to `false` by default on Aarch64 with SVE since it could be affected too. We don't have such machine so I can't verify the fix.
>> > 
>> > Meanwhile I will work on test and fix for JDK 22. And I hope Xiaohong will help, when he is back from break, with SVE testing.
>> > 
>> > Regards,
>> > Vladimir K
>> > 
>> > On 9/27/23 11:32 AM, Vladimir Kozlov wrote:
>> >> https://bugs.openjdk.org/browse/JDK-8317121 
>> <https://bugs.openjdk.org/browse/JDK-8317121>
>> >>
>> >> On 9/27/23 11:19 AM, Vladimir Kozlov wrote:
>> >>> Officially jdk21.0.1 is closed for fixes but we are discussing this 
>> >>> issue now.
>> >>>
>> >>> I verified the suggested by Xiaohong fix:
>> >>> retrun nullptr in StoreVectorNode::Ideal() and LoadVectorNode::Ideal().
>> >>>
>> >>> I will file a bug and follow the process.
>> >>>
>> >>> Regards,
>> >>> Vladimir K
>> >>>
>> >>> On 9/27/23 9:54 AM, Viswanathan, Sandhya wrote:
>> >>>> Copying Tobias.
>> >>>>
>> >>>>   From what I see JDK21 update schedule has a mid-October release 
>> >>>> plan (https://wiki.openjdk.org/display/JDKUpdates/JDK+21u 
>> <https://wiki.openjdk.org/display/JDKUpdates/JDK+21u>).
>> >>>>
>> >>>> Tobias, do you know the code freeze date for that. Hopefully, we 
>> >>>> still have a couple of days left.
>> >>>>
>> >>>> At the minimum we should revert to return nullptr in both 
>> >>>> LoadVectorMaskedNode::Ideal and StoreVectorMaskedNode::Ideal as a 
>> >>>> workaround.
>> >>>>
>> >>>> Best Regards,
>> >>>> Sandhya
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: hotspot-compiler-dev <hotspot-compiler-dev-retn at openjdk.org 
>> <mailto:hotspot-compiler-dev-retn at openjdk.org>>
>> >>>> On Behalf Of Carter Kozak
>> >>>> Sent: Wednesday, September 27, 2023 7:26 AM
>> >>>> To: Xiaohong Gong <Xiaohong.Gong at arm.com <mailto:Xiaohong.Gong at arm.com>>;
>> >>>> hotspot-compiler-dev at openjdk.org <mailto:hotspot-compiler-dev at openjdk.org>
>> >>>> Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>> >>>>
>> >>>> Thank you for such a quick diagnosis, I really appreciate it!
>> >>>>
>> >>>> While I don't have a good sense of the mapping between bug severity 
>> >>>> and backport/release criteria, I suspect this issue may be a strong 
>> >>>> candidate for an out-of-band jdk-21 release. I'm certainly biased 
>> >>>> because I had invalid bytes silently persisted, and defer to folks 
>> >>>> closer to the process.
>> >>>>
>> >>>> Best,
>> >>>> Carter Kozak
>> >>>>
>> >>>> On Wed, Sep 27, 2023, at 02:57, Xiaohong Gong wrote:
>> >>>>> Hi Carter,
>> >>>>> Thanks for reporting this issue!
>> >>>>> It seems the issue is caused by the calling of the super class’s 
>> >>>>> “Ideal()” method in “LoadVectorMaskedNode::Ideal()” (code is:
>> >>>>> 8286941: Add mask IR for partial vector operations for ARM SVE by 
>> >>>>> XiaohongGong · Pull Request #9037 · openjdk/jdk (github.com <http://github.com>)
>> >>>>> <https://github.com/openjdk/jdk/pull/9037/files#diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028 <https://github.com/openjdk/jdk/pull/9037/files#diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028>>).
>> >>>>> Some optimization applied on “LoadNode” is not suitable for 
>> >>>>> “LoadVectorMaskedNode” now. Reverting it back to “return nullptr”
>> >>>>> can make your benchmark pass.
>> >>>>> I will look at this issue and try to fix it next. Thanks!____
>>


More information about the hotspot-compiler-dev mailing list