[External] : Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
Xiaohong Gong
Xiaohong.Gong at arm.com
Fri Oct 6 10:02:05 UTC 2023
Thanks all for the quick fixing and verifying!
Best Regards,
Xiaohong
-----Original Message-----
From: Vladimir Kozlov <vladimir.kozlov at oracle.com>
Sent: Thursday, October 5, 2023 2:08 PM
To: Pengfei Li <Pengfei.Li at arm.com>; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Carter Kozak <ckozak at ckozak.net>; Xiaohong Gong <Xiaohong.Gong at arm.com>; hotspot-compiler-dev at openjdk.org
Cc: Tobias Hartmann <tobias.hartmann at oracle.com>; Bhateja, Jatin <jatin.bhateja at intel.com>; nd <nd at arm.com>
Subject: Re: [External] : Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
Thank you, Pengfei, for verifying the fix.
Vladimir K
On 10/4/23 8:16 PM, Pengfei Li wrote:
> Just verified it on AArch64 SVE. It looks good. Thank you for the quick fix.
>
> --
>
> Thanks,
>
> Pengfei
>
> *From: *Vladimir Kozlov <vladimir.kozlov at oracle.com>
> *Date: *Thursday, October 5, 2023 at 00:59
> *To: *Pengfei Li <Pengfei.Li at arm.com>, Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>, Carter Kozak <ckozak at ckozak.net>,
> Xiaohong Gong <Xiaohong.Gong at arm.com>,
> hotspot-compiler-dev at openjdk.org <hotspot-compiler-dev at openjdk.org>
> *Cc: *Tobias Hartmann <tobias.hartmann at oracle.com>, Bhateja, Jatin
> <jatin.bhateja at intel.com>, nd <nd at arm.com>
> *Subject: *Re: [External] : Re: Memory Corruption in JDK-21 on x86_64
> with AVX-512
>
> Thank you, Pengfei
>
> We investigate and found the cause [1]. I pushed fix into JDK 22 and I
> just pushed backport into 21u.
>
> Please, try it with SVE Aarch64.
>
> thanks,
> Vladimir K
>
> [1] https://bugs.openjdk.org/browse/JDK-8317121
> <https://bugs.openjdk.org/browse/JDK-8317121>
>
> On 10/4/23 8:08 AM, Pengfei Li wrote:
>> Hi Vladimir,
>>
>> I just verified this issue on AArch64 with SVE and had below findings.
>>
>> a) This issue is also reproducible on AArch64 with “UseSVE=1”. By
>> setting “-XX:UseSVE=0” it does **not** appear.
>>
>> b) This issue does **not** appear with
>> “-XX:ArrayOperationPartialInlineSize=0”, which turns off the
>> generation of masked load/store for some array intrinsics.
>>
>> c) This issue does **not** appear with “-XX:-DoEscapeAnalysis” or
>> “-XX:-EliminateAllocations”, which disable stack allocation for new arrays.
>>
>> According to my initial analysis, the issue seems to be caused by
>> some incorrect Ideal() transformation of masked load/store generated
>> for vector masked array copy. So, I have another 2 suggestions
>> besides Sandhya’s suggestion.
>>
>> 1) If you have decided to set UseAVX=2 by default in the release,
>> UseSVE should be set to 0 by default for AArch64 as well.
>>
>> 2) An alternative workaround for the issue is setting
>> ArrayOperationPartialInlineSize to 0 by default on x86 and AArch64.
>> In current JDK 21, array operation partial inlining is the only place
>> to generate masked load/store besides VectorAPI. This kind of
>> workaround brings much less performance lost than limiting the AVX/SVE level.
>>
>> --
>>
>> Thanks,
>>
>> Pengfei
>>
>> *From: *hotspot-compiler-dev <hotspot-compiler-dev-retn at openjdk.org>
>> on behalf of Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> *Date: *Friday, September 29, 2023 at 20:17
>> *To: *Vladimir Kozlov <vladimir.kozlov at oracle.com>, Carter Kozak
>> <ckozak at ckozak.net>, Xiaohong Gong <Xiaohong.Gong at arm.com>,
>> hotspot-compiler-dev at openjdk.org <hotspot-compiler-dev at openjdk.org>
>> *Cc: *Tobias Hartmann <tobias.hartmann at oracle.com>, Bhateja, Jatin
>> <jatin.bhateja at intel.com>
>> *Subject: *RE: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>
>> Hi Vladimir,
>>
>> In my humble opinion we have to go with the small fix which returns
>> nullptr in both LoadVectorMaskedNode::Ideal and
>> StoreVectorMaskedNode::Ideal.
>> This is the only near-term solution.
>> This is for 4 reasons:
>> 1) This is how it was before the 9037 PR. Majority of other changes
>> in the 9037 PR are for SVE and not at all applicable to x86.
>> 2) This is what is suggested by the author of the 9037 PR as the fix.
>> 3) We have a regression test from the bug reproducer so the fix can
>> be verified. The only inconvenience is that it is not in jtreg form.
>> 4) And the most important: Setting AVX to 2 doesn’t solve the problem
>> for original bug reporter. The bug reproducer has been using
>> UseAVX=3 on command line and will continue to see the issue. So, we
>> don’t fix the problem at all by setting AVX to 2, not even hide it
>> for this reporter and who knows how many others will have this silent
>> issue on this platform and SVE platform.
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: Vladimir Kozlov <vladimir.kozlov at oracle.com>
>> Sent: Thursday, September 28, 2023 7:29 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Carter
>> Kozak <ckozak at ckozak.net>; Xiaohong Gong <Xiaohong.Gong at arm.com>;
>> hotspot-compiler-dev at openjdk.org
>> Cc: Tobias Hartmann <tobias.hartmann at oracle.com>; Bhateja, Jatin
>> <jatin.bhateja at intel.com>
>> Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>
>> Thank you, Sandhya, for your suggestion.
>>
>> But even backing off such not small changes may introduce instability
>> into release which is 2 weeks away.
>>
>> Disabling avx512 is only temporary. It will be restored together with
>> fix in next update in 3 months.
>>
>> We will have release note about that so users will not be surprised
>> by temporary performance lost.
>>
>> And finally, I don't think a lot of people will immediately switch to
>> JDK 21 and be affected by this temporary change.
>>
>> This will give us time to work on fix and regression test without rush.
>>
>> Regards,
>> Vladimir K
>>
>> On 9/28/23 2:47 PM, Viswanathan, Sandhya wrote:
>>> I meant 9037 (https://github.com/openjdk/jdk/pull/9037
> <https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/9037_
> _;!!ACWV5N9M2RV99hQ!NDSFggTfqw7XvQlMaHlHqWEyiZhReOX54ipPlPTJJYacXLPEyJ
> N9m_7SzIsdLbSbDYxpqMZJ8ANLbaTzU5IfabE$>
>> <https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/9037__;!!ACWV5N9M2RV99hQ!M8EKHUMHb9bo84KGR41CgFwDjxsQ3fqDPf4mkD64GNZPu0IIrzlQ-WoVxCd5d84Exc77lqyWJzWCX1fudlBGJPU$ <https://urldefense.com/v3/__https:/github.com/openjdk/jdk/pull/9037__;!!ACWV5N9M2RV99hQ!M8EKHUMHb9bo84KGR41CgFwDjxsQ3fqDPf4mkD64GNZPu0IIrzlQ-WoVxCd5d84Exc77lqyWJzWCX1fudlBGJPU$>>). The PR was for SVE and not at all related to x86 architecture.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>> -----Original Message-----
>>> From: Viswanathan, Sandhya
>>> Sent: Thursday, September 28, 2023 2:25 PM
>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; Carter Kozak
>>> <ckozak at ckozak.net>; Xiaohong Gong <Xiaohong.Gong at arm.com>;
>>> hotspot-compiler-dev at openjdk.org
>>> Cc: Tobias Hartmann <tobias.hartmann at oracle.com>; Bhateja, Jatin
>>> <Jatin.Bhateja at intel.com>
>>> Subject: RE: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>>
>>> Could we not back out 9307 instead, the offending PR causing the issue? Even the alternate fix discussed by Xiaohong is small. JDK 21 performance for many real-world applications on x86_64 will be much worse if we set AVX=2. This is not same as JDK 9.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>> Sent: Thursday, September 28, 2023 1:43 PM
>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Carter
>>> Kozak <ckozak at ckozak.net>; Xiaohong Gong <Xiaohong.Gong at arm.com>;
>>> hotspot-compiler-dev at openjdk.org
>>> Cc: Tobias Hartmann <tobias.hartmann at oracle.com>; Bhateja, Jatin
>>> <jatin.bhateja at intel.com>
>>> Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>>
>>> Hi,
>>>
>>> After discussion we decided to postpone fix to 21.0.2.
>>> There is no much time left before 21.0.1 is released and we still don't have regression test and did not review or fully tested the fix.
>>>
>>> But because of severity of the issue and to be safe we decided to set UseAVX == 2 by default even on AVX512 CPUs in 21.0.1. And add release note about this. It is temporary solution until next update release 21.0.2. We did similar thing (disabled avx512 instructions) in JDK 9.
>>>
>>> I am also thinking about setting UseSVE to `false` by default on Aarch64 with SVE since it could be affected too. We don't have such machine so I can't verify the fix.
>>>
>>> Meanwhile I will work on test and fix for JDK 22. And I hope Xiaohong will help, when he is back from break, with SVE testing.
>>>
>>> Regards,
>>> Vladimir K
>>>
>>> On 9/27/23 11:32 AM, Vladimir Kozlov wrote:
>>>> https://bugs.openjdk.org/browse/JDK-8317121
> <https://bugs.openjdk.org/browse/JDK-8317121>
>> <https://bugs.openjdk.org/browse/JDK-8317121
> <https://bugs.openjdk.org/browse/JDK-8317121>>
>>>>
>>>> On 9/27/23 11:19 AM, Vladimir Kozlov wrote:
>>>>> Officially jdk21.0.1 is closed for fixes but we are discussing
>>>>> this issue now.
>>>>>
>>>>> I verified the suggested by Xiaohong fix:
>>>>> retrun nullptr in StoreVectorNode::Ideal() and LoadVectorNode::Ideal().
>>>>>
>>>>> I will file a bug and follow the process.
>>>>>
>>>>> Regards,
>>>>> Vladimir K
>>>>>
>>>>> On 9/27/23 9:54 AM, Viswanathan, Sandhya wrote:
>>>>>> Copying Tobias.
>>>>>>
>>>>>> From what I see JDK21 update schedule has a mid-October release
>>>>>>plan (https://wiki.openjdk.org/display/JDKUpdates/JDK+21u
>> <https://wiki.openjdk.org/display/JDKUpdates/JDK+21u
> <https://wiki.openjdk.org/display/JDKUpdates/JDK+21u>>).
>>>>>>
>>>>>> Tobias, do you know the code freeze date for that. Hopefully, we
>>>>>> still have a couple of days left.
>>>>>>
>>>>>> At the minimum we should revert to return nullptr in both
>>>>>> LoadVectorMaskedNode::Ideal and StoreVectorMaskedNode::Ideal as a
>>>>>> workaround.
>>>>>>
>>>>>> Best Regards,
>>>>>> Sandhya
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: hotspot-compiler-dev
>>>>>> <hotspot-compiler-dev-retn at openjdk.org>
>>>>>> On Behalf Of Carter Kozak
>>>>>> Sent: Wednesday, September 27, 2023 7:26 AM
>>>>>> To: Xiaohong Gong <Xiaohong.Gong at arm.com>;
>>>>>> hotspot-compiler-dev at openjdk.org
>>>>>> Subject: Re: Memory Corruption in JDK-21 on x86_64 with AVX-512
>>>>>>
>>>>>> Thank you for such a quick diagnosis, I really appreciate it!
>>>>>>
>>>>>> While I don't have a good sense of the mapping between bug
>>>>>> severity and backport/release criteria, I suspect this issue may
>>>>>> be a strong candidate for an out-of-band jdk-21 release. I'm
>>>>>> certainly biased because I had invalid bytes silently persisted,
>>>>>> and defer to folks closer to the process.
>>>>>>
>>>>>> Best,
>>>>>> Carter Kozak
>>>>>>
>>>>>> On Wed, Sep 27, 2023, at 02:57, Xiaohong Gong wrote:
>>>>>>> Hi Carter,
>>>>>>> Thanks for reporting this issue!
>>>>>>> It seems the issue is caused by the calling of the super class’s
>>>>>>> “Ideal()” method in “LoadVectorMaskedNode::Ideal()” (code is:
>>>>>>> 8286941: Add mask IR for partial vector operations for ARM SVE
>>>>>>> by XiaohongGong · Pull Request #9037 · openjdk/jdk (github.com)
>>>>>>> <https://github.com/openjdk/jdk/pull/9037/files#diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028 <https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/9037/files*diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028__;Iw!!ACWV5N9M2RV99hQ!NDSFggTfqw7XvQlMaHlHqWEyiZhReOX54ipPlPTJJYacXLPEyJN9m_7SzIsdLbSbDYxpqMZJ8ANLbaTzQeAsMaU$> <https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/9037/files*diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028__;Iw!!ACWV5N9M2RV99hQ!M8EKHUMHb9bo84KGR41CgFwDjxsQ3fqDPf4mkD64GNZPu0IIrzlQ-WoVxCd5d84Exc77lqyWJzWCX1fuKTlF184$ <https://urldefense.com/v3/__https:/github.com/openjdk/jdk/pull/9037/files*diff-692826251cae892bc4737919579c6afbd317551cd507f99c7bd29d585c1282e2R1028__;Iw!!ACWV5N9M2RV99hQ!M8EKHUMHb9bo84KGR41CgFwDjxsQ3fqDPf4mkD64GNZPu0IIrzlQ-WoVxCd5d84Exc77lqyWJzWCX1fuKTlF184$>>>).
>>>>>>> Some optimization applied on “LoadNode” is not suitable for
>>>>>>> “LoadVectorMaskedNode” now. Reverting it back to “return nullptr”
>>>>>>> can make your benchmark pass.
>>>>>>> I will look at this issue and try to fix it next. Thanks!
>>
>
More information about the hotspot-compiler-dev
mailing list