[jdk16] RFR: 8227695: assert(pss->trim_ticks().seconds() == 0.0) failed: Unexpected partial trimming during evacuation
Thomas Schatzl
tschatzl at openjdk.java.net
Thu Jan 21 18:22:16 UTC 2021
On Thu, 21 Jan 2021 11:28:55 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:
>> Hi all,
>>
>> can I have reviews for this change that stabilizes an assert checking that no "trimming" happened during evacuation.
>>
>> We found that only on Windows Server 2012 and 2016 (not 2019) on many AMD Epyc machines sometimes
>>
>> `pss->trim_ticks().seconds() == 0.0`
>>
>> fails on random tests. The `seconds()` method is
>>
>> `return (double)value * ((double)unit / (double)TimeSource::frequency());`
>>
>> where `value` is always zero, and `unit` and `TimeSource::frequency()` are constant integers, i.e.
>>
>> `(double) 0 * ((double) 1 / (double) 1000...000)`
>>
>> does not equal `0.0`.
>>
>> Code like this:
>>
>> double tt = pss->trim_ticks().seconds();
>> assert(tt == 0.0, ".... %2.f " PTR_FORMAT, tt, julong_cast(tt));
>> gives something like:
>>
>> `assert(tt == 0.0," .... 0.0 0x00000....0000"`
>>
>> so somehow the bit pattern 0x00...000 does not compare equal to FP 0.0.
>>
>> I've investigated this quite a bit (littering the code with this assert) with no particular result, except that it seems to have something to do with the `QueryPerformanceCounter()` call: most of the time the assert fires right after taking the time.
>> Dumping FP+XMM register state (via `fxsave`) right after the comparison `tt == 0.0` goes wrong did not yield anything (to me) obviously wrong (still `val1` and `val2` of this `Tickspan` are zero).
>>
>> There is no known issue with release code (crashes in this or other particular locations), just that the failures are very annoying in the CI.
>>
>> The fix changes the FP comparison to an integer comparison, which is what should have been done initially. There is also precedent in the code that does exactly this integer comparison for the same reason, and it has never failed so far. However, I/we could not explain why binary 0x00..00 is not always FP "0.0".
>>
>> Testing: no assertion failure in a total of 8k iterations of two tests that seemed to trigger this issue more often than usual (4k runs with this patch, 4k runs with this assert duplicated all over the place); hs-tier1-5.
>
> I've been following Thomas's long investigation of this, and this change looks good to me.
Thanks @kimbarrett @fisk @walulyai for your reviews.
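
For readers of the archive, the idea behind the fix (compare the raw integer tick count instead of the derived `double`) might be sketched roughly as follows. This is an illustration only, not the patch itself: `TickSpan`, `no_trimming_happened` and `double_bits` are hypothetical names, not the HotSpot `Tickspan` API, and the conversion assumes a unit of 1 and a positive frequency.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a tick span; NOT the HotSpot Tickspan class.
struct TickSpan {
  int64_t ticks;      // raw counter delta
  int64_t frequency;  // ticks per second, assumed to be > 0

  int64_t value() const { return ticks; }

  // Same shape as the conversion quoted above, with the unit assumed to be 1.
  double seconds() const {
    return (double)ticks * (1.0 / (double)frequency);
  }
};

// Exact integer test: no floating-point arithmetic or comparison is involved,
// so the result does not depend on the FP environment.
inline bool no_trimming_happened(const TickSpan& span) {
  return span.value() == 0;
}

// Returns the raw bit pattern of a double, e.g. for diagnostic output
// next to a failing floating-point comparison.
inline uint64_t double_bits(double d) {
  static_assert(sizeof(uint64_t) == sizeof(double), "expects a 64-bit double");
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}
```

An assert would then read something like `assert(no_trimming_happened(span), "...")` rather than comparing `span.seconds()` against `0.0`.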
-------------
PR: https://git.openjdk.java.net/jdk16/pull/128
More information about the hotspot-gc-dev
mailing list