RFR: 8280476: [macOS] : hotspot arm64 bug exposed by latest clang

Thu Feb 3 04:37:05 UTC 2022

On Wed, 2 Feb 2022 20:11:29 GMT, Daniel D. Daugherty <dcubed at openjdk.org> wrote:

>> First thing to note is that `replicate` is never actually called with `count == 0`. It's questionable whether it ever should be (especially as this function is really only meant to be used locally to this code) but logically I think the result ought to be 0 if it ever does get called that way.
>> 
>> What `replicate` is meant to do is replicate the bottom `nbits` bits of the `uint64_t` passed via argument `bits` across successive subfields of `result` up to `(count * bits)` worth of bits (sorry ... the names are somewhat inconveniently chosen as far as this discussion is concerned). So, anyway, a count of zero ought to insert 0 bits giving a zero result.
>> 
>> As to how it is used, `replicate` is only called from method `expandLogicalImmediate` with two alternative calling patterns. It is sometimes called like this
>> 
>>     for (int i = 0; i < 6; i++) {
>>       int nbits = 1 << i;
>>       . . .
>>         replicate(and_bit, 1, nbits)
>> 
>> n.b. argument `count` is being supplied as `nbits` and argument `nbits` is supplied as 1 (yeah, naming is hard). These calls serve to insert anywhere between 1 and 32 copies of a single bit into the rightmost portion of the result.
>> 
>> It is also called like this:
>> 
>>     for (int i = 0; i < 6; i++) {
>>       int nbits = 1 << i;
>>       . . .
>>           replicate(and_bits_top, 2 * nbits, 32 / nbits)
>> 
>> This is used to replicate a bit pattern taken from the lower end of the first input across the whole of a uint64_t result. For these calls the input arguments `count` and `nbits` satisfy the following invariants:
>> 
>>     nbits | 64
>>     count | 64
>>     nbits * count == 64
>>     2 <= nbits <= 32
>>     32 >= count >= 2
>> 
>> I am not actually clear why the caller is calling `replicate` in the way it does. The algorithm is way more complicated than it actually needs to be. The basic idea is to insert imms 1s into a 2^k bit field, right rotate them by immr and then replicate the result across the full 64 bits. There's a few wrinkles which are explained here for those who are interested:
>> 
>> expandLogicalImmediate(uint32_t immN, uint32_t immr, uint32_t imms, uint64_t& bimm)
>> 
>> -  outputs a bit pattern of width 64 by replicating a bit field of width 2^k and returns 1 or fails and returns 0
>> 
>> - when immN is 1 then k is 6 and immr/imms are masked to 6 bit integers.
>> - when immN is 0 then k is the count of the first 0 bit in imms and immr/imms are masked to k-bit integers (i.e. leading 1s in imms determine dead bits of imms and immr)
>> 
>> A k value of 0 (i.e. immN == 0 and imms = all 1s) fails and returns 0
>> 
>> After masking
>> 
>> An imms value of all 1s (i.e. 2^k - 1) fails and returns 0 (note given the previous check we only get here when  immN == 1)
>> 
>> The bit field can now be constructed as follows:
>> 
>>  - imms + 1 specifies the number of 1s to insert into the 2^k-field (note that imms + 1 < 2^k)
>>  - immr counts how far to right rotate the 2^k-field
>> 
>> The resulting 2^k-field is then replicated across the full 64-bit result.
>
> @adinn - Thanks for chiming in on this review thread.
> 
> I'm reading and re-reading your comments and it will take me a while
> to put it all in proper context.
> 
> I do agree with one of your points:
> 
>> `replicate` is never actually called with `count == 0`
> 
> so that means that this:
> 
>     `assert(count <= 1, "must be");`
> 
> is a little too loose and should change to:
> 
>     `assert(count == 1, "must be");`
> 
> if we're going to keep it at all. I think we should keep it, but
> I can be convinced otherwise since the original code had no
> similar `assert()`.
> 
> @kimbarrett - I've re-read your comments and I'm not clear
> on whether you think that `result <<= nbits` is UB when
> `nbits == 64`. Can you please clarify?
> 
> Of course, it's not helping matters that I can't reproduce the
> failure mode reported by Apple and we're getting no response
> about our request for version info.

It's definitely UB according to the language.  Then there's the question of what the hardware might actually implement in the cases where the compiler doesn't recognize it and the relevant instruction(s) are executed with that value.  I know I've seen hardware that used the shift quantity mod word-size.  I think I've seen hardware that would just shift out all the bits, leaving a value of 0.  One reason for the language to make this UB is because of hardware variance.  Oh, and I *think* I've seen hardware where it was UB and would produce "strange" results.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7270