RFR: 8299546: C2: MulLNode::mul_ring() wrongly returns bottom type due to casting errors with large numbers [v2]

Tue Jan 10 16:58:04 UTC 2023

On Tue, 10 Jan 2023 14:27:39 GMT, Christian Hagedorn <chagedorn at openjdk.org> wrote:

>> The current logic in `MulLNode::mul_ring()` casts all `jlong` values of the involved type ranges of a multiplication to `double` in order to catch overflows when multiplying the two type ranges. This works fine for values in the `jlong` range that are not larger than 2<sup>53</sup> or lower than -2<sup>53</sup>. For numbers outside that range, we could experience precision errors because these numbers cannot be represented precisely due to the nature of how doubles are represented with a 52 bit mantissa. For example, the number 2<sup>53</sup> and 2<sup>53</sup> + 1 both have the same `double` representation of 2<sup>53</sup>.
>> 
>> In `MulLNode::mul_ring()`, we could do a multiplication with a `lo` or `hi` value of a type that is larger than  2<sup>53</sup> (or smaller than -2<sup>53</sup>). In this case, we might get a different result compared to doing the same multiplication with `jlong` values (even though there is no overflow/underflow). As a result, we return `TypeLong::LONG` (bottom type) and missed an optimization opportunity (e.g. folding an `If` when using the `MulL` node in the condition etc.).
>> 
>> This was caught by the new verification code added to CCP in [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197) which checks that after CCP, we should not get a different type anymore when calling `Value()` on a node. In the found fuzzer testcase, we run into the precision problem described above for a `MulL` node and set the type to bottom during CCP (even though there was no actual overflow). Since the type is bottom, we do not re-add the node to the CCP worklist because the premise is that types only go from top to bottom during CCP. Afterwards, an input type of the `MulL` node is updated again in such a way that the previously imprecise `double` multiplication in `mul_ring()` is now exact (by coincidence). We then hit the "missed optimization opportunity" assert added by JDK-8257197.
>> 
>> To fix this problem, I suggest to switch from a `jlong` - > `double` multiplication overflow check to an overflow check without casting. I've used the idea that `x = a * b` is the same as `b = x / a` (for `a != 0` and `!(a = -1 && b = MIN_VALUE)`)  which is also applied in `Math.multiplyExact()`: https://github.com/openjdk/jdk/blob/66db0bb6a15310e4e60ff1e33d40e03c52c4eca8/src/java.base/share/classes/java/lang/Math.java#L1022-L1036
>> 
>> The code of `MulLNode::mul_ring()` is almost identical to `MulINode::mul_ring()`. I've refactored that into a template class in order to share the code and simplified the overflow checking by using `MIN/MAX4` instead of using nested `if/else` statements.
>> 
>> Thanks,
>> Christian
>
> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision:
> 
>   review

Today's processors support 64-to-128-bit multiply in just a few cycles.  This is a useful operation here and elsewhere.  We should bite the bullet and arrange to make it available in HotSpot.  It would make this particular problem a little simpler to solve.

Here is a good external example of it being done well in fast, portable code:

https://github.com/Cyan4973/xxHash/blob/8e5fdcbe70687573265b7154515567ee7ca0645c/xxh3.h#L294

Here is an example (not to be followed literally) of code that has recently worked well for me on both x86 and ARM:

typedef __int128 uint128_t;
inline uint128_t make_uint128(uint64_t hi, uint64_t lo) {
  uint128_t hi128 = hi, lo128 = lo;
  if ((hi128 | lo128) >> 64)  abort();  // please no sign extension bugs
  return (hi128 << 64) | lo128;
}
inline void take_uint128(uint64_t& hi_out, uint64_t& lo_out, uint128_t x) {
  hi_out = (uint64_t)(x >> 64);
  lo_out = (uint64_t)(x >> 0);
}
inline void full_multiply(uint64_t& hi_out, uint64_t& lo_out, uint64_t x, uint64_t y) {
  take_uint128(hi_out, lo_out, make_uint128(0, x) * make_uint128(0, y));
}

There should be both signed and unsigned versions of this or a similar primitive.

I suggest making the two-output thing (full-multiply, signed or unsigned) to be the primitive for HotSpot, not the mul-hi intrinsic that Java has (high-half of full-multiply, signed or unsigned).  Java tilts strongly towards one-output primitives because we don't have Project Valhalla yet, but C++ has no such restriction.

-------------

PR: https://git.openjdk.org/jdk/pull/11907