RFR: JDK-8289551: Conversions between bit representations of half precision values and floats

Wed Jul 13 14:18:12 UTC 2022

On Fri, 8 Jul 2022 06:11:22 GMT, Joe Darcy <darcy at openjdk.org> wrote:

> Initial implementation.

src/java.base/share/classes/java/lang/Float.java line 1044:

> 1042:         }
> 1043: 
> 1044:         assert -14 <= bin16Exp  && bin16Exp <= 15;

assert -15 < bin16Exp  && bin16Exp < 16;

is perhaps more readable because the code above uses -15 and 16: less mental calculation at no runtime cost ;-)

src/java.base/share/classes/java/lang/Float.java line 1056:

> 1054:                    // formats
> 1055:                    (bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11)));
> 1056:         return sign * Float.intBitsToFloat(result);

int result = (floatExpBits |
                   // Shift left difference in the number of
                   // significand bits in the float and binary16
                   // formats
                   (bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11)));

avoids a useless `|` operation

src/java.base/share/classes/java/lang/Float.java line 1090:

> 1088:     public static short floatToBinary16AsShortBits(float f) {
> 1089:         if (Float.isNaN(f)) {
> 1090:             // Arbitrary binary16 NaN value; could try to preserve the

// Arbitrary binary16 quiet NaN value; could try to preserve the

src/java.base/share/classes/java/lang/Float.java line 1100:

> 1098: 
> 1099:         // The overflow threshold is binary16 MAX_VALUE + 1/2 ulp
> 1100:         if (abs_f > (65504.0f + 16.0f) ) {

if (abs_f >= (65504.0f + 16.0f) ) {

A value exactly halfway between binary16 MAX_VALUE and 2^16 must round to infinity.
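
To see why the comparison should be inclusive: 65504 is 0x1.ffcp15, so its binary16 significand field is 0x3ff, which is odd. Under round-to-nearest-even the tie at 65504 + 16 = 65520 goes to the even neighbour, 2^16, which is out of range for binary16 and therefore becomes infinity. A minimal sketch comparing the two checks; the class and method names are hypothetical and not part of the patch.

```java
// Hypothetical helper, not part of the patch: compares the strict and
// the inclusive form of the binary16 overflow check.
class OverflowThresholdSketch {
    static final float MAX_BINARY16 = 65504.0f; // 0x1.ffcp15, largest finite binary16
    static final float HALF_ULP = 16.0f;        // ulp in the top binade is 2^5 = 32

    // Check as written in the patch under review.
    static boolean overflowsStrict(float absF) {
        return absF > (MAX_BINARY16 + HALF_ULP);
    }

    // Check as suggested in this review.
    static boolean overflowsInclusive(float absF) {
        return absF >= (MAX_BINARY16 + HALF_ULP);
    }

    public static void main(String[] args) {
        float tie = 65520.0f; // exactly halfway between 65504 and 2^16
        System.out.println(overflowsStrict(tie));    // false: the tie is missed
        System.out.println(overflowsInclusive(tie)); // true: the tie overflows
    }
}
```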

src/java.base/share/classes/java/lang/Float.java line 1124:

> 1122:                 // 2^(-125) -- since (-125 = -149 - (-24)) -- so that
> 1123:                 // the trailing bits of a subnormal float represent
> 1124:                 // the correct trailing bits of a binary16 subnormal.

I would write intervals (ranges) in the form `[low, high]`, so `[-24, -15]` and `[-149, -140]`.

src/java.base/share/classes/java/lang/Float.java line 1127:

> 1125:                 exp = -15; // Subnormal encoding using -E_max.
> 1126:                 float f_adjust = abs_f * 0x1.0p-125f;
> 1127:                 signif_bits = (short)(Float.floatToRawIntBits(f_adjust) & 0x03ff);

I think the `if` and the `exp++` can be avoided if the `& 0x03ff` is dropped altogether.

               signif_bits = (short)Float.floatToRawIntBits(f_adjust);

The reason is the same as for the normalized case below: a carry out of the significand bits flows into the exponent field.
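
A small sketch of the carry argument, using an input just below the smallest normal binary16 value (2^-14). The scaling by `0x1.0p-125f` is taken from the patch; the class and method names are hypothetical, and the sketch assumes the exponent and significand fields are combined by addition.

```java
// Hypothetical sketch, not part of the patch.
class SubnormalCarrySketch {
    // Trailing float bits after scaling, without the & 0x03ff mask.
    static int unmaskedSignifBits(float absF) {
        // The product rounds to a float subnormal, which also performs
        // the rounding to the binary16 subnormal grid.
        float fAdjust = absF * 0x1.0p-125f;
        return Float.floatToRawIntBits(fAdjust);
    }

    // Assemble the binary16 magnitude bits for the subnormal branch (exp == -15).
    static int assemble(int signifBits) {
        int exp = -15;
        return ((exp + 15) << 10) + signifBits;
    }

    public static void main(String[] args) {
        // Largest float below 2^-14; it rounds up to 2^-14 in binary16.
        float absF = Math.nextDown(0x1.0p-14f);
        int signifBits = unmaskedSignifBits(absF);
        System.out.println(Integer.toHexString(signifBits));           // 400
        System.out.println(Integer.toHexString(assemble(signifBits))); // 400
    }
}
```

Here the unmasked value 0x400 lands in the lowest exponent bit, giving 0x0400, the binary16 encoding of 2^-14, with no explicit `if` or `exp++`.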

src/java.base/share/classes/java/lang/Float.java line 1141:

> 1139: 
> 1140:                 // Significand bits as if using rounding to zero (truncation).
> 1141:                 signif_bits = (short)((doppel & 0x0007f_e000) >>

signif_bits = (short)((doppel & 0x007f_e000) >>

or even

               signif_bits = (short)((doppel & 0x007f_ffff) >>

32-bit hex constants are more readable when written with 8 hex digits.
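
Both masks yield the same significand bits because the right shift discards the low bits anyway. A quick sketch to check this; the shift amount of 13 is an assumption (FloatConsts.SIGNIFICAND_WIDTH - 11, with a width of 24), and the class and method names are hypothetical.

```java
// Hypothetical sketch, not part of the patch.
class SignificandMaskSketch {
    // Assumed shift: FloatConsts.SIGNIFICAND_WIDTH - 11 = 24 - 11.
    static final int SHIFT = 13;

    // Mask that keeps only the 10 bits surviving the shift.
    static short narrowMask(int doppel) {
        return (short) ((doppel & 0x007f_e000) >> SHIFT);
    }

    // Mask that keeps the whole 23-bit float fraction field.
    static short wideMask(int doppel) {
        return (short) ((doppel & 0x007f_ffff) >> SHIFT);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 1.5f, 3.1415927f, Math.nextDown(2.0f)};
        for (float f : samples) {
            int doppel = Float.floatToRawIntBits(f);
            System.out.println(narrowMask(doppel) == wideMask(doppel)); // true
        }
    }
}
```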

src/java.base/share/classes/java/lang/Float.java line 1163:

> 1161:                 int round =  doppel & 0x00000_1000;
> 1162:                 int sticky = doppel & 0x00000_0fff;
> 1163: 

int lsb    = doppel & 0x0000_2000;
int round  = doppel & 0x0000_1000;
int sticky = doppel & 0x0000_0fff;

As above, these are 32-bit hex constants and should have at most 8 hex digits.

src/java.base/share/classes/java/lang/Float.java line 1166:

> 1164:                 if (((lsb == 0) && (round != 0) && (sticky != 0)) ||
> 1165:                     ( lsb != 0  &&  round != 0 ) ) { // sticky not needed
> 1166:                     // Due to the representational properties, an

if (round != 0 && (sticky != 0 || lsb != 0)) {

is more succinct.
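
The two conditions agree on every combination of zero and nonzero inputs, which can be checked exhaustively. A minimal sketch; the class and method names are hypothetical, and the sample nonzero values are simply bits taken from the masks above.

```java
// Hypothetical sketch, not part of the patch.
class RoundingConditionSketch {
    // Condition as written in the patch under review.
    static boolean original(int lsb, int round, int sticky) {
        return ((lsb == 0) && (round != 0) && (sticky != 0)) ||
               (lsb != 0 && round != 0);
    }

    // Condition as suggested in this review.
    static boolean simplified(int lsb, int round, int sticky) {
        return round != 0 && (sticky != 0 || lsb != 0);
    }

    // Only zero versus nonzero matters, so 8 combinations cover all cases.
    static boolean agreeOnAllInputs() {
        int[] lsbs = {0, 0x2000};
        int[] rounds = {0, 0x1000};
        int[] stickies = {0, 0x0001};
        for (int lsb : lsbs) {
            for (int round : rounds) {
                for (int sticky : stickies) {
                    if (original(lsb, round, sticky) != simplified(lsb, round, sticky)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAllInputs()); // true
    }
}
```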

src/java.base/share/classes/java/lang/Float.java line 1174:

> 1172: 
> 1173:             short result = 0;
> 1174:             result = (short)(((exp + 15) << 10) | signif_bits);

result = (short)(((exp + 15) << 10) + signif_bits);

The final exponent needs to be incremented when `signif_bits == 0x400`. The `|` is not enough for this to happen.
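
A short sketch of the failure mode: when the biased exponent is odd, bit 10 of the shifted exponent is already set, so OR-ing in 0x400 changes nothing, while addition carries into the exponent. The class and method names are hypothetical.

```java
// Hypothetical sketch, not part of the patch.
class ExponentCarrySketch {
    // Field combination as written in the patch under review.
    static short combineWithOr(int exp, int signifBits) {
        return (short) (((exp + 15) << 10) | signifBits);
    }

    // Field combination as suggested in this review.
    static short combineWithAdd(int exp, int signifBits) {
        return (short) (((exp + 15) << 10) + signifBits);
    }

    public static void main(String[] args) {
        // A float just below 2.0f has exp == 0 and rounds up to
        // signif_bits == 0x400; the correct binary16 result is 2.0 (0x4000).
        System.out.println(Integer.toHexString(combineWithOr(0, 0x400)));  // 3c00, i.e. 1.0
        System.out.println(Integer.toHexString(combineWithAdd(0, 0x400))); // 4000, i.e. 2.0
    }
}
```

With `+`, the largest case (exp == 15, signif_bits == 0x400) gives 0x7c00, the binary16 infinity encoding.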

src/java.base/share/classes/java/lang/Float.java line 1175:

> 1173:             short result = 0;
> 1174:             result = (short)(((exp + 15) << 10) | signif_bits);
> 1175:             return (short)(sign_bit | (0x7fff & result));

return (short)(sign_bit | result);

because `result <= 0x7fff`, so the mask has no effect.

-------------

PR: https://git.openjdk.org/jdk/pull/9422