RFR: JDK-8289551: Conversions between bit representations of half precision values and floats
Raffaello Giulietti
duke at openjdk.org
Wed Jul 13 14:18:12 UTC 2022
On Fri, 8 Jul 2022 06:11:22 GMT, Joe Darcy <darcy at openjdk.org> wrote:
> Initial implementation.
src/java.base/share/classes/java/lang/Float.java line 1044:
> 1042: }
> 1043:
> 1044: assert -14 <= bin16Exp && bin16Exp <= 15;
assert -15 < bin16Exp && bin16Exp < 16;
is perhaps more readable because the code above uses -15 and 16: less mental calculation at no runtime cost ;-)
src/java.base/share/classes/java/lang/Float.java line 1056:
> 1054: // formats
> 1055: (bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11)));
> 1056: return sign * Float.intBitsToFloat(result);
int result = (floatExpBits |
// Shift left difference in the number of
// significand bits in the float and binary16
// formats
(bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11)));
avoids a useless `|` operation
src/java.base/share/classes/java/lang/Float.java line 1090:
> 1088: public static short floatToBinary16AsShortBits(float f) {
> 1089: if (Float.isNaN(f)) {
> 1090: // Arbitrary binary16 NaN value; could try to preserve the
// Arbitrary binary16 quiet NaN value; could try to preserve the
src/java.base/share/classes/java/lang/Float.java line 1100:
> 1098:
> 1099: // The overflow threshold is binary16 MAX_VALUE + 1/2 ulp
> 1100: if (abs_f > (65504.0f + 16.0f) ) {
if (abs_f >= (65504.0f + 16.0f) ) {
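A value exactly halfway must round to infinity: binary16 MAX_VALUE has an odd significand, so round-to-nearest-even resolves the tie upward, past the largest finite value.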
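For illustration, a minimal sketch of the boundary case (the class and method names are hypothetical, not from the patch). 65504 + 16 = 65520 is exactly representable as a float and lies exactly halfway between binary16 MAX_VALUE and 2^16, so only the `>=` form sends it to infinity.

```java
final class Binary16OverflowThreshold {
    // binary16 MAX_VALUE and half an ulp in its binade (ulp = 32).
    static final float MAX_BIN16 = 65504.0f;
    static final float HALF_ULP = 16.0f;

    // Hypothetical helper: the check as written in the patch (strict >).
    static boolean overflowsStrict(float absF) {
        return absF > (MAX_BIN16 + HALF_ULP);
    }

    // Hypothetical helper: the check as suggested in this review (>=).
    static boolean overflowsInclusive(float absF) {
        return absF >= (MAX_BIN16 + HALF_ULP);
    }

    public static void main(String[] args) {
        float halfway = 65520.0f; // exactly MAX_VALUE + 1/2 ulp
        System.out.println(overflowsStrict(halfway));    // false
        System.out.println(overflowsInclusive(halfway)); // true
        // The largest float below the halfway point still rounds to MAX_VALUE.
        System.out.println(overflowsInclusive(Math.nextDown(halfway))); // false
    }
}
```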
src/java.base/share/classes/java/lang/Float.java line 1124:
> 1122: // 2^(-125) -- since (-125 = -149 - (-24)) -- so that
> 1123: // the trailing bits of a subnormal float represent
> 1124: // the correct trailing bits of a binary16 subnormal.
I would write intervals (ranges) in the form `[low, high]`, so `[-24, -15]` and `[-149, -140]`.
src/java.base/share/classes/java/lang/Float.java line 1127:
> 1125: exp = -15; // Subnormal encoding using -E_max.
> 1126: float f_adjust = abs_f * 0x1.0p-125f;
> 1127: signif_bits = (short)(Float.floatToRawIntBits(f_adjust) & 0x03ff);
I think the `if` and the `exp++` can be avoided if the `& 0x03ff` is dropped altogether.
signif_bits = (short)Float.floatToRawIntBits(f_adjust);
The reason is the same as for the normalized case below: a carry will eventually flow into the representation for the exponent.
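A rough sketch of the carry, isolating only the significand extraction (class and method names are hypothetical). The largest float below 2^(-14) rounds up during the scaling multiplication, so the raw bits become 0x400; with the subnormal exponent field being 0, that bit pattern is already the binary16 encoding of the minimum normal value.

```java
final class SubnormalCarrySketch {
    // Scale by 2^(-125) so that the trailing bits of the subnormal float
    // line up with the trailing bits of a binary16 subnormal.
    static int scaledBits(float absF) {
        float fAdjust = absF * 0x1.0p-125f;
        return Float.floatToRawIntBits(fAdjust);
    }

    // As in the patch: keep only the 10 significand bits.
    static short withMask(float absF) {
        return (short) (scaledBits(absF) & 0x03ff);
    }

    // As suggested: keep the carry bit too.
    static short withoutMask(float absF) {
        return (short) scaledBits(absF);
    }

    public static void main(String[] args) {
        float justBelowMinNormal = Math.nextDown(0x1.0p-14f);
        // The mask drops the carry, which is why the patch needs the extra
        // `if` and `exp++`.
        System.out.println(Integer.toHexString(withMask(justBelowMinNormal)));
        // Without the mask the carry lands in bit 10, the lowest exponent bit.
        System.out.println(Integer.toHexString(withoutMask(justBelowMinNormal)));
    }
}
```

For inputs that do not round up to 2^(-14), both variants give the same bits.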
src/java.base/share/classes/java/lang/Float.java line 1141:
> 1139:
> 1140: // Significand bits as if using rounding to zero (truncation).
> 1141: signif_bits = (short)((doppel & 0x0007f_e000) >>
signif_bits = (short)((doppel & 0x007f_e000) >>
or even
signif_bits = (short)((doppel & 0x007f_ffff) >>
32-bit hex constants are more readable when they are written with exactly 8 hex digits.
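A small sketch of why the wider mask is harmless, assuming the shift amount truncated in the quoted line is 13 (that is, `FloatConsts.SIGNIFICAND_WIDTH - 11` with a width of 24). The class and method names are made up for the example.

```java
final class MaskWidthSketch {
    // Assumed shift: 24 significand bits in float minus 11 in binary16.
    static final int SHIFT = 24 - 11;

    // Mask covering only the top 10 bits of the 23-bit fraction field.
    static short narrowMask(int doppel) {
        return (short) ((doppel & 0x007f_e000) >> SHIFT);
    }

    // Mask covering the whole 23-bit fraction field.
    static short wideMask(int doppel) {
        return (short) ((doppel & 0x007f_ffff) >> SHIFT);
    }

    public static void main(String[] args) {
        int doppel = Float.floatToRawIntBits(1.2345f);
        // The 13 extra low bits are shifted out, so both masks agree.
        System.out.println(narrowMask(doppel) == wideMask(doppel));
        // The 9-digit spelling denotes the same int as the 8-digit one.
        System.out.println(0x0007f_e000 == 0x007f_e000);
    }
}
```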
src/java.base/share/classes/java/lang/Float.java line 1163:
> 1161: int round = doppel & 0x00000_1000;
> 1162: int sticky = doppel & 0x00000_0fff;
> 1163:
int lsb = doppel & 0x0000_2000;
int round = doppel & 0x0000_1000;
int sticky = doppel & 0x0000_0fff;
As above, these are 32-bit hex constants and should have at most 8 hex digits.
src/java.base/share/classes/java/lang/Float.java line 1166:
> 1164: if (((lsb == 0) && (round != 0) && (sticky != 0)) ||
> 1165: ( lsb != 0 && round != 0 ) ) { // sticky not needed
> 1166: // Due to the representational properties, an
if (round != 0 && (sticky != 0 || lsb != 0)) {
is more succinct.
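The equivalence can be checked exhaustively, since only the zero/non-zero state of the three values matters. A quick sketch (hypothetical class and method names):

```java
final class RoundPredicateCheck {
    // Condition as written in the patch.
    static boolean original(int lsb, int round, int sticky) {
        return ((lsb == 0) && (round != 0) && (sticky != 0)) ||
               (lsb != 0 && round != 0);
    }

    // Condition as suggested in this review.
    static boolean simplified(int lsb, int round, int sticky) {
        return round != 0 && (sticky != 0 || lsb != 0);
    }

    // Compare both conditions on all 8 zero/non-zero combinations.
    static boolean agreeOnAllCases() {
        int[] lsbs = { 0, 0x0000_2000 };
        int[] rounds = { 0, 0x0000_1000 };
        int[] stickies = { 0, 0x0000_0fff };
        for (int lsb : lsbs) {
            for (int round : rounds) {
                for (int sticky : stickies) {
                    if (original(lsb, round, sticky) != simplified(lsb, round, sticky)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAllCases()); // true
    }
}
```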
src/java.base/share/classes/java/lang/Float.java line 1174:
> 1172:
> 1173: short result = 0;
> 1174: result = (short)(((exp + 15) << 10) | signif_bits);
result = (short)(((exp + 15) << 10) + signif_bits);
The final exponent needs to be incremented when `signif_bits == 0x400`. The `|` is not enough for this to happen.
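A concrete case, sketched with hypothetical names: for `exp == 0` the biased exponent field is `0x3c00`, which already has bit 10 set, so OR-ing in `0x400` changes nothing and the result stays at 1.0 instead of moving to 2.0.

```java
final class ExponentCarrySketch {
    // Combination as written in the patch.
    static short combineWithOr(int exp, int signifBits) {
        return (short) (((exp + 15) << 10) | signifBits);
    }

    // Combination as suggested in this review.
    static short combineWithAdd(int exp, int signifBits) {
        return (short) (((exp + 15) << 10) + signifBits);
    }

    public static void main(String[] args) {
        int exp = 0;                 // values in [1.0, 2.0)
        int signifBits = 0x400;      // significand rounded up past all ones
        // OR loses the carry: 0x3c00 is the binary16 encoding of 1.0.
        System.out.println(Integer.toHexString(combineWithOr(exp, signifBits)));  // 3c00
        // ADD propagates it: 0x4000 is the binary16 encoding of 2.0.
        System.out.println(Integer.toHexString(combineWithAdd(exp, signifBits))); // 4000
    }
}
```

When the biased exponent happens to be even, as for `exp == 1`, both forms give the same bits, so the problem only shows up for some inputs.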
src/java.base/share/classes/java/lang/Float.java line 1175:
> 1173: short result = 0;
> 1174: result = (short)(((exp + 15) << 10) | signif_bits);
> 1175: return (short)(sign_bit | (0x7fff & result));
return (short)(sign_bit | result);
because `result <= 0x7fff`.
-------------
PR: https://git.openjdk.org/jdk/pull/9422