[lworld+fp16] RFR: 8329817: Augment prototype Float16 class [v5]

Joe Darcy darcy at openjdk.org
Mon Jun 17 20:17:31 UTC 2024


On Fri, 14 Jun 2024 06:19:28 GMT, Joe Darcy <darcy at openjdk.org> wrote:

>>> Hi @jddarcy, apart from a few minor comments the patch looks good to me; however, there is a build error due to a malformed javadoc comment.
>>> 
>>> Kindly fix and integrate.
>> 
>> Thanks; let me take a pass at writing at least some basic regression tests before pushing.
>> 
>> @jatin-bhateja, do you know if promoting the three operands of a Float16 fma to double, doing the operation in double, and rounding to Float16 is sufficient to correctly implement a Float16 fma? I haven't worked through all the cases yet and I'm not certain there cannot be double-rounding issues. (If double rounding turns out to be a problem, I was thinking it would be possible to see if (a*b + c) was exact in double and, if not, add in a sticky bit to make sure the rounding occurs properly, but I haven't developed the details yet.)
>
>> Hi @jddarcy, as per the specification of [Math.fma(float, float, float)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Math.java#L2494), the internal computation of the constituent operations (mul and add) should be done at infinite precision and only the final result should be rounded. We are now upcasting Float16 to double, but that will not prevent rounding from happening for the mul and add.
> 
> Right; the fma needs to operate as if it used infinite precision internally. This could be implemented (slowly) using JDK classes by doing a*b + c in BigDecimal and then converting the BigDecimal result to Float16. I've been considering adding a BigDecimal -> Float16 conversion anyway for completeness in the platform.
> 
> My understanding of how fma is implemented in hardware is that for a format with P bits of precision, there is a ~2P-bit-wide internal register to hold the exact product as an intermediate result. The value being added in can then be aligned at the right exponent location, and the final rounding back to P bits of precision can occur, with logic for a sticky bit for round-to-nearest-even, etc.
> 
> There are many cases where double (P = 53) will exactly hold the product and sum of three Float16 (P = 11) operands. However, the product can be so large or so small that rounding occurs when the third operand is added in.
> 
> I haven't worked through whether the potential round-offs are all benign with the final rounding to Float16 or whether some corrective action would need to be taken to get the effect of a sticky bit. For example, if a*b is so large that the highest exponent position set is more than 53 positions away from the lowest exponent position set in c, but the final result is going to overflow anyway, the round-off in computing a*b + c in double doesn't matter. It might be problematic if a*b is much smaller than c, but that is another case I haven't fully thought through yet.
> 
> I'll give an update on my analysis/research on this fma issue by next week.


An update: promoting the three Float16 fma operands to double, computing (a*b) + c in double, and rounding once at the end to Float16 works for (at least) most possible operands of a Float16 fma. It may work for all operands; I'm still working through those details.
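
For concreteness, the candidate implementation under discussion is just the following (a minimal sketch; the doubleValue() and round-to-nearest valueOf(double) conversions on the prototype Float16 class are assumed here):

```java
// Candidate Float16 fma: promote to double, form the product-sum in
// double, and round once at the end back to Float16.
static Float16 fma(Float16 a, Float16 b, Float16 c) {
    return Float16.valueOf(a.doubleValue() * b.doubleValue()
                           + c.doubleValue());
}
```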

To summarize my findings: if the product-sum held in double is exact, the single rounding to Float16 will compute the correct result.

Each Float16 operand is exactly convertible to double and the a*b product is exact since double has more than twice the precision of Float16. That leaves analyzing whether or not the product-sum is exact.
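
As a quick sanity check of the exact-product claim (a toy snippet; it assumes Float16.MAX_VALUE is 65504):

```java
import java.math.BigDecimal;

public class ProductExactness {
    public static void main(String[] args) {
        double d = 65504.0; // Float16.MAX_VALUE, exactly representable in double
        BigDecimal exact = new BigDecimal(d).multiply(new BigDecimal(d));
        // The exact product needs at most 2 * 11 = 22 significand bits,
        // which fits in double's 53, so d * d introduces no rounding:
        System.out.println(exact.compareTo(new BigDecimal(d * d)) == 0); // true
    }
}
```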

For Float16 values, the possible exponent bit positions that could be set range from 2^15 (from MAX_VALUE) to 2^(-24) (from MIN_VALUE). This entire exponent range of 15 - (-24) + 1 = 40 is less than the _precision_ of double, 53 bits.

When multiplying two Float16 values together, the exponent bit positions set in the exact product range from 2^31 (MAX_VALUE squared) down to 2^(-48) (MIN_VALUE squared).

If the exact product is larger than about 2^16, the final rounded result must overflow Float16. If the exact product has a highest exponent bit of 2^16, the product-sum will be exact since a double with that leading bit can hold a value as small as 2^(16 - 52) = 2^(-36), which is smaller than Float16.MIN_VALUE. Therefore, when the product has an exponent of 2^16 or larger, the end result is correct after conversion to Float16.

A similar argument holds when the product has an exponent in the normal range of Float16; any additional sum will be exact given the precision of double.

However, if the product is sufficiently small, it is possible the double will not be able to hold the exact product-sum, since the exponent span from 2^(-48) up to 2^(15) is larger than the precision of double. Even so, for many operands with such a tiny product, the addition may still be exact in practice. This exactness can be tested for using the 2Sum algorithm (https://en.wikipedia.org/wiki/2Sum) by checking for a zero trailing "t" component.
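
In code, that test is short (a sketch of 2Sum as described on the page linked above; unlike Fast2Sum, it assumes nothing about the relative magnitudes of the addends):

```java
// 2Sum: recover the exact rounding error t of the double addition a + b;
// t == 0.0 exactly when a + b was computed without rounding.
static boolean sumIsExact(double a, double b) {
    double s = a + b;
    double aPrime = s - b;      // a's contribution as represented in s
    double bPrime = s - aPrime; // b's contribution as represented in s
    double t = (a - aPrime) + (b - bPrime); // total rounding error
    return t == 0.0;
}
```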

Fortunately, in terms of computing the final answer and rounding complications, if c is in the subnormal range of Float16 (exponent range 2^(-14) to 2^(-24)), the product-sum will be exactly representable in double.

That leaves a possibly non-exact product-sum arising from the combination of a product in the subnormal range of Float16 and a not-small c term being added in. However, if this product-sum is non-exact, the smaller term from the product, with at most 22 exponent bit positions set, and the 11 bits from c being summed in must be separated by at least 53 - (22 + 11) = 20 bit positions; otherwise the product-sum would fit in a double. I believe this implies at least one of the double-rounding scenarios cannot occur, in particular a half-way result in the smaller precision (Float16 in this case) rounding differently because sticky bit information from the higher precision was rounded away.

I'll keep working through the non-exact cases. Assuming a `Float16 valueOf(BigDecimal)` method is added, an "obviously right" approach would be to use a*b + c in double for all exact cases and fail over to computing the exact result in BigDecimal, then rounding that to Float16, for all the inexact cases. A sketch of that combined strategy follows.
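
(A sketch for finite operands only; NaN and infinity handling is elided, valueOf(BigDecimal) is the hypothetical conversion mentioned above, and sumIsExact is the 2Sum helper sketched earlier.)

```java
import java.math.BigDecimal;

static Float16 fma(Float16 a, Float16 b, Float16 c) {
    double p  = a.doubleValue() * b.doubleValue(); // always exact in double
    double dc = c.doubleValue();
    if (sumIsExact(p, dc)) {            // 2Sum test from the earlier sketch
        return Float16.valueOf(p + dc); // single rounding, double -> Float16
    }
    // Inexact in double: p is still the exact product, so the exact
    // product-sum can be formed in BigDecimal and rounded just once.
    BigDecimal exact = new BigDecimal(p).add(new BigDecimal(dc));
    return Float16.valueOf(exact);      // hypothetical BigDecimal -> Float16
}
```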

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1117#issuecomment-2174342914

