Spec and API review for {Int,Long,Double}SummaryStatistics
Brian Goetz
brian.goetz at oracle.com
Fri Mar 29 15:16:39 PDT 2013
> Also, while I'm here...
>
> Exposing sumOfSquares() does not permit users to safely calculate variance, which I believe makes it fairly useless and even dangerous:
>
> "The failure of Cauchy's fundamental inequality is another important example of the breakdown of traditional algebra in the presence of floating point arithmetic...Novice programmers who calculate the standard deviation of some observations by using the textbook formula [formula for the standard deviation in terms of the sum of squares] often find themselves taking the square root of a negative number!" (Knuth AoCP vol 2, section 4.2.2)
Thanks for raising this issue again -- I'd meant to respond earlier. I ran this by our numerics guys.
Basically, the problem is that for floating point numbers, since squaring makes small numbers smaller and big numbers bigger, summing squares in the obvious way risks the usual problem with adding numbers of grossly differing magnitudes. So while the naive factoring of population/sample variance allows you to compute them from sum(x) and sum(x^2), the latter is potentially numerically challenged. (Note that this problem doesn't exist for int/long, assuming a long is big enough to compute sum(x^2) without overflow.)
Still, I am not sure we do users a favor by leaving this out. Many of them are likely to simply extend DoubleSummaryStatistics to calculate sum(x^2) anyway. And the only other alternative is horrible; stream the data into a collection and make two passes on it, one for mean and one for variance. That's at least 3x as expensive, if you can fit the whole thing in memory in the first place.
The Knuth section you cite also offers a means to calculate variance more effectively in a single pass using a recurrence relation based on Kahan summation. So I think the winning move is to provide a better implementation of sumsq than either of the naive implementations above, one that uses real numerics fu. (We intend to provide a better implementation of summation for DoubleSummaryStatistics as well, based on Kahan.)
Of course the crappy implementation that is in there now is less than ideal.
More information about the lambda-libs-spec-observers
mailing list