Spec and API review for {Int,Long,Double}SummaryStatistics

Mon Apr 1 07:23:27 PDT 2013

I'm confused, but I've seen nothing to change my impression that exposing
sumOfSquares is not helpful. As unpleasant as it may seem, if we want to
address the variance case at all, I think we have little choice but to
expose sampleVariance() and populationVariance() ourselves, and then
*those* can
use Kahan summation or whatever (which internally computes "sum of squares
of deltas", not sum of squares, as I (don't) understand it).

On Fri, Mar 29, 2013 at 3:16 PM, Brian Goetz <brian.goetz at oracle.com> wrote:

> > Also, while I'm here...
> >
> > Exposing sumOfSquares() does not permit users to safely calculate
> variance, which I believe makes it fairly useless and even dangerous:
> >
> > "The failure of Cauchy's fundamental inequality is another important
> example of the breakdown of traditional algebra in the presence of floating
> point arithmetic...Novice programmers who calculate the standard deviation
> of some observations by using the textbook formula [formula for the
> standard deviation in terms of the sum of squares] often find themselves
> taking the square root of a negative number!"  (Knuth AoCP vol 2, section
> 4.2.2)
>
> Thanks for raising this issue again -- I'd meant to respond earlier.  I
> ran this by our numerics guys.
>
> Basically, the problem is that for floating point numbers, since squaring
> makes small numbers smaller and big numbers bigger, summing squares in the
> obvious way risks the usual problem with adding numbers of grossly
> differing magnitudes. So while the naive factoring of population/sample
> variance allows you to compute them from sum(x) and sum(x^2), the latter is
> potentially numerically challenged.  (Note that this problem doesn't exist
> for int/long, assuming a long is big enough to compute sum(x^2) without
> overflow.)
>
> Still, I am not sure we do users a favor by leaving this out.  Many of
> them are likely to simply extend DoubleSummaryStatistics to calculate
> sum(x^2) anyway.  And the only other alternative is horrible; stream the
> data into a collection and make two passes on it, one for mean and one for
> variance.  That's at least 3x as expensive, if you can fit the whole thing
> in memory in the first place.
>
> The Knuth section you cite also offers a means to calculate variance more
> effectively in a single pass using a recurrence relation based on Kahan
> summation.  So I think the winning move is to provide a better
> implementation of sumsq than either of the naive implementations above, one
> that uses real numerics fu.  (We intend to provide a better implementation
> of summation for DoubleSummaryStatistics as well, based on Kahan.)
>
> Of course the crappy implementation that is in there now is less than
> ideal.
>
>
>

-- 
Kevin Bourrillion | Java Librarian | Google, Inc. | kevinb at google.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/lambda-libs-spec-experts/attachments/20130401/08d1bbd5/attachment.html