Spec and API review for {Int,Long,Double}SummaryStatistics

Brian Goetz brian.goetz at oracle.com
Mon Apr 1 07:47:23 PDT 2013


The motivation for sumOfSquares() is indeed to help in calculation of 
variance.  As you've noted, there are multiple forms this can take 
(e.g., sample vs population).  Modulo numerical issues, sum(sq) is an 
input to all the various forms, so we theoretically stay out of 
whack-a-mole territory by providing this form rather than trying to 
provide all the various forms people might want.

Note that *not* providing any help here is a disaster for those who want 
it; they have to materialize the collection and then make two passes. 
Its not like those users can just (safely) extend the summary statistics 
to also calculate the part they need.

Note also that for numeric types like long, there are no numerical 
issues.  So punishing long for his brother's instability just seems mean.

(For those following at home: the formalae for variance involve:
   sum((x_i - \bar x)^2)

where \bar x is the average of x.  Which means you would first have to 
make a pass to find the average, and then make another pass to calculate 
sum of squares of deviation from the mean.  Factoring the above:

   (x_i - \bar x)^2 = x_i^2 - 2 x_i \bar x + (\bar x)^2

So the sum can be expressed in terms of average and sum of squares, and 
done in a single pass.  But unfortunately since squaring makes big 
numbers bigger and small numbers smaller, you end up risking adding 
10^20 and 10^-20 and losing data when done with floating points.)

On 4/1/2013 10:23 AM, Kevin Bourrillion wrote:
> I'm confused, but I've seen nothing to change my impression that
> exposing sumOfSquares is not helpful. As unpleasant as it may seem, if
> we want to address the variance case at all, I think we have little
> choice but to expose sampleVariance() and populationVariance()
> ourselves, and then /those/ can use Kahan summation or whatever (which
> internally computes "sum of squares of deltas", not sum of squares, as I
> (don't) understand it).
>
>
> On Fri, Mar 29, 2013 at 3:16 PM, Brian Goetz <brian.goetz at oracle.com
> <mailto:brian.goetz at oracle.com>> wrote:
>
>      > Also, while I'm here...
>      >
>      > Exposing sumOfSquares() does not permit users to safely calculate
>     variance, which I believe makes it fairly useless and even dangerous:
>      >
>      > "The failure of Cauchy's fundamental inequality is another
>     important example of the breakdown of traditional algebra in the
>     presence of floating point arithmetic...Novice programmers who
>     calculate the standard deviation of some observations by using the
>     textbook formula [formula for the standard deviation in terms of the
>     sum of squares] often find themselves taking the square root of a
>     negative number!"  (Knuth AoCP vol 2, section 4.2.2)
>
>     Thanks for raising this issue again -- I'd meant to respond earlier.
>       I ran this by our numerics guys.
>
>     Basically, the problem is that for floating point numbers, since
>     squaring makes small numbers smaller and big numbers bigger, summing
>     squares in the obvious way risks the usual problem with adding
>     numbers of grossly differing magnitudes. So while the naive
>     factoring of population/sample variance allows you to compute them
>     from sum(x) and sum(x^2), the latter is potentially numerically
>     challenged.  (Note that this problem doesn't exist for int/long,
>     assuming a long is big enough to compute sum(x^2) without overflow.)
>
>     Still, I am not sure we do users a favor by leaving this out.  Many
>     of them are likely to simply extend DoubleSummaryStatistics to
>     calculate sum(x^2) anyway.  And the only other alternative is
>     horrible; stream the data into a collection and make two passes on
>     it, one for mean and one for variance.  That's at least 3x as
>     expensive, if you can fit the whole thing in memory in the first place.
>
>     The Knuth section you cite also offers a means to calculate variance
>     more effectively in a single pass using a recurrence relation based
>     on Kahan summation.  So I think the winning move is to provide a
>     better implementation of sumsq than either of the naive
>     implementations above, one that uses real numerics fu.  (We intend
>     to provide a better implementation of summation for
>     DoubleSummaryStatistics as well, based on Kahan.)
>
>     Of course the crappy implementation that is in there now is less
>     than ideal.
>
>
>
>
>
> --
> Kevin Bourrillion | Java Librarian | Google, Inc. |kevinb at google.com
> <mailto:kevinb at google.com>


More information about the lambda-libs-spec-observers mailing list