Spec and API review for {Int,Long,Double}SummaryStatistics

Jim Mayer jim at pentastich.org
Mon Apr 1 10:53:32 PDT 2013


On Mon, Apr 1, 2013 at 10:47 AM, Brian Goetz <brian.goetz at oracle.com> wrote:

> The motivation for sumOfSquares() is indeed to help in calculation of
> variance.  As you've noted, there are multiple forms this can take (e.g.,
> sample vs population).  Modulo numerical issues, sum(sq) is an input to all
> the various forms, so we theoretically stay out of whack-a-mole territory
> by providing this form rather than trying to provide all the various forms
> people might want.
>
> Note that *not* providing any help here is a disaster for those who want
> it; they have to materialize the collection and then make two passes. It's
> not like those users can just (safely) extend the summary statistics to
> also calculate the part they need.
>
> Note also that for numeric types like long, there are no numerical issues.
>  So punishing long for his brother's instability just seems mean.
>
>
Sadly, while this is true for sumsq, it is not true for the calculation of
variance using the sum of squares.  The problems occur when either the
individual values or N is large.  Here's an example:

Sample size: 100
Values: all values are 1, except for one that is 2.
sumsq -> 103
sum(x)^2/N -> 102.01
sumsq-sum(x)^2/N -> 0.99

Sample size: 1000000
Values: all values are 1, except for one that is 2.
sumsq -> 1000003
sum(x)^2/N -> 1000002.000001
sumsq-sum(x)^2/N -> 0.999999

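Basically, as N gets bigger the two sums get larger and larger while their
difference (the sum of squared deviations, from which the variance is
derived) approaches one.  Subtracting two large, nearly equal numbers to
get a small result loses most of the significant digits once the sums no
longer fit exactly, so this is an unstable computation.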
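
To make the instability concrete, here is a minimal Java sketch that
compares the sum-of-squares formula with Welford's one-pass update on
doubles.  Welford's method is not something proposed in this thread; it is
only used here as a well-known stable alternative for comparison.  The
class and method names (VarianceSketch, naiveVariance, welfordVariance) and
the sample data are hypothetical, chosen for illustration, and both methods
compute the population variance.

```java
public class VarianceSketch {

    // Sum-of-squares form: (sumsq - sum(x)^2 / N) / N.
    // Hypothetical helper; cancels badly when the values are large
    // relative to their spread.
    public static double naiveVariance(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;
        }
        int n = xs.length;
        return (sumSq - sum * sum / n) / n;
    }

    // Welford's one-pass update: tracks the running mean and the running
    // sum of squared deviations (m2) instead of the raw sum of squares.
    public static double welfordVariance(double[] xs) {
        double mean = 0.0;
        double m2 = 0.0;
        long n = 0;
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return m2 / n;
    }

    public static void main(String[] args) {
        // Deviations from the mean are -6, -3, 3, 6, so the exact
        // population variance is 90 / 4 = 22.5.
        double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("naive:   " + naiveVariance(xs));
        System.out.println("welford: " + welfordVariance(xs));
    }
}
```

With these inputs each x*x is around 1e18, well past the point where a
double can represent every integer, so the naive result cannot land near
22.5, while the Welford update stays on small deviations and recovers it.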

Jim Mayer

