Spec and API review for {Int,Long,Double}SummaryStatistics
Jim Mayer
jim at pentastich.org
Mon Apr 1 10:53:32 PDT 2013
On Mon, Apr 1, 2013 at 10:47 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
> The motivation for sumOfSquares() is indeed to help in calculation of
> variance. As you've noted, there are multiple forms this can take (e.g.,
> sample vs population). Modulo numerical issues, sum(sq) is an input to all
> the various forms, so we theoretically stay out of whack-a-mole territory
> by providing this form rather than trying to provide all the various forms
> people might want.
>
> Note that *not* providing any help here is a disaster for those who want
> it; they have to materialize the collection and then make two passes. It's
> not like those users can just (safely) extend the summary statistics to
> also calculate the part they need.
>
> Note also that for numeric types like long, there are no numerical issues.
> So punishing long for his brother's instability just seems mean.
>
>
Sadly, while this is true for sumsq, it is not true for the calculation of
variance using the sum of squares. The problems occur when either the
individual values or N is large. Here's an example:

  Sample size: 100
  Values: all values are 1, except for one that is 2.
    sumsq              -> 103
    sum(x)^2/N         -> 102.01
    sumsq - sum(x)^2/N -> 0.99

  Sample size: 1000000
  Values: all values are 1, except for one that is 2.
    sumsq              -> 1000003
    sum(x)^2/N         -> 1000002.000001
    sumsq - sum(x)^2/N -> 0.999999

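Basically, as N gets bigger the two sums get larger and larger while their
difference, 1 - 1/N (the sum of squared deviations that the variance is
computed from), approaches one. The variance therefore comes from
subtracting two large, nearly equal numbers, and the leading digits cancel.
This is an unstable computation.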
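
To make the cancellation concrete, here is a minimal sketch in Java. It is
an illustration only, not code from the proposed API: the class and method
names are hypothetical, the data set (four values near 1e9 whose true sample
variance is 30) is chosen just for the demonstration, and Welford's one-pass
update is included purely as a point of comparison.

```java
public class VarianceDemo {
    // Sample variance from a running sum and sum of squares,
    // i.e. (sumsq - sum(x)^2/N) / (N - 1).
    static double sumOfSquaresVariance(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;
        }
        int n = xs.length;
        return (sumSq - sum * sum / n) / (n - 1);
    }

    // Sample variance using Welford's one-pass update (comparison only;
    // this technique is not part of the original discussion).
    static double welfordVariance(double[] xs) {
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        int n = 0;
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        // Deviations from the mean are -6, -3, 3, 6, so the true
        // sample variance is 90 / 3 = 30.
        double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("sum-of-squares form: " + sumOfSquaresVariance(xs));
        System.out.println("Welford update:      " + welfordVariance(xs));
    }
}
```

Each square here is about 1e18, well beyond the 2^53 range in which a double
represents integers exactly, so the two large terms are already rounded
before they are subtracted, and the sum-of-squares form cannot recover 30.
The one-pass update only ever works with small deviations and stays accurate
on this input.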
Jim Mayer