## Feature #12222

### Introducing basic statistics methods for Enumerable (and optimized implementation for Array)

**Description**

Python has had a statistics library for calculating the mean, variance, etc. of arrays and iterators since version 3.4. I would like to propose introducing similar features for the built-in Enumerable, with an optimized implementation for Array.

In particular, I want to provide Enumerable#mean and Enumerable#variance as built-in features because they should be implemented with precision-compensated algorithms.

The following example shows that the simple variance algorithm can return a negative variance for some arrays, which makes it impossible to calculate the standard deviation from it.

```
class Array
  # Kahan summation
  def sum
    s = 0.0
    c = 0.0
    n = self.length
    i = 0
    while i < n
      y = self[i] - c
      t = s + y
      c = (t - s) - y
      s = t
      i += 1
    end
    s
  end

  # precision compensated algorithm
  def variance
    n = self.length
    return Float::NAN if n < 2
    m1 = 0.0
    m2 = 0.0
    i = 0
    while i < n
      x = self[i]
      delta = x - m1
      m1 += delta / (i + 1)
      m2 += delta * (x - m1)
      i += 1
    end
    m2 / (n - 1)
  end
end

ary = [ 1.0000000081806004, 1.0000000009124625, 1.0000000099201818, 1.0000000061821668, 1.0000000042644555 ]

# simple variance algorithm
a = ary.map {|x| x ** 2 }.sum
b = ary.sum ** 2 / ary.length
p (a - b) / (ary.length - 1) #=> -2.220446049250313e-16

# precision compensated algorithm
p ary.variance #=> 1.2248208046392579e-17
```

I think precision-compensated algorithms are too complicated to expect users to implement them on their own.

**Related issues**

### History

#### Updated by mrkn (Kenta Murata) about 3 years ago

**Related to** *Feature #12217: Introducing Enumerable#sum for precision compensated summation and revert r54237* added

#### Updated by mrkn (Kenta Murata) about 3 years ago

> In particular, I want to provide Enumerable#mean and Enumerable#variance as built-in features because they should be implemented with precision-compensated algorithms.

Sorry, I don't want to make them built-in features. But I want to make them standard library features, at least.

#### Updated by Eregon (Benoit Daloze) about 3 years ago

It seems to me Enumerable is not the right place for this, because it expects more than just #each.

Also, these methods are likely useful only for numeric collections.

Maybe a "Statistics" module in the stdlib?

Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.
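
A minimal sketch of what such a module-function API might look like. The module name `Statistics` comes from the suggestion above, but no such stdlib module exists; the method body is an assumption that reuses the Kahan summation from the description and relies only on `#each`.

```ruby
# Hypothetical namespace, not an existing stdlib module.
module Statistics
  module_function

  # Mean computed with Kahan-compensated summation over #each.
  def mean(enum)
    sum = 0.0
    comp = 0.0 # running compensation for lost low-order bits
    count = 0
    enum.each do |x|
      y = x - comp
      t = sum + y
      comp = (t - sum) - y
      sum = t
      count += 1
    end
    return Float::NAN if count.zero?
    sum / count
  end
end

p Statistics.mean([1.0, 2.0, 3.0])
p Statistics.mean(1..4)
```

Because the collection is passed as an argument, nothing is added to Enumerable itself.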

#### Updated by duerst (Martin Dürst) about 3 years ago

Benoit Daloze wrote:

> It seems to me Enumerable is not the right place for this, because it expects more than just #each.

The code is currently written in terms of #length and #[], but this can easily be fixed to use #each.
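
As a rough illustration of that point, the variance loop from the description could be rewritten against `#each` alone. This is only a sketch: `variance_via_each` is a hypothetical helper name, not part of the proposal, and it assumes every element is numeric.

```ruby
# Hypothetical helper: the same running-mean update as Array#variance in the
# description, but driven by #each instead of #length and #[].
def variance_via_each(enum)
  n = 0
  m1 = 0.0 # running mean
  m2 = 0.0 # running sum of squared deviations from the mean
  enum.each do |x|
    n += 1
    delta = x - m1
    m1 += delta / n
    m2 += delta * (x - m1)
  end
  return Float::NAN if n < 2
  m2 / (n - 1)
end

p variance_via_each([1.0, 2.0, 3.0, 4.0])      # Array
p variance_via_each(1..4)                      # Range
p variance_via_each([1.0, 2.0, 3.0, 4.0].each) # Enumerator
```

The element count is accumulated inside the loop, so the collection does not need to respond to `#length`.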

> Also, these methods are likely useful only for numeric collections.

Then just don't use them on other collections :-).

> Maybe a "Statistics" module in the stdlib?
>
> Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.

Why? I don't see much potential for conflicts. Or does anybody have any mean (as opposed to nice) collections?

Also, as far as I understand, a bigger API doesn't really slow anything down.

I would definitely see providing these (and more) statistical methods for Ruby as a big plus.

#### Updated by Eregon (Benoit Daloze) about 3 years ago

Martin Dürst wrote:

> Benoit Daloze wrote:
>
> > It seems to me Enumerable is not the right place for this, because it expects more than just #each.
>
> The code is currently written in terms of #length and #[], but this can easily be fixed to use #each.
>
> > Also, these methods are likely useful only for numeric collections.
>
> Then just don't use them on other collections :-).

That's my point. Enumerable methods should work on any collection implementing #each.

Not only on #each returning a numeric-type or sth with a #+/#- method (and then the result of #- should respond to #/, etc, so complex semantics if it's not numeric).

Also, what would be the result of calling ["a", "b"].variance ? A NoMethodError?

Currently, it seems Enumerable only relies on each, and for sort* additionally on #<=>.
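
To make the question about `["a", "b"].variance` concrete: with the loop as posted in the description, the first operation on each element is `x - m1` with `m1 = 0.0`. String does not define a binary `#-`, so that step would presumably raise NoMethodError. The sketch below isolates that single step; `first_delta` is a hypothetical name used only for illustration.

```ruby
# First step of the variance loop from the description: delta = x - m1,
# where m1 starts at 0.0. first_delta is a hypothetical helper.
def first_delta(x)
  x - 0.0
end

begin
  first_delta("a")
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```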

> > Maybe a "Statistics" module in the stdlib?
> >
> > Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.
>
> Why? I don't see much potential for conflicts. Or does anybody have any mean (as opposed to nice) collections?

Yes, Statistics would be a potential namespace conflict and indeed direct methods might be less likely.

I am not sure how to evaluate this.

It could also be a module to include (which would then specify clearly its requirement on the elements):

    class Sample
      include Enumerable, Statistics
      def each; ...; end
    end
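
A runnable variant of that idea might look as follows. Everything here is an assumption made for illustration: the `Statistics` mixin, its single `mean` method (plain summation, not precision compensated, to keep the sketch short), and the way `Sample` stores its values.

```ruby
# Hypothetical mixin. Requirement on the including class: it defines #each
# and yields numeric values.
module Statistics
  def mean
    total = 0.0
    count = 0
    each do |x|
      total += x
      count += 1
    end
    count.zero? ? Float::NAN : total / count
  end
end

class Sample
  include Enumerable, Statistics

  def initialize(*values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
    self
  end
end

sample = Sample.new(2.0, 4.0, 6.0)
p sample.mean
p sample.respond_to?(:mean) # Sample opted in explicitly
p [1, 2].respond_to?(:mean) # plain Array is left untouched
```

Only classes that include the module gain the methods, so the numeric requirement is stated in one place.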

> Also, as far as I understand, a bigger API doesn't really slow anything down.

I am not concerned about performance here.

I very much like an exhaustive module like Enumerable, but I think it should stay consistent in what it provides and expects.

> I would definitely see providing these (and more) statistical methods for Ruby as a big plus.

Yes, do not get me wrong, I totally agree with that!

Would it make sense to add stuff like #confidence_interval on Enumerable?

I think that would belong more nicely to a Statistics module (in core or standard library).

It's also hard to draw a line between well-known statistics methods like sum, average, stddev and more complex ones like confidence, other mean and variance estimators, etc.

If it's not in Enumerable, there is no need to draw such a line.

#### Updated by matz (Yukihiro Matsumoto) about 3 years ago

**Assignee** changed from *matz (Yukihiro Matsumoto)* to *akr (Akira Tanaka)*

Hi,

I agree with adding `sum` to `Array`. It is natural and easy to define.

I disagree (for now) with adding it to `Enumerable` since it may not be meaningful (e.g. Hash).

Matz.
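
The Hash concern can be illustrated with a small sketch. It uses `inject` rather than any proposed `sum` method, and the `scores` data is made up for the example: `Hash#each` yields `[key, value]` pairs, so a naive sum tries to add an Array to an Integer.

```ruby
scores = { alice: 90, bob: 85 } # illustrative data

# Each element of a Hash is a [key, value] pair, so 0 + pair fails.
begin
  scores.inject(0) { |acc, pair| acc + pair }
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

# Summing becomes meaningful only after choosing what to sum, e.g. the values.
p scores.values.inject(0) { |acc, v| acc + v }
```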