Feature #12222

Introducing basic statistics methods for Enumerable (and optimized implementation for Array)

Added by mrkn (Kenta Murata) over 1 year ago. Updated about 1 year ago.

Status: Assigned
Priority: Normal
Target version: -
[ruby-core:74607]

Description

Since Python has provided a statistics library for calculating the mean, variance, etc. of arrays and iterators since version 3.4,
I would like to propose introducing such features for the built-in Enumerable, with an optimized implementation for Array.

Especially I want to provide Enumerable#mean and Enumerable#variance as built-in features because they should be implemented by precision compensated algorithms.
The following example shows that the simple variance algorithm cannot give us the standard deviation for some arrays, because it produces a negative variance.

class Array
  # Kahan summation
  def sum
    s = 0.0
    c = 0.0
    n = self.length
    i = 0
    while i < n
      y = self[i] - c
      t = s + y
      c = (t - s) - y
      s = t
      i += 1
    end
    s
  end

  # precision compensated algorithm
  def variance
    n = self.length
    return Float::NAN if n < 2
    m1 = 0.0
    m2 = 0.0
    i = 0
    while i < n
      x = self[i]
      delta = x - m1
      m1 += delta / (i + 1)
      m2 += delta*(x - m1)
      i += 1
    end
    m2 / (n - 1)
  end
end

ary = [ 1.0000000081806004, 1.0000000009124625, 1.0000000099201818, 1.0000000061821668, 1.0000000042644555 ]

# simple variance algorithm
a = ary.map {|x| x ** 2 }.sum
b = ary.sum ** 2 / ary.length
p (a - b) / (ary.length - 1)  #=> -2.220446049250313e-16

# precision compensated algorithm
p ary.variance  #=> 1.2248208046392579e-17

I think a precision-compensated algorithm is too complicated to expect users to implement it themselves.
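The description asks for Enumerable#mean but only shows sum and variance. As a rough sketch of what a compensated mean could look like, the helper below reuses the Kahan summation loop from above and relies only on #each. The name compensated_mean is made up for this illustration and is not part of the proposal.

```ruby
# Hypothetical helper (not from the proposal): mean via Kahan summation,
# written against #each so it also accepts ranges and enumerators.
def compensated_mean(enum)
  sum = 0.0
  comp = 0.0 # running compensation for lost low-order bits
  count = 0
  enum.each do |x|
    y = x - comp
    t = sum + y
    comp = (t - sum) - y
    sum = t
    count += 1
  end
  return Float::NAN if count.zero?
  sum / count
end

p compensated_mean([1.0, 2.0, 3.0, 4.0])  #=> 2.5
```

Returning Float::NAN for an empty collection mirrors what the variance example does for fewer than two elements; raising an error instead would be an equally defensible choice.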


Related issues

Related to Ruby trunk - Feature #12217: Introducing Enumerable#sum for precision compensated summation and revert r54237 Closed

History

#1 Updated by mrkn (Kenta Murata) over 1 year ago

  • Related to Feature #12217: Introducing Enumerable#sum for precision compensated summation and revert r54237 added

#2 [ruby-core:74608] Updated by mrkn (Kenta Murata) over 1 year ago

> Especially I want to provide Enumerable#mean and Enumerable#variance as built-in features because they should be implemented by precision compensated algorithms.

Sorry, I don't want to make them built-in features. But I want to make them standard library features, at least.

#3 [ruby-core:74617] Updated by Eregon (Benoit Daloze) over 1 year ago

It seems to me Enumerable is not the right place for this, because it expects more than just #each.
Also, these methods are likely useful only for numeric collections.

Maybe a "Statistics" module in the stdlib?
Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.

#4 [ruby-core:74733] Updated by duerst (Martin Dürst) about 1 year ago

Benoit Daloze wrote:

> It seems to me Enumerable is not the right place for this, because it expects more than just #each.

The code is currently written in terms of #length and #[], but this can easily be fixed to use #each.
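A minimal sketch of that change, assuming the same one-pass update as the Array version in the description: the counter replaces both #length and #[], so any object that includes Enumerable would work. Reopening Enumerable here is purely for illustration; Ruby does not define this method.

```ruby
# Sketch only: Enumerable#variance is not part of Ruby.
module Enumerable
  def variance
    n = 0
    m1 = 0.0 # running mean
    m2 = 0.0 # running sum of squared deviations
    each do |x|
      n += 1
      delta = x - m1
      m1 += delta / n
      m2 += delta * (x - m1)
    end
    return Float::NAN if n < 2
    m2 / (n - 1)
  end
end

ary = [1.0000000081806004, 1.0000000009124625, 1.0000000099201818,
       1.0000000061821668, 1.0000000042644555]
p ary.variance       # Array
p ary.each.variance  # Enumerator, which has no #[]
p (1..4).variance    # Range
```

The element count is only known after the loop, so the guard for fewer than two elements moves to the end.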

> Also, these methods are likely useful only for numeric collections.

Then just don't use them on other collections :-).

> Maybe a "Statistics" module in the stdlib?
> Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.

Why? I don't see much potential for conflicts. Or does anybody have any mean (as opposed to nice) collections?

Also, as far as I understand, a bigger API doesn't really slow anything down.

I would definitely see providing these (and more) statistical methods for Ruby as a big plus.

#5 [ruby-core:74736] Updated by Eregon (Benoit Daloze) about 1 year ago

Martin Dürst wrote:

> Benoit Daloze wrote:
>
>> It seems to me Enumerable is not the right place for this, because it expects more than just #each.
>
> The code is currently written in terms of #length and #[], but this can easily be fixed to use #each.
>
>> Also, these methods are likely useful only for numeric collections.
>
> Then just don't use them on other collections :-).

That's my point. Enumerable methods should work on any collection implementing #each.
Not only on collections whose #each yields numeric values or something with #+/#- methods (and then the result of #- should respond to #/, etc., so the semantics get complex if it's not numeric).
Also, what would be the result of calling ["a", "b"].variance? A NoMethodError?
Currently, it seems Enumerable only relies on #each, and for sort* additionally on #<=>.

>> Maybe a "Statistics" module in the stdlib?
>> Statistics.mean/variance/etc(enum) would be a nicer API than mixing everything in Enumerable IMHO.
>
> Why? I don't see much potential for conflicts. Or does anybody have any mean (as opposed to nice) collections?

Yes, Statistics would be a potential namespace conflict, and indeed direct methods might be less likely to conflict.
I am not sure how to evaluate this.

It could also be a module to include (which would then specify clearly its requirement on the elements):
class Sample
  include Enumerable, Statistics
  def each; ...; end
end
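To make the include pattern concrete, here is one possible shape for such a module. Everything in it is an assumption made for illustration: the Statistics module does not exist in Ruby, and the #mean body and the Sample constructor are invented here. It also shows what happens with non-numeric elements when the implementation does arithmetic directly on them.

```ruby
# Hypothetical mixin: requires #each to yield objects that support
# subtraction from, and addition to, a Float.
module Statistics
  def mean
    n = 0
    m = 0.0
    each do |x|
      n += 1
      m += (x - m) / n # incremental mean update
    end
    n.zero? ? Float::NAN : m
  end
end

class Sample
  include Enumerable, Statistics

  def initialize(*values)
    @values = values
  end

  def each(&block)
    return enum_for(:each) unless block_given?
    @values.each(&block)
    self
  end
end

p Sample.new(1, 2, 3, 4).mean  #=> 2.5

begin
  Sample.new("a", "b").mean
rescue NoMethodError => e
  puts "non-numeric elements: #{e.class}"
end
```

With this particular body, String elements fail on the first subtraction with a NoMethodError, because String does not define #-. A real library could instead check element types up front and raise something more descriptive.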

> Also, as far as I understand, a bigger API doesn't really slow anything down.

I am not concerned about performance here.
I very much like an exhaustive module like Enumerable, but I think it should stay consistent in what it provides and expects.

> I would definitely see providing these (and more) statistical methods for Ruby as a big plus.

Yes, do not get me wrong, I totally agree with that!

Would it make sense to add stuff like #confidence_interval on Enumerable?
I think that would belong more nicely to a Statistics module (in core or standard library).

It's also hard to draw a line between well-known statistics methods like sum, average, stddev and more complex ones like confidence intervals, other mean and variance estimators, etc.
If it's not in Enumerable, there is no need to draw such a line.

#6 [ruby-core:74921] Updated by matz (Yukihiro Matsumoto) about 1 year ago

  • Assignee changed from matz (Yukihiro Matsumoto) to akr (Akira Tanaka)

Hi,

I agree with adding sum to Array. It is natural and easy to define.
I disagree (for now) with adding it to Enumerable since it may not be meaningful (e.g. Hash).

Matz.
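As a small illustration of the Hash concern above: Hash#each yields [key, value] pairs, so a generic numeric fold over a Hash has nothing numeric to add. The snippet uses a plain inject rather than any proposed method, and the counts hash is made-up sample data.

```ruby
counts = { apples: 3, pears: 5 }

# Each element seen through Enumerable is a [key, value] pair.
p counts.first  #=> [:apples, 3]

begin
  counts.inject(0) { |acc, pair| acc + pair }
rescue TypeError => e
  puts "summing pairs fails: #{e.class}"
end

# Summing is only meaningful after choosing what to sum, e.g. the values.
total = counts.values.inject(0) { |acc, v| acc + v }
p total  #=> 8
```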
