Feature #21518

Statistical helpers to `Enumerable`

Added by Amitleshed (Amit Leshed) 3 days ago. Updated about 12 hours ago.

Status: Open
Assignee: -
Target version: -
[ruby-core:122842]

Description

Summary

I'd like to add two statistical helpers to Enumerable:

  • Enumerable#average (arithmetic mean)
  • Enumerable#median

Both are small, well-defined operations that many Rubyists re-implement in apps and gems. Providing them in core avoids repeated, ad-hoc code and aligns with Enumerable#sum, which Ruby already ships.

Motivation

  • These are among the most common “roll-your-own” helpers for arrays/ranges of numbers.
  • They are conceptually simple and useful well beyond web/Rails.
  • Similar to sum, they’re primitives for quick data analysis, ETL scripts, CLI tooling, etc.
  • Including them encourages consistent semantics (what to do with empty sets, mixed numerics, etc.).

Proposed API & Semantics

Enumerable#average -> Float or nil
Enumerable#median  -> Numeric or nil
[1, 2, 3, 4].average      # => 2.5
(1..4).average            # => 2.5
[].average                # => nil

[1, 3, 2].median          # => 2
[1, 2, 3, 10].median      # => 2.5
(1..6).median             # => 3.5
[].median                 # => nil

Ruby implementation

module Enumerable
  def average
    count = 0
    total = 0.0
    each do |x|
      raise TypeError, "non-numeric value for average" unless x.is_a?(Numeric)
      total += x
      count += 1
    end
    count.zero? ? nil : total / count
  end

  def median
    arr = to_a
    return nil if arr.empty?
    arr.each { |x| raise TypeError, "non-numeric value for median" unless x.is_a?(Numeric) }
    arr.sort!
    mid = arr.length / 2
    arr.length.odd? ? arr[mid] : (arr[mid - 1] + arr[mid]) / 2.0
  end
end

Upon approval, I'm more than willing to write the specs and implement the code in C.


Related issues 4 (1 open, 3 closed)

Related to Ruby - Feature #2321: [PATCH] Array Module sum and mean features (Rejected, 11/01/2009)
Related to Ruby - Feature #18057: Introduce Array#mean (Open)
Related to Ruby - Feature #10228: Statistics module (Feedback, 09/11/2014)
Related to Ruby - Feature #12222: Introducing basic statistics methods for Enumerable (and optimized implementation for Array) (Closed, akr (Akira Tanaka))

Updated by Dan0042 (Daniel DeLorme) 2 days ago · Edited

In favor; just be careful about the bug in #median:

x = [1, 3, 2]
x.median #=> 2
x #=> [1, 2, 3] modified by #median

You'll want to use arr = entries rather than arr = to_a, since Array#to_a returns the receiver itself and sort! then reorders the caller's array.
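A small check of the difference, as a sketch only (the variable names are illustrative and not part of the proposal): Array#to_a hands back the same object, while Enumerable#entries builds a fresh Array that can be sorted in place safely.

```ruby
x = [1, 3, 2]

# Array#to_a returns the receiver itself, so no copy is made.
same_object = x.to_a.equal?(x)

# Enumerable#entries collects the elements into a new Array.
copy = x.entries
copy_is_distinct = !copy.equal?(x)

# Sorting the copy in place leaves the caller's array untouched.
copy.sort!
original_order_kept = (x == [1, 3, 2])
```

Using the non-mutating arr = to_a.sort would avoid the problem as well.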

Updated by Amitleshed (Amit Leshed) 1 day ago

Thanks, great catch!

Updated by herwin (Herwin W) about 17 hours ago

Ranges might need their own specialised implementation: this implementation will time out on infinite ranges, and (1..100000).average (or .median) can be calculated without having to create an intermediate array. (Why anyone would want to calculate these values from this kind of Range is beyond me, but that's another issue.)
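To illustrate the point about ranges, here is a minimal sketch of a closed-form calculation. The helper name integer_range_mean is hypothetical and not part of the proposal, and the sketch assumes a finite range with Integer endpoints. For consecutive integers the mean and the median coincide, so the same value serves both.

```ruby
# Hypothetical helper (not part of the proposal): mean of a finite Integer
# range computed from its endpoints, without building an intermediate array.
def integer_range_mean(range)
  first = range.begin
  last = range.end
  unless first.is_a?(Integer) && last.is_a?(Integer)
    # Covers endless/beginless ranges (nil endpoint) and Float endpoints.
    raise ArgumentError, "finite Integer range required"
  end
  last -= 1 if range.exclude_end?   # (1...5) ends at 4
  return nil if last < first        # empty range such as (5..1)
  (first + last) / 2.0
end

integer_range_mean(1..100_000)  # => 50000.5
integer_range_mean(1...5)       # => 2.5
integer_range_mean(5..1)        # => nil
```

Rejecting endless ranges up front means the call fails immediately instead of iterating forever.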

Updated by Amitleshed (Amit Leshed) about 13 hours ago


Thanks for the engagement, everyone.


Here's a refactored version:

module Enumerable
  def average
    return nil if none?
    return range_midpoint if numeric_range?

    total = 0.0
    count = 0
    each do |x|
      raise TypeError, "non-numeric value for average" unless x.is_a?(Numeric)
      total += x
      count += 1
    end
    total / count
  end

  def median
    return nil if none?
    return range_midpoint if numeric_range?

    arr = entries
    arr.each { |x| raise TypeError, "non-numeric value for median" unless x.is_a?(Numeric) }
    arr.sort!
    mid = arr.length / 2
    arr.length.odd? ? arr[mid] : (arr[mid - 1] + arr[mid]) / 2.0
  end

  private

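  def numeric_range?
    # Integer endpoints only: the midpoint shortcut is wrong for Float bounds such as (1..4.5).
    is_a?(Range) && first.is_a?(Integer) && last.is_a?(Integer)
  end

  def range_midpoint
    # Range#step without a block returns an enumerator, so subtract 1 for an exclusive end.
    max = exclude_end? ? (last - 1) : last
    (first + max) / 2.0
  end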
end

Updated by mame (Yusuke Endoh) about 13 hours ago

  • Related to Feature #12222: Introducing basic statistics methods for Enumerable (and optimized implementation for Array) added

Updated by mame (Yusuke Endoh) about 12 hours ago

Naturally, these methods have been desired by some people for a very long time, but Ruby has historically been very cautious about introducing them. Even the obviously useful #sum method was only added in 2016, which is relatively recent in Ruby's history.

One reason behind this caution is the reluctance to add methods to Array that assume all elements are Integer or Float. Since Array can contain Strings or other non-numeric objects, there's a question of whether it is appropriate to add methods that make no sense in such cases.

The reason why #sum was eventually added was the growing attention to an algorithm called the Kahan-Babuska Summation Algorithm. This is a clever algorithm that reduces floating-point error when summing, and it is actually implemented in Array#sum. Before this algorithm gained attention, I remember the prevailing opinion was that it should be written explicitly, like ary.inject(0, &:+).
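For readers unfamiliar with compensated summation, the idea can be sketched in plain Ruby as below. This is an illustration of the Kahan-Babuska technique (in the form often attributed to Neumaier), not the code CRuby actually uses, and the method name compensated_sum is hypothetical.

```ruby
# Hypothetical illustration of Kahan-Babuska (Neumaier) compensated summation.
# A running compensation term collects the low-order bits that are lost
# each time a small and a large Float are added together.
def compensated_sum(values)
  sum = 0.0
  compensation = 0.0
  values.each do |x|
    t = sum + x
    if sum.abs >= x.abs
      compensation += (sum - t) + x   # low-order bits of x were lost
    else
      compensation += (x - t) + sum   # low-order bits of sum were lost
    end
    sum = t
  end
  sum + compensation
end

values = [1.0, 1e100, 1.0, -1e100]
values.inject(0.0, :+)    # => 0.0 (both 1.0 terms are absorbed by 1e100)
compensated_sum(values)   # => 2.0
```

The naive left-to-right sum loses the two small terms entirely, while the compensated version recovers them.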

For now, you may want to try using https://github.com/red-data-tools/enumerable-statistics to get a better idea of what you actually need.
