Project

General

Profile

Actions

Bug #13136

closed

large_array.sample(11)が遅い

Added by tompng (tomoya ishida) over 7 years ago. Updated over 7 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
[ruby-dev:49956]

Description

Array#sampleのパフォーマンスを改善したい

require 'benchmark'
arr = 100000.times.to_a;
Benchmark.measure{100000.times{arr.sample 10}}.real #=> 0.07052100007422268 速い!
Benchmark.measure{100000.times{arr.sample 11}}.real #=> 37.8880900000222 遅い!!!

現状のArray#sample(n)

n<=3 → 個別場合分け
3<n<=10 → 2重ループ O(n^2)
# n < 適当な閾値 → O(n)のアルゴリズムを使いたい
n>10 → dupしてshuffle O(size) 巨大配列から少数sampleする時に遅い

アルゴリズム改善案について

# n>10の時の現状のアルゴリズム O(size)
def sample arr, n
  arr = arr.dup
  n.times do |i|
    j = rand(arr.size - i) + i
    arr[i], arr[j] = arr[j], arr[i]
  end
  arr.take n
end

# dupをやめ、配列への代入を減らす(ただし元配列を破壊してしまう) O(n)
def sample arr, n
  n.times.map do |i|
    j = rand(arr.size - i) + i
    val = arr[j]
    arr[j] = arr[i]
    val
  end
end

# 配列を書き換える代わりに、どこを入れ替えたかhashに保存する(元配列を破壊しない) O(n)
def sample arr, n
  replace_table = {}
  n.times.map do |i|
    j = rand(arr.size - i) + i
    j2 = replace_table[j] || j
    replace_table[j] = replace_table[i] || i
    arr[j2]
  end
end

patch作りました

ベンチマーク

arr = 100000.times.to_a;
def measure;t=Time.now;yield;Time.now-t;end
p measure{10000.times{arr.sample 10}}
p measure{10000.times{arr.sample 11}}
p measure{10000.times{arr.sample 100}}
p measure{10000.times{arr.sample 1000}}

ruby 2.4.0p0

0.008185
3.938569
4.043285
4.142051

this patch

0.010408
0.016441
0.08861
0.684844

Files

rb_ary_sample.patch (1.42 KB) rb_ary_sample.patch tompng (tomoya ishida), 01/18/2017 01:20 PM
rb_ary_sample_patch2.patch (1.78 KB) rb_ary_sample_patch2.patch tompng (tomoya ishida), 01/19/2017 03:23 PM
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0