Project

General

Profile

Actions

Bug #13136

closed

large_array.sample(11)が遅い

Added by tompng (tomoya ishida) about 7 years ago. Updated about 7 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
[ruby-dev:49956]

Description

Array#sampleのパフォーマンスを改善したい

require 'benchmark'
arr = 100000.times.to_a;
Benchmark.measure{100000.times{arr.sample 10}}.real #=> 0.07052100007422268 速い!
Benchmark.measure{100000.times{arr.sample 11}}.real #=> 37.8880900000222 遅い!!!

現状のArray#sample(n)

n<=3 → 個別場合分け
3<n<=10 → 2重ループ O(n^2)
# n < 適当な閾値 → O(n)のアルゴリズムを使いたい
n>10 → dupしてshuffle O(size) 巨大配列から少数sampleする時に遅い

アルゴリズム改善案について

# n>10の時の現状のアルゴリズム O(size)
def sample arr, n
  arr = arr.dup
  n.times do |i|
    j = rand(arr.size - i) + i
    arr[i], arr[j] = arr[j], arr[i]
  end
  arr.take n
end

# dupをやめ、配列への代入を減らす(ただし元配列を破壊してしまう) O(n)
def sample arr, n
  n.times.map do |i|
    j = rand(arr.size - i) + i
    val = arr[j]
    arr[j] = arr[i]
    val
  end
end

# 配列を書き換える代わりに、どこを入れ替えたかhashに保存する(元配列を破壊しない) O(n)
def sample arr, n
  replace_table = {}
  n.times.map do |i|
    j = rand(arr.size - i) + i
    j2 = replace_table[j] || j
    replace_table[j] = replace_table[i] || i
    arr[j2]
  end
end

patch作りました

ベンチマーク

arr = 100000.times.to_a;
def measure;t=Time.now;yield;Time.now-t;end
p measure{10000.times{arr.sample 10}}
p measure{10000.times{arr.sample 11}}
p measure{10000.times{arr.sample 100}}
p measure{10000.times{arr.sample 1000}}

ruby 2.4.0p0

0.008185
3.938569
4.043285
4.142051

this patch

0.010408
0.016441
0.08861
0.684844

Files

rb_ary_sample.patch (1.42 KB) rb_ary_sample.patch tompng (tomoya ishida), 01/18/2017 01:20 PM
rb_ary_sample_patch2.patch (1.78 KB) rb_ary_sample_patch2.patch tompng (tomoya ishida), 01/19/2017 03:23 PM

Updated by nobu (Nobuyoshi Nakada) about 7 years ago

方針はいいと思います。
RAND_UPTO()randオブジェクトのメソッドを呼ぶ場合例外が起きる可能性があるので、st_free_table(table)rb_ensureにするとか、RARRAY_PTR_USE()を使われないとかしたほうがいいでしょう。

Updated by tompng (tomoya ishida) about 7 years ago

ありがとうございます
patch修正しました。

RAND_UPTOで例外が起きる対策でst_free_tableをrb_ensureに
RAND_UPTO中に配列のサイズが変わる対策としてRAND_UPTOをあらかじめ呼んだあとで配列のサイズ確認
sampleを継続できない程度にサイズが減っていた場合は(n<=10の場合と同様に)空配列を返す

Updated by nobu (Nobuyoshi Nakada) about 7 years ago

別案としてhashの代わりに配列で覚えておくという方法もありますが、元の配列に比例した作業領域を使うのでそのコストが高いかもしれません。

diff --git a/array.c b/array.c
index a19e7bb710..4294de640a 100644
--- a/array.c
+++ b/array.c
@@ -131,6 +131,11 @@ static ID id_cmp, id_div, id_power;
 
 #define ARY_SET(a, i, v) RARRAY_ASET((assert(!ARY_SHARED_P(a)), (a)), (i), (v))
 
+#define tmpbuf(n, size) rb_str_tmp_new((n)*(size))
+#define tmpbuf_discard(s) (rb_str_resize((s), 0L), RBASIC_SET_CLASS_RAW(s, rb_cString))
+#define tmpary(n) rb_ary_tmp_new(n)
+#define tmpary_discard(a) (ary_discard(a), RBASIC_SET_CLASS_RAW(a, rb_cArray))
+
 void
 rb_mem_clear(register VALUE *mem, register long size)
 {
@@ -4797,6 +4802,7 @@ rb_ary_sample(int argc, VALUE *argv, VALUE ary)
     VALUE opts, randgen = rb_cRandom;
     long n, len, i, j, k, idx[10];
     long rnds[numberof(idx)];
+    long memo_threshold;
 
     if (OPTHASH_GIVEN_P(opts)) {
 	VALUE rnd;
@@ -4856,6 +4862,11 @@ rb_ary_sample(int argc, VALUE *argv, VALUE ary)
 	}
 	return rb_ary_new_from_args(3, RARRAY_AREF(ary, i), RARRAY_AREF(ary, j), RARRAY_AREF(ary, k));
     }
+    memo_threshold =
+	len < 2560 ? len / 128 :
+	len < 5120 ? len / 64 :
+	len < 10240 ? len / 32 :
+	len / 16;
     if (n <= numberof(idx)) {
 	long sorted[numberof(idx)];
 	sorted[0] = idx[0] = rnds[0];
@@ -4875,6 +4886,24 @@ rb_ary_sample(int argc, VALUE *argv, VALUE ary)
 	    }
 	});
     }
+    else if (n <= memo_threshold) {
+	VALUE tbl = tmpbuf(len, sizeof(long));
+	long *table = (long*)RSTRING_PTR(tbl);
+	result = rb_ary_new_capa(n);
+	for (i=0; i<n; i++) {
+	    table[i] = -1L;
+	}
+	for (i=0; i<n; i++) {
+	    long v;
+	    long j2 = j = RAND_UPTO(len-i) + i;
+	    long i2 = i;
+	    if ((v = table[i]) >= 0) i2 = v;
+	    if ((v = table[j]) >= 0) j2 = v;
+	    table[j] = i2;
+	    RARRAY_ASET(result, i, RARRAY_AREF(ary, j2));
+	}
+	tmpbuf_discard(tbl);
+    }
     else {
 	result = rb_ary_dup(ary);
 	RBASIC_CLEAR_CLASS(result);
@@ -4955,11 +4984,6 @@ rb_ary_cycle(int argc, VALUE *argv, VALUE ary)
     return Qnil;
 }
 
-#define tmpbuf(n, size) rb_str_tmp_new((n)*(size))
-#define tmpbuf_discard(s) (rb_str_resize((s), 0L), RBASIC_SET_CLASS_RAW(s, rb_cString))
-#define tmpary(n) rb_ary_tmp_new(n)
-#define tmpary_discard(a) (ary_discard(a), RBASIC_SET_CLASS_RAW(a, rb_cArray))
-
 /*
  * Build a ruby array of the corresponding values and yield it to the
  * associated block.
Actions #4

Updated by nobu (Nobuyoshi Nakada) about 7 years ago

  • Status changed from Open to Closed

Applied in changeset r57380.


array.c: improve Array#sample

  • array.c (rb_ary_sample): improve performance when many samples
    from a large array. based on the patch by tomoya ishida
    in [ruby-dev:49956]. [Bug #13136]
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0