Bug #20489
closedRactor behavior strange in ruby master
Description
This is a tarai program
Run./ruby tarai_ractor.rb 2 8
is to use 2 thread to run 8 times tarai function total, that means 4 times tarai for each ractor(thread).
GC.disable
def split_len(len, split)
ret = []
mod = len % split
head = 0
tail = 0
split.times do |i|
if head >= len
break
end
k = 0
if i < mod then k = 1 end
tail = tail + (len/split) + k
ret.append(head...tail)
head = tail
end
return ret
end
def ary_split(ary, split)
return split_len(ary.length,split)
end
def item_check(item)
if item[0] != nil
1 + item_check(item[0]) + item_check(item[1])
else
1
end
end
def tarai(x, y, z) =
x <= y ? y : tarai(tarai(x-1, y, z),
tarai(y-1, z, x),
tarai(z-1, x, y))
times = ARGV[0].to_i
split = ARGV[1].to_i
p split_len(times, split)
split_len(times, split).each.map do |sp|
Ractor.new (sp) {
s = _1
s.each do
tarai(13, 7, 0)
end
}
end.each(&:take)
The problem is in ruby 3.1.2 and ruby 3.3
./ruby tarai_ractor.rb 1 1
has simiular execute time with ./ruby tarai_ractor.rb 8 8
because each thread only run 1 time of tarai function, like follow:
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
time ruby exp_ractor_tarai.rb 1 1
[0...1]
<internal:ractor>:267: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
real 0m1.442s
user 0m1.429s
sys 0m0.014s
time ruby exp_ractor_tarai.rb 8 8
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
<internal:ractor>:267: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
real 0m1.857s
user 0m13.817s
sys 0m0.041s
But in ruby master(ruby 3.4.0dev)
ruby 3.4.0dev (2024-05-14T01:58:31Z master 9d01f657b3) [x86_64-linux]
1 ractor 1 tarai:
time ../ruby exp_ractor_tarai.rb 1 1
`RubyGems' were not loaded.
`error_highlight' was not loaded.
`did_you_mean' was not loaded.
`syntax_suggest' was not loaded.
[0...1]
exp_ractor_tarai.rb:47: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
real 0m1.671s
user 0m1.666s
sys 0m0.005s
8 ractor 8 tarai:
time ../ruby exp_ractor_tarai.rb 8 8
`RubyGems' were not loaded.
`error_highlight' was not loaded.
`did_you_mean' was not loaded.
`syntax_suggest' was not loaded.
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
exp_ractor_tarai.rb:47: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
real 0m18.408s
user 1m58.659s
sys 0m0.021s
And in ruby 3.4.0dev when run time ../ruby exp_ractor_tarai.rb 16 16
16 thread should be used in system monitoring while only 8 threads are used.
Ruby 3.3 and Ruby 3.1.2 do not have this problem.
Files
Updated by ko1 (Koichi Sasada) 9 months ago
Thank you. I'll investigate it.
Updated by hsbt (Hiroshi SHIBATA) 5 months ago
- Status changed from Open to Assigned
- Assignee set to ko1 (Koichi Sasada)
Updated by luke-gru (Luke Gruber) 27 days ago
For me the runtime goes up a lot when I increase # of ractors when VM_CHECK_MODE > 0
, but in non-debug mode it runs well. Also, right now the native thread # used for non-main ractors is capped at 8 because we don't actually probe the OS for the number of cores available. Were you running a debug build?
Updated by tenderlovemaking (Aaron Patterson) 27 days ago
- Status changed from Assigned to Feedback
I also can't seem to reproduce this:
[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ruby -v test.rb 8 8
ruby 3.3.4 (2024-07-16 revision 425e468d25) [arm64-darwin23]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 5.97 secs fish external
usr time 28.54 secs 78.00 micros 28.54 secs
sys time 0.14 secs 736.00 micros 0.14 secs
[aaron@tc-lan-adapter ~/g/ruby (master)]$ chruby ruby-3.4.1
[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ruby -v test.rb 8 8
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 6.42 secs fish external
usr time 28.45 secs 0.09 millis 28.45 secs
sys time 0.09 secs 1.33 millis 0.08 secs
[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ./ruby -v test.rb 8 8
ruby 3.5.0dev (2025-01-08T20:42:35Z master 96f23306f0) +PRISM [arm64-darwin24]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 6.04 secs fish external
usr time 28.74 secs 0.10 millis 28.74 secs
sys time 0.10 secs 1.31 millis 0.10 secs
Updated by tenderlovemaking (Aaron Patterson) 27 days ago
- File Screenshot 2025-01-08 at 4.02.23 PM.png Screenshot 2025-01-08 at 4.02.23 PM.png added
- Status changed from Feedback to Assigned
Sorry, I'm totally wrong. I misread the reproduction steps. I can reproduce this problem.
The way I tested is with this command on different Ruby tags:
hyperfine './miniruby -v ../test.rb 1 1' './miniruby -v ../test.rb 8 8'
I've attached a screenshot of a graph that compares performance. It looks like on Ruby 3.1.6 and 3.2.4, tarai(8, 8)
is about ~40% slower, but on Ruby 3.3.6 it's about 10x slower, then on Ruby 3.4.1 it's about 5.5x slower. I will see if I can find the bottleneck.
Updated by tenderlovemaking (Aaron Patterson) 27 days ago
It looks like the VM loop is much slower in Ruby 3.5 than Ruby 3.2:
Updated by tenderlovemaking (Aaron Patterson) 27 days ago
The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count
is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.
Updated by nekoyama32767 (Jinsong Yu) 27 days ago · Edited
tenderlovemaking (Aaron Patterson) wrote in #note-7:
The regression in Ruby 3.3 came from this commit. It seems like
rb_vm_insns_count
is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.
Replace tarai
with a very long while
loop(like while i < 100000000; i+=1; end;
) can also produce same behavior. I also tried to profiling this loop version. VM loops take much more time when more than 2 ractors are running.
Updated by nekoyama32767 (Jinsong Yu) 27 days ago
tenderlovemaking (Aaron Patterson) wrote in #note-7:
The regression in Ruby 3.3 came from this commit. It seems like
rb_vm_insns_count
is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.
Comment out rb_vm_insns_count
increasment in vm_insnhelper.h
can fix this problem. I think that this increasment may cause cpu cache miss, and finally slow down the VM loops.
Updated by nekoyama32767 (Jinsong Yu) 27 days ago
tenderlovemaking (Aaron Patterson) wrote in #note-7:
The regression in Ruby 3.3 came from this commit. It seems like
rb_vm_insns_count
is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.
Actually, on a x86_64 linux virtual machine, 3.4 is ~14x slower. But I don't know why 3.4 become faster than 3.3 on arm64
Updated by tenderlovemaking (Aaron Patterson) 26 days ago
nekoyama32767 (Jinsong Yu) wrote in #note-10:
tenderlovemaking (Aaron Patterson) wrote in #note-7:
The regression in Ruby 3.3 came from this commit. It seems like
rb_vm_insns_count
is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.Actually, on a x86_64 linux virtual machine, 3.4 is ~14x slower. But I don't know why 3.4 become faster than 3.3 on arm64
Yes, I'm seeing the same thing on my x86 machine.
I tried changing the counter to a thread local, and that seems to improve speed on my x86.
Here is before:
$ time ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-08T20:42:35Z master 96f23306f0) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 25.30 secs fish external
usr time 152.10 secs 265.00 micros 152.10 secs
sys time 0.01 secs 226.00 micros 0.01 secs
Here is after:
$ time ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-09T19:03:26Z tl 5bfb7753e3) +PRISM [x86_64-linux]
last_commit=move insn counting to a thread local
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 4.70 secs fish external
usr time 19.08 secs 252.00 micros 19.08 secs
sys time 0.00 secs 210.00 micros 0.00 secs
Unfortunately, it's still not as fast as tarai(1, 1)
:
$ time ./miniruby -v ../test.rb 1 1
ruby 3.5.0dev (2025-01-09T19:03:26Z tl 5bfb7753e3) +PRISM [x86_64-linux]
last_commit=move insn counting to a thread local
[0...1]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 2.30 secs fish external
usr time 2.30 secs 8.34 millis 2.30 secs
sys time 0.01 secs 4.17 millis 0.00 secs
This counter is for YJIT statistics, but from what I understand, YJIT is basically disabled when Ractors are enabled. To me, this means that putting the counter in a thread local is an acceptable change.
I will clean up the patch I made and then ask other people on the YJIT team about it.
Updated by luke-gru (Luke Gruber) 26 days ago
Ah I was running a non-yjit build. Good work guys, that's a weird bug for sure. The thread-local wouldn't work with ractors because native threads can run different ractors but if YJIT is disabled when you start a Ractor I don't see a problem with the solution, like you said.
Updated by nekoyama32767 (Jinsong Yu) 26 days ago
Unfortunately, it's still not as fast as tarai(1, 1):
This may cause by a native thread and ractor thread mapping setting. Set RUBY_MAX_CPU
to a larger number, in my 12 physic core environment, result is as faster as tarai 1 1
Updated by tenderlovemaking (Aaron Patterson) 25 days ago
- Status changed from Assigned to Closed
Applied in changeset git|50c2c4bddee5a2c5edc0d9fc0f635befdfc4bacc.
Make rb_vm_insns_count a thread local variable
rb_vm_insns_count
is a global variable used for reporting YJIT
statistics. It is a counter that tallies the number of interpreter
instructions that have been executed, this way we can approximate how
much time we're spending in YJIT compared to the interpreter.
Unfortunately keeping this statistic means that every instruction
executed in the interpreter loop must increment the counter. Normally
this isn't a problem, but in multi-threaded situations (when Ractors are
used), incrementing this counter can become quite costly due to page
caching issues.
Additionally, since there is no locking when incrementing this global,
the count can't really make sense in a multi-threaded environment.
This commit changes rb_vm_insns_count
to a thread local. That way each
Ractor has it's own copy of the counter and incrementing the counter
becomes quite cheap. Of course this means that in multi-threaded
situations, the value doesn't really make sense (but it didn't make
sense before because of the lack of locking).
The counter is used for YJIT statistics, and since YJIT is basically
disabled when Ractors are in use, I don't think we care about
inaccuracies (for the time being). We can revisit this counter when we
give YJIT multi-threading support, but for the time being this commit
restores multi-threaded performance.
To test this, I used the benchmark in [Bug #20489].
Here is the performance on Ruby 3.2:
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 2.53 secs fish external
usr time 19.86 secs 370.00 micros 19.86 secs
sys time 0.02 secs 320.00 micros 0.02 secs
We can see the regression in performance on the master branch:
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T16:22:26Z master 4a2702dafb) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 24.87 secs fish external
usr time 195.55 secs 0.00 micros 195.55 secs
sys time 0.00 secs 716.00 micros 0.00 secs
Here are the stats after this commit:
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T20:37:06Z tl 3ef0432779) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 2.46 secs fish external
usr time 19.34 secs 381.00 micros 19.34 secs
sys time 0.01 secs 321.00 micros 0.01 secs
[Bug #20489]