Project

General

Profile

Actions

Bug #20489

closed

Ractor behavior strange in ruby master

Added by nekoyama32767 (Jinsong Yu) 9 months ago. Updated 25 days ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 3.4.0dev (2024-05-14T01:58:31Z master 9d01f657b3) [x86_64-linux]
[ruby-dev:52086]

Description

This is a tarai program
Run./ruby tarai_ractor.rb 2 8 is to use 2 thread to run 8 times tarai function total, that means 4 times tarai for each ractor(thread).

GC.disable

def split_len(len, split)
    ret = []
    mod = len % split
    head = 0
    tail = 0
    split.times do |i|
        if head >= len
            break
        end
        k = 0
        if i < mod then k = 1 end
        tail = tail + (len/split) + k
        ret.append(head...tail)
        head = tail
    end
    return ret
end

def ary_split(ary, split)
    return split_len(ary.length,split)
end
def item_check(item)
    if item[0] != nil
        1 + item_check(item[0]) + item_check(item[1])
    else
        1
    end
end


def tarai(x, y, z) =
    x <= y ? y : tarai(tarai(x-1, y, z),
                       tarai(y-1, z, x),
                       tarai(z-1, x, y))
  
times = ARGV[0].to_i
split = ARGV[1].to_i

p split_len(times, split)
split_len(times, split).each.map do |sp|
    Ractor.new (sp) {
        s = _1
        s.each do 
         tarai(13, 7, 0) 
        end
    }
end.each(&:take)

The problem is in ruby 3.1.2 and ruby 3.3
./ruby tarai_ractor.rb 1 1 has simiular execute time with ./ruby tarai_ractor.rb 8 8 because each thread only run 1 time of tarai function, like follow:

ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:

time ruby exp_ractor_tarai.rb 1 1
[0...1]
<internal:ractor>:267: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

real    0m1.442s
user    0m1.429s
sys     0m0.014s

time ruby exp_ractor_tarai.rb 8 8
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
<internal:ractor>:267: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

real    0m1.857s
user    0m13.817s
sys     0m0.041s

But in ruby master(ruby 3.4.0dev)
ruby 3.4.0dev (2024-05-14T01:58:31Z master 9d01f657b3) [x86_64-linux]
1 ractor 1 tarai:

time ../ruby exp_ractor_tarai.rb 1 1

`RubyGems' were not loaded.
`error_highlight' was not loaded.
`did_you_mean' was not loaded.
`syntax_suggest' was not loaded.
[0...1]
exp_ractor_tarai.rb:47: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

real    0m1.671s
user    0m1.666s
sys     0m0.005s

8 ractor 8 tarai:

time ../ruby exp_ractor_tarai.rb 8 8

`RubyGems' were not loaded.
`error_highlight' was not loaded.
`did_you_mean' was not loaded.
`syntax_suggest' was not loaded.
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
exp_ractor_tarai.rb:47: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

real    0m18.408s
user    1m58.659s
sys     0m0.021s

And in ruby 3.4.0dev when run time ../ruby exp_ractor_tarai.rb 16 16 16 thread should be used in system monitoring while only 8 threads are used.
Ruby 3.3 and Ruby 3.1.2 do not have this problem.


Files

thead16_16.png (168 KB) thead16_16.png nekoyama32767 (Jinsong Yu), 05/14/2024 08:59 AM
thread16_8.png (165 KB) thread16_8.png nekoyama32767 (Jinsong Yu), 05/14/2024 09:00 AM
Screenshot 2025-01-08 at 4.02.23 PM.png (126 KB) Screenshot 2025-01-08 at 4.02.23 PM.png Graph of performance difference tenderlovemaking (Aaron Patterson), 01/09/2025 12:02 AM

Updated by ko1 (Koichi Sasada) 9 months ago

Thank you. I'll investigate it.

Actions #2

Updated by hsbt (Hiroshi SHIBATA) 5 months ago

  • Status changed from Open to Assigned
  • Assignee set to ko1 (Koichi Sasada)
Actions #3

Updated by luke-gru (Luke Gruber) 27 days ago

For me the runtime goes up a lot when I increase # of ractors when VM_CHECK_MODE > 0, but in non-debug mode it runs well. Also, right now the native thread # used for non-main ractors is capped at 8 because we don't actually probe the OS for the number of cores available. Were you running a debug build?

Actions #4

Updated by tenderlovemaking (Aaron Patterson) 27 days ago

  • Status changed from Assigned to Feedback

I also can't seem to reproduce this:

[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ruby -v test.rb 8 8
ruby 3.3.4 (2024-07-16 revision 425e468d25) [arm64-darwin23]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    5.97 secs    fish           external
   usr time   28.54 secs   78.00 micros   28.54 secs
   sys time    0.14 secs  736.00 micros    0.14 secs

[aaron@tc-lan-adapter ~/g/ruby (master)]$ chruby ruby-3.4.1
[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ruby -v test.rb 8 8
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    6.42 secs    fish           external
   usr time   28.45 secs    0.09 millis   28.45 secs
   sys time    0.09 secs    1.33 millis    0.08 secs

[aaron@tc-lan-adapter ~/g/ruby (master)]$ time ./ruby -v test.rb 8 8
ruby 3.5.0dev (2025-01-08T20:42:35Z master 96f23306f0) +PRISM [arm64-darwin24]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    6.04 secs    fish           external
   usr time   28.74 secs    0.10 millis   28.74 secs
   sys time    0.10 secs    1.31 millis    0.10 secs

Actions #5

Updated by tenderlovemaking (Aaron Patterson) 27 days ago

Sorry, I'm totally wrong. I misread the reproduction steps. I can reproduce this problem.

The way I tested is with this command on different Ruby tags:

hyperfine './miniruby -v ../test.rb 1 1' './miniruby -v ../test.rb 8 8'

I've attached a screenshot of a graph that compares performance. It looks like on Ruby 3.1.6 and 3.2.4, tarai(8, 8) is about ~40% slower, but on Ruby 3.3.6 it's about 10x slower, then on Ruby 3.4.1 it's about 5.5x slower. I will see if I can find the bottleneck.

Actions #6

Updated by tenderlovemaking (Aaron Patterson) 27 days ago

It looks like the VM loop is much slower in Ruby 3.5 than Ruby 3.2:

instruments screenshot

Actions #7

Updated by tenderlovemaking (Aaron Patterson) 27 days ago

The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.

Actions #8

Updated by nekoyama32767 (Jinsong Yu) 27 days ago · Edited

tenderlovemaking (Aaron Patterson) wrote in #note-7:

The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.

Replace tarai with a very long while loop(like while i < 100000000; i+=1; end;) can also produce same behavior. I also tried to profiling this loop version. VM loops take much more time when more than 2 ractors are running.

Actions #9

Updated by nekoyama32767 (Jinsong Yu) 27 days ago

tenderlovemaking (Aaron Patterson) wrote in #note-7:

The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.

Comment out rb_vm_insns_count increasment in vm_insnhelper.h can fix this problem. I think that this increasment may cause cpu cache miss, and finally slow down the VM loops.

Actions #10

Updated by nekoyama32767 (Jinsong Yu) 27 days ago

tenderlovemaking (Aaron Patterson) wrote in #note-7:

The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.

Actually, on a x86_64 linux virtual machine, 3.4 is ~14x slower. But I don't know why 3.4 become faster than 3.3 on arm64

Actions #11

Updated by tenderlovemaking (Aaron Patterson) 26 days ago

nekoyama32767 (Jinsong Yu) wrote in #note-10:

tenderlovemaking (Aaron Patterson) wrote in #note-7:

The regression in Ruby 3.3 came from this commit. It seems like rb_vm_insns_count is still in the master branch, so it could be impacting speed on 3.4 / 3.5, but I'm not sure. 3.4 is faster than 3.3, but not as fast as 3.2.

Actually, on a x86_64 linux virtual machine, 3.4 is ~14x slower. But I don't know why 3.4 become faster than 3.3 on arm64

Yes, I'm seeing the same thing on my x86 machine.

I tried changing the counter to a thread local, and that seems to improve speed on my x86.

Here is before:

$ time ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-08T20:42:35Z master 96f23306f0) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in   25.30 secs    fish           external
   usr time  152.10 secs  265.00 micros  152.10 secs
   sys time    0.01 secs  226.00 micros    0.01 secs

Here is after:

$ time ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-09T19:03:26Z tl 5bfb7753e3) +PRISM [x86_64-linux]
last_commit=move insn counting to a thread local
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    4.70 secs    fish           external
   usr time   19.08 secs  252.00 micros   19.08 secs
   sys time    0.00 secs  210.00 micros    0.00 secs

Unfortunately, it's still not as fast as tarai(1, 1):

$ time ./miniruby -v ../test.rb 1 1
ruby 3.5.0dev (2025-01-09T19:03:26Z tl 5bfb7753e3) +PRISM [x86_64-linux]
last_commit=move insn counting to a thread local
[0...1]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    2.30 secs    fish           external
   usr time    2.30 secs    8.34 millis    2.30 secs
   sys time    0.01 secs    4.17 millis    0.00 secs

This counter is for YJIT statistics, but from what I understand, YJIT is basically disabled when Ractors are enabled. To me, this means that putting the counter in a thread local is an acceptable change.

I will clean up the patch I made and then ask other people on the YJIT team about it.

Actions #12

Updated by luke-gru (Luke Gruber) 26 days ago

Ah I was running a non-yjit build. Good work guys, that's a weird bug for sure. The thread-local wouldn't work with ractors because native threads can run different ractors but if YJIT is disabled when you start a Ractor I don't see a problem with the solution, like you said.

Actions #13

Updated by nekoyama32767 (Jinsong Yu) 26 days ago

Unfortunately, it's still not as fast as tarai(1, 1):

This may cause by a native thread and ractor thread mapping setting. Set RUBY_MAX_CPU to a larger number, in my 12 physic core environment, result is as faster as tarai 1 1

Actions #14

Updated by tenderlovemaking (Aaron Patterson) 25 days ago

  • Status changed from Assigned to Closed

Applied in changeset git|50c2c4bddee5a2c5edc0d9fc0f635befdfc4bacc.


Make rb_vm_insns_count a thread local variable

rb_vm_insns_count is a global variable used for reporting YJIT
statistics. It is a counter that tallies the number of interpreter
instructions that have been executed, this way we can approximate how
much time we're spending in YJIT compared to the interpreter.

Unfortunately keeping this statistic means that every instruction
executed in the interpreter loop must increment the counter. Normally
this isn't a problem, but in multi-threaded situations (when Ractors are
used), incrementing this counter can become quite costly due to page
caching issues.

Additionally, since there is no locking when incrementing this global,
the count can't really make sense in a multi-threaded environment.

This commit changes rb_vm_insns_count to a thread local. That way each
Ractor has it's own copy of the counter and incrementing the counter
becomes quite cheap. Of course this means that in multi-threaded
situations, the value doesn't really make sense (but it didn't make
sense before because of the lack of locking).

The counter is used for YJIT statistics, and since YJIT is basically
disabled when Ractors are in use, I don't think we care about
inaccuracies (for the time being). We can revisit this counter when we
give YJIT multi-threading support, but for the time being this commit
restores multi-threaded performance.

To test this, I used the benchmark in [Bug #20489].

Here is the performance on Ruby 3.2:

$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    2.53 secs    fish           external
   usr time   19.86 secs  370.00 micros   19.86 secs
   sys time    0.02 secs  320.00 micros    0.02 secs

We can see the regression in performance on the master branch:

$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T16:22:26Z master 4a2702dafb) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in   24.87 secs    fish           external
   usr time  195.55 secs    0.00 micros  195.55 secs
   sys time    0.00 secs  716.00 micros    0.00 secs

Here are the stats after this commit:

$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T20:37:06Z tl 3ef0432779) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.

________________________________________________________
Executed in    2.46 secs    fish           external
   usr time   19.34 secs  381.00 micros   19.34 secs
   sys time    0.01 secs  321.00 micros    0.01 secs

[Bug #20489]

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0