Bug #19875 (Open)
Ruby 3.0 -> 3.1 Performance regression in String#count
Description
String#count became slower since Ruby 3.1. Originally found by @Freaky: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781
Compared using the benchmark-driver gem.
$ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6'
Calculating -------------------------------------
3.1.1 3.1.4 2.7.2 3.2.2 3.0.6
count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s
Comparison:
count
2.7.2: 865.8 i/s
3.0.6: 857.4 i/s - 1.01x slower
3.1.1: 465.8 i/s - 1.86x slower
3.1.4: 463.7 i/s - 1.87x slower
3.2.2: 462.7 i/s - 1.87x slower
Benchmark:
$ cat ./tmp/string_count_benchmark_driver.yml
loop_count: 10_000
prelude: |
  html = "\nruby\n" * 1024 * 1024
benchmark:
  count: html.count($/)
Initially, I noticed the difference between str.count($/) and str.lines.size when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Status changed from Open to Feedback
It seems compiler-dependent.
My results on x86_64-darwin are:
$ ruby -I ./benchmark/benchmark-driver/lib/ benchmark/benchmark-driver/exe/benchmark-driver ~/tmp/string_count_benchmark_driver.yml --executables=/opt/local/bin/ruby{2.7,3.{0,1,2,3}}
Calculating -------------------------------------
/opt/local/bin/ruby2.7 /opt/local/bin/ruby3.0 /opt/local/bin/ruby3.1 /opt/local/bin/ruby3.2 /opt/local/bin/ruby3.3
count 171.134 334.699 668.206 666.938 666.382 i/s - 10.000k times in 58.433576s 29.877601s 14.965446s 14.993903s 15.006408s
Comparison:
count
/opt/local/bin/ruby3.1: 668.2 i/s
/opt/local/bin/ruby3.2: 666.9 i/s - 1.00x slower
/opt/local/bin/ruby3.3: 666.4 i/s - 1.00x slower
/opt/local/bin/ruby3.0: 334.7 i/s - 2.00x slower
/opt/local/bin/ruby2.7: 171.1 i/s - 3.90x slower
Updated by iz (Illia Zub) about 1 year ago
- Subject changed from Ruby 2.7 -> 3.1 Performance regression in String#count to Ruby 3.0 -> 3.1 Performance regression in String#count
Thanks for your feedback, @nobu (Nobuyoshi Nakada)! It does look compiler-dependent. I used x86_64-linux.
Could you please help me find the root cause? (I'll do my own research, just asking for starting points.)
Updated by byroot (Jean Boussier) about 1 year ago
Could you please help me find the root cause?
A start would be to decompile rb_str_count on two versions.
@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.
Decompiling could confirm that, however it's unclear how it could be fixed.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
byroot (Jean Boussier) wrote in #note-4:
Could you please help me find the root cause?
A start would be to decompile rb_str_count on two versions.
Tried objdump --disassemble=rb_str_count on Ubuntu 22.04, but the shared library is stripped and has no symbols.
@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.
Decompiling could confirm that, however it's unclear how it could be fixed.
Use ssize_t instead?
Or, it may depend on the variable size, not on the signedness.
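For reference, the two loop shapes under discussion can be reduced to a standalone sketch (this is illustrative, not the actual rb_str_count source; the counter declarations are the only difference between the two functions):

```c
#include <assert.h>
#include <stddef.h>

/* Counter is a 64-bit signed long, roughly the post-#4001 shape. */
static long count_with_long(const unsigned char *s, const unsigned char *send,
                            unsigned char c)
{
    long n = 0;
    while (s < send) {
        if (*s++ == c) n++;
    }
    return n;
}

/* Counter is a 32-bit int; some compiler versions auto-vectorize this
   variant but not the one above. */
static long count_with_int(const unsigned char *s, const unsigned char *send,
                           unsigned char c)
{
    int i = 0;
    while (s < send) {
        if (*s++ == c) i++;
    }
    return i;
}
```

Comparing the generated assembly of these two (e.g. on godbolt) for the compilers in question would isolate whether the counter's width or its signedness is what inhibits vectorization.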
Updated by Freaky (Thomas Hurst) about 1 year ago
- File rb_str_len.fast rb_str_len.fast added
- File rb_str_len.slow rb_str_len.slow added
- File revert-4001.patch revert-4001.patch added
Annotated disassemblies attached.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-6:
Annotated disassemblies attached.
Thank you.
These are all generated by the same compiler?
The "slow" version scans per WORD (= 2 bytes), while the "fast" version uses DWORD (= 4 bytes).
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- File rb_str_count.S rb_str_count.S added
Attaching the code generated by gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04).
It seems to vectorize 16 bytes at a time.
movdqu (%rdi), %xmm2 # MEM <vector(16) unsigned char> [(unsigned char *)_95], tmp383
addq $16, %rdi #, ivtmp.4747
Updated by Freaky (Thomas Hurst) about 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-7:
These are all generated by the same compiler?
Yes - FreeBSD clang version 14.0.5. I also tried 15.0.7, to no effect.
gcc 12.2.0 performs much better, though it too has a (more modest) performance regression from #4001:
'./ruby.gcc12.revert test.rb' ran
1.28 ± 0.01 times faster than './ruby.clang.revert test.rb'
1.61 ± 0.01 times faster than './ruby.gcc12.master test.rb'
4.38 ± 0.04 times faster than './ruby.clang.master test.rb'
The obvious solution works for both:
diff --git string.c string.c
index deeed4a12a..70644b2338 100644
--- string.c
+++ string.c
@@ -8455,6 +8455,13 @@ rb_str_count(int argc, VALUE *argv, VALUE str)
     s = RSTRING_PTR(str);
     if (!s || RSTRING_LEN(str) == 0) return INT2FIX(0);
     send = RSTRING_END(str);
+    if (RSTRING_LEN(str) < INT_MAX) {
+        i = 0;
+        while (s < send) {
+            if (*(unsigned char*)s++ == c) i++;
+        }
+        return INT2NUM(i);
+    }
     while (s < send) {
         if (*(unsigned char*)s++ == c) n++;
     }
'./ruby.gcc12.fix test.rb' ran
1.04 ± 0.01 times faster than './ruby.gcc12.revert test.rb'
1.32 ± 0.01 times faster than './ruby.clang.fix test.rb'
1.33 ± 0.02 times faster than './ruby.revert test.rb'
1.68 ± 0.01 times faster than './ruby.gcc12.master test.rb'
4.57 ± 0.03 times faster than './ruby.master test.rb'
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
'./ruby.bytecount test.rb' ran
3.73 ± 0.08 times faster than './ruby.gcc12.fix test.rb'
3.87 ± 0.08 times faster than './ruby.gcc12.revert test.rb'
4.94 ± 0.10 times faster than './ruby.clang.fix test.rb'
4.96 ± 0.11 times faster than './ruby.revert test.rb'
6.27 ± 0.13 times faster than './ruby.gcc12.master test.rb'
17.05 ± 0.36 times faster than './ruby.master test.rb'
Who needs a compiler to vectorise for you when you can just ... copy what someone smarter than you did.
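For anyone curious, the core idea is expressible even without intrinsics. Below is a hedged sketch of a portable SWAR byte counter in the spirit of bytecount's fallback path (illustrative only, not the code from the attached files; `swar_count_byte` is a made-up name, and `__builtin_popcountll` assumes GCC or clang):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count occurrences of byte c in s[0..len), 8 bytes at a time. */
static size_t swar_count_byte(const unsigned char *s, size_t len,
                              unsigned char c)
{
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    const uint64_t pat  = ones * c;      /* c broadcast into all 8 bytes */
    size_t n = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w, x, zeros;
        memcpy(&w, s + i, 8);            /* unaligned-safe load */
        x = w ^ pat;                     /* matching bytes become 0x00 */
        /* Set bit 7 of exactly those bytes of x that are zero; masking to
           7 bits first keeps carries from crossing byte boundaries. */
        zeros = ~(((x & low7) + low7) | x | low7);
        n += (size_t)__builtin_popcountll(zeros);
    }
    for (; i < len; i++)                 /* scalar tail */
        n += (s[i] == c);
    return n;
}
```

This only processes one machine word per iteration, so it will not match the SSE/AVX versions, but it vectorizes "by hand" on any 64-bit target without compiler cooperation.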
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Status changed from Feedback to Open
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount
But there seems to be no significant difference on x86_64-darwin.
Updated by Freaky (Thomas Hurst) about 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-11:
Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount
Thanks for trying it!
But there seems to be no significant difference on x86_64-darwin.
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.
There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.
You may want to discuss it with compiler porting teams.
Updated by Freaky (Thomas Hurst) about 1 year ago
- File bytecount.c bytecount.c added
nobu (Nobuyoshi Nakada) wrote in #note-13:
Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.
If only!
I've added an AVX2 path and worked on getting runtime dispatch working on both clang and gcc. I've combined the result into a single file suitable to drop straight over the top of your missing/bytecount.c, which yields this on my Zen 3:
'./ruby test.rb' ran
22.37 ± 1.03 times faster than './ruby.master test.rb'
And this - using exactly the same binary - on an old K8 machine which lacks AVX2 and SSE4:
./ruby test.rb ran
2.97 ± 0.02 times faster than ./ruby.master test.rb
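The general shape of such a single-binary, runtime-dispatched counter looks roughly like the following. This is a hedged sketch of the technique (target attribute plus `__builtin_cpu_supports`), not the attached bytecount.c; `count_byte` and `count_avx2` are hypothetical names, and the GCC/clang-specific parts are guarded so the scalar path builds everywhere:

```c
#include <stddef.h>
#include <stdint.h>

static size_t count_scalar(const unsigned char *s, size_t len, unsigned char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == c);
    return n;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* The target attribute enables AVX2 for this one function, so the rest of
   the translation unit needs no -mavx2. */
__attribute__((target("avx2")))
static size_t count_avx2(const unsigned char *s, size_t len, unsigned char c)
{
    const __m256i pat = _mm256_set1_epi8((char)c);
    size_t n = 0, i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
        uint32_t eq = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, pat));
        n += (size_t)__builtin_popcount(eq);  /* one mask bit per match */
    }
    return n + count_scalar(s + i, len - i, c); /* tail */
}
#endif

/* Choose an implementation for the running CPU; the same binary works on
   machines with and without AVX2. */
size_t count_byte(const unsigned char *s, size_t len, unsigned char c)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return count_avx2(s, len, c);
#endif
    return count_scalar(s, len, c);
}
```

A real version would cache the dispatch decision (or use an ifunc resolver) rather than re-checking CPU features on every call.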
Updated by ahorek (Pavel Rosický) about 1 year ago
@Freaky (Thomas Hurst) here's an alternative approach: https://godbolt.org/z/zWhTYv8x5
Your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it, and MSVC can't support it at all.
Updated by Freaky (Thomas Hurst) about 1 year ago
ahorek (Pavel Rosický) wrote in #note-15:
@Freaky (Thomas Hurst) here's an alternative approach: https://godbolt.org/z/zWhTYv8x5
Your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it, and MSVC can't support it at all.
I think it's too much of a maintenance burden to reimplement this stuff from first principles when there's compiler support on most systems right there. It can just fall back to scalar versions where it isn't supported.
Maybe it's also doable to support AVX2/SSE2 in such cases.
Updated by Freaky (Thomas Hurst) about 1 year ago
I've updated fast-bytecount with autotools support, which detects the needed compiler features and exposes flags to disable bits of it.
Perhaps there are other places in MRI that would benefit from some of this, which would help amortize the cost somewhat.
Updated by ahorek (Pavel Rosický) about 1 year ago
I think if the platform support is sufficient, there are multiple places where this feature could be beneficial. Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions. You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.
Could you prepare a proof-of-concept PR for the Ruby repo?
Also, see the previous discussion in #16487; there are already other existing examples where this feature could help.
Updated by Freaky (Thomas Hurst) about 1 year ago
ahorek (Pavel Rosický) wrote in #note-18:
I think if the platform support is sufficient, there are multiple places where this feature could be beneficial.
I think the platform support covers the major bases. I'm testing a PHP build with ifunc and target support enabled, after it was disabled for FreeBSD in 2018 and nobody thought to try turning it back on.
Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions.
Yes - __attribute__((target_clones(...))) makes this particularly easy to do -- the difficulty lies in using it appropriately.
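For readers unfamiliar with the attribute, a minimal sketch of target_clones usage follows (hypothetical function name; the feature-test guard is an assumption so the code degrades to a plain function where the attribute or ifunc support is missing):

```c
#include <stddef.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* On x86-64 ELF targets with compiler support, this emits a baseline clone,
   an AVX2 clone, and an ifunc resolver that picks one at load time.
   Elsewhere it expands to nothing and a single plain function is built. */
#if defined(__x86_64__) && defined(__ELF__) && __has_attribute(target_clones)
#define COUNT_CLONES __attribute__((target_clones("default", "avx2")))
#else
#define COUNT_CLONES
#endif

COUNT_CLONES
size_t count_byte(const unsigned char *s, size_t len, unsigned char c)
{
    /* Deliberately plain loop: the compiler auto-vectorizes the AVX2
       clone; no intrinsics or hand dispatch are needed. */
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == c);
    return n;
}
```

The appeal is exactly what's described above: the scalar source stays readable, and binary compatibility with older CPUs is preserved because clone selection happens at program load.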
You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.
It would be interesting to see some benchmarks comparing Ruby (and other things) targeting the baseline x86-64 profile against the -v2, -v3, and -v4 profiles.
Could you prepare a proof-of-concept PR for the Ruby repo?
This is what I'm working towards, yes.
Also, see the previous discussion in #16487; there are already other existing examples where this feature could help.
Thanks.