Project

General

Profile

Actions

Bug #19875

open

Ruby 3.0 -> 3.1 Performance regression in String#count

Added by iz (Illia Zub) 9 months ago. Updated 8 months ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:114703]

Description

String#count became slower since Ruby 3.1. Originally found by @Freaky: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781

Compared using the benchmark-driver gem.

$ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6'                                                 
Calculating -------------------------------------
                          3.1.1       3.1.4       2.7.2       3.2.2       3.0.6
               count    465.804     463.741     865.783     462.711     857.395 i/s -     10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s

Comparison:
                            count
               2.7.2:       865.8 i/s 
               3.0.6:       857.4 i/s - 1.01x  slower
               3.1.1:       465.8 i/s - 1.86x  slower
               3.1.4:       463.7 i/s - 1.87x  slower
               3.2.2:       462.7 i/s - 1.87x  slower

Benchmark:

$ cat ./tmp/string_count_benchmark_driver.yml 
loop_count: 10_000
prelude: |
  html = "\nruby\n" * 1024 * 1024
benchmark:
  count: html.count($/)

Initially, I noticed the difference between str.count($/) and str.lines.size when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/


Files

rb_str_len.fast (31.9 KB) rb_str_len.fast rb_str_count disassembly (4001 reverted) Freaky (Thomas Hurst), 09/15/2023 04:19 AM
rb_str_len.slow (34 KB) rb_str_len.slow rb_str_count disassembly (master) Freaky (Thomas Hurst), 09/15/2023 04:19 AM
revert-4001.patch (1.71 KB) revert-4001.patch string.c 4001 reversion patch Freaky (Thomas Hurst), 09/15/2023 04:31 AM
rb_str_count.S (11.8 KB) rb_str_count.S nobu (Nobuyoshi Nakada), 09/15/2023 06:09 AM
bytecount.c (7.23 KB) bytecount.c missing/bytecount.c Freaky (Thomas Hurst), 09/19/2023 11:06 PM
Actions #1

Updated by iz (Illia Zub) 9 months ago

  • Description updated (diff)

Updated by nobu (Nobuyoshi Nakada) 9 months ago

  • Status changed from Open to Feedback

It seems compiler dependent.
My results on x86_64-darwin are:

$ ruby -I ./benchmark/benchmark-driver/lib/ benchmark/benchmark-driver/exe/benchmark-driver ~/tmp/string_count_benchmark_driver.yml --executables=/opt/local/bin/ruby{2.7,3.{0,1,2,3}}
Calculating -------------------------------------
                     /opt/local/bin/ruby2.7  /opt/local/bin/ruby3.0  /opt/local/bin/ruby3.1  /opt/local/bin/ruby3.2  /opt/local/bin/ruby3.3 
               count                171.134                 334.699                 668.206                 666.938                 666.382 i/s -     10.000k times in 58.433576s 29.877601s 14.965446s 14.993903s 15.006408s

Comparison:
                            count
/opt/local/bin/ruby3.1:       668.2 i/s 
/opt/local/bin/ruby3.2:       666.9 i/s - 1.00x  slower
/opt/local/bin/ruby3.3:       666.4 i/s - 1.00x  slower
/opt/local/bin/ruby3.0:       334.7 i/s - 2.00x  slower
/opt/local/bin/ruby2.7:       171.1 i/s - 3.90x  slower

Updated by iz (Illia Zub) 9 months ago

  • Subject changed from Ruby 2.7 -> 3.1 Performance regression in String#count to Ruby 3.0 -> 3.1 Performance regression in String#count

Thanks for your feedback, @nobu (Nobuyoshi Nakada)! It looks like compiler dependent. I used x86_64-linux.

May you please help me to find the root cause? (I'll do my own research, just asking for starting points. )

Updated by byroot (Jean Boussier) 9 months ago

May you please help me to find the root cause?

A start would be to decompile rb_str_count on two versions.

@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.

Decompiling could confirm that, however it's unclear how it could be fixed.

Updated by nobu (Nobuyoshi Nakada) 9 months ago

byroot (Jean Boussier) wrote in #note-4:

May you please help me to find the root cause?

A start would be to decompile rb_str_count on two versions.

Tried objdump --disassemble=rb_str_count on Ubuntu 22.04, but the shared library is stripped and has no symbols.

@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.

Decompiling could confirm that, however it's unclear how it could be fixed.

Use ssize_t instead?
Or, it may depend on the variable size, not on the signedness.

Updated by nobu (Nobuyoshi Nakada) 9 months ago

Freaky (Thomas Hurst) wrote in #note-6:

Annotated disassemblies attached.

Thank you.
These are all generated by the same compiler?
The "slow" version scans per WORD (= 2bytes), while the "fast" version uses DWORD (= 4bytes).

Updated by nobu (Nobuyoshi Nakada) 9 months ago

Attaching the generated code by gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04).

This seems vectorizing per 16 bytes.

	movdqu	(%rdi), %xmm2	# MEM <vector(16) unsigned char> [(unsigned char *)_95], tmp383
	addq	$16, %rdi	#, ivtmp.4747

Updated by Freaky (Thomas Hurst) 9 months ago

nobu (Nobuyoshi Nakada) wrote in #note-7:

These are all generated by the same compiler?

Yes - FreeBSD clang version 14.0.5. I also try 15.0.7 to no effect.

gcc 12.2.0 performs much better, though it too has a slightly more modest performance regression from 4001:

  './ruby.gcc12.revert test.rb' ran
    1.28 ± 0.01 times faster than './ruby.clang.revert test.rb'
    1.61 ± 0.01 times faster than './ruby.gcc12.master test.rb'
    4.38 ± 0.04 times faster than './ruby.clang.master test.rb'

The obvious solution works for both:

diff --git string.c string.c
index deeed4a12a..70644b2338 100644
--- string.c
+++ string.c
@@ -8455,6 +8455,13 @@ rb_str_count(int argc, VALUE *argv, VALUE str)
             s = RSTRING_PTR(str);
             if (!s || RSTRING_LEN(str) == 0) return INT2FIX(0);
             send = RSTRING_END(str);
+            if (RSTRING_LEN(str) < INT_MAX) {
+                i = 0;
+                while (s < send) {
+                    if (*(unsigned char*)s++ == c) i++;
+                }
+                return INT2NUM(i);
+            }
             while (s < send) {
                 if (*(unsigned char*)s++ == c) n++;
             }
  './ruby.gcc12.fix test.rb' ran
    1.04 ± 0.01 times faster than './ruby.gcc12.revert test.rb'
    1.32 ± 0.01 times faster than './ruby.clang.fix test.rb'
    1.33 ± 0.02 times faster than './ruby.revert test.rb'
    1.68 ± 0.01 times faster than './ruby.gcc12.master test.rb'
    4.57 ± 0.03 times faster than './ruby.master test.rb'

Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.

  './ruby.bytecount test.rb' ran
    3.73 ± 0.08 times faster than './ruby.gcc12.fix test.rb'
    3.87 ± 0.08 times faster than './ruby.gcc12.revert test.rb'
    4.94 ± 0.10 times faster than './ruby.clang.fix test.rb'
    4.96 ± 0.11 times faster than './ruby.revert test.rb'
    6.27 ± 0.13 times faster than './ruby.gcc12.master test.rb'
   17.05 ± 0.36 times faster than './ruby.master test.rb'

Who needs a compiler to vectorise for you when you can just ... copy what someone smarter than you did.

Actions #10

Updated by nobu (Nobuyoshi Nakada) 8 months ago

  • Status changed from Feedback to Open

Updated by nobu (Nobuyoshi Nakada) 8 months ago

Freaky (Thomas Hurst) wrote in #note-9:

Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.

Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount

But it seems no significant difference on x86_64-darwin.

Updated by Freaky (Thomas Hurst) 8 months ago

nobu (Nobuyoshi Nakada) wrote in #note-11:

Freaky (Thomas Hurst) wrote in #note-9:

Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.

Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount

Thanks for trying it!

But it seems no significant difference on x86_64-darwin.

I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.

There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.

Updated by nobu (Nobuyoshi Nakada) 8 months ago

Freaky (Thomas Hurst) wrote in #note-12:

I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.

Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.

There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.

You may want to discuss it with compiler porting teams.

Updated by Freaky (Thomas Hurst) 8 months ago

nobu (Nobuyoshi Nakada) wrote in #note-13:

Freaky (Thomas Hurst) wrote in #note-12:

I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.

Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.

If only!

I've added an AVX2 path and worked on getting runtime dispatch working on both clang and gcc. I've combined the result into a single file suitable to drop straight over the top of your missing/bytecount.c, which yields this on my Zen 3:

  './ruby test.rb' ran
   22.37 ± 1.03 times faster than './ruby.master test.rb'

And this - using exactly the same binary - on an old K8 machine which lacks AVX2 and SSE4:

  ./ruby test.rb ran
    2.97 ± 0.02 times faster than ./ruby.master test.rb

Updated by ahorek (Pavel Rosický) 8 months ago

@Freaky (Thomas Hurst) here's an alternative approach https://godbolt.org/z/zWhTYv8x5

your version with attribute((target)) is much cleaner, but some older compilers will most likely have problems with it and MSVC can't support it at all.

Updated by Freaky (Thomas Hurst) 8 months ago

ahorek (Pavel Rosický) wrote in #note-15:

@Freaky (Thomas Hurst) here's an alternative approach https://godbolt.org/z/zWhTYv8x5

your version with attribute((target)) is much cleaner, but some older compilers will most likely have problems with it and MSVC can't support it at all.

I think it's too much of a maintenance burden to reimplement this stuff from first principles when there's compiler support on most systems right there. It can just fall back to scalar versions where it isn't supported.

Maybe it's also doable to support AVX2/SSE2 in such cases.

Updated by Freaky (Thomas Hurst) 8 months ago

I've updated fast-bytecount with autotools support, which detects the needed compiler features and exposes flags to disable bits of it.

Perhaps there are other places in MRI that would benefit from some of this, which would help amortize the cost somewhat.

Updated by ahorek (Pavel Rosický) 8 months ago

I think if the platform support is sufficient, there are multiple places where this feature could be beneficial. Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions. You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.

could prepare a proof of concept PR for the Ruby repo?

also, see the previous discussion #16487 there are already other existing examples where this feature could help.

Updated by Freaky (Thomas Hurst) 8 months ago

ahorek (Pavel Rosický) wrote in #note-18:

I think if the platform support is sufficient, there are multiple places where this feature could be beneficial.

I think the platform support covers the major bases. I'm testing a PHP build with ifunc and target support enabled, after it was disabled for FreeBSD in 2018 and nobody thought to try turning it back on.

Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions.

Yes - __attribute__((target_clones(..)) makes this particularly easy to do -- the difficulty lies in using it appropriately.

You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.

It would be interesting to see some benchmarks comparing Ruby (and other things) targetting basline x86-64, and the -v2, -v3, and -v4 profiles.

could prepare a proof of concept PR for the Ruby repo?

This is what I'm working towards, yes.

also, see the previous discussion #16487 there are already other existing examples where this feature could help.

Thanks.

Actions

Also available in: Atom PDF

Like0
Like0Like1Like0Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0Like1Like0Like0Like0Like0Like1