Bug #19875 (Open)
Ruby 3.0 -> 3.1 Performance regression in String#count
Description
String#count became slower since Ruby 3.1. Originally found by @Freaky: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781
Compared using the benchmark-driver gem.
$ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6'
Calculating -------------------------------------
3.1.1 3.1.4 2.7.2 3.2.2 3.0.6
count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s
Comparison:
count
2.7.2: 865.8 i/s
3.0.6: 857.4 i/s - 1.01x slower
3.1.1: 465.8 i/s - 1.86x slower
3.1.4: 463.7 i/s - 1.87x slower
3.2.2: 462.7 i/s - 1.87x slower
Benchmark:
$ cat ./tmp/string_count_benchmark_driver.yml
loop_count: 10_000
prelude: |
  html = "\nruby\n" * 1024 * 1024
benchmark:
  count: html.count($/)
Initially, I noticed the difference between str.count($/) and str.lines.size when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Status changed from Open to Feedback
It seems compiler-dependent.
My results on x86_64-darwin are:
$ ruby -I ./benchmark/benchmark-driver/lib/ benchmark/benchmark-driver/exe/benchmark-driver ~/tmp/string_count_benchmark_driver.yml --executables=/opt/local/bin/ruby{2.7,3.{0,1,2,3}}
Calculating -------------------------------------
/opt/local/bin/ruby2.7 /opt/local/bin/ruby3.0 /opt/local/bin/ruby3.1 /opt/local/bin/ruby3.2 /opt/local/bin/ruby3.3
count 171.134 334.699 668.206 666.938 666.382 i/s - 10.000k times in 58.433576s 29.877601s 14.965446s 14.993903s 15.006408s
Comparison:
count
/opt/local/bin/ruby3.1: 668.2 i/s
/opt/local/bin/ruby3.2: 666.9 i/s - 1.00x slower
/opt/local/bin/ruby3.3: 666.4 i/s - 1.00x slower
/opt/local/bin/ruby3.0: 334.7 i/s - 2.00x slower
/opt/local/bin/ruby2.7: 171.1 i/s - 3.90x slower
Updated by iz (Illia Zub) about 1 year ago
- Subject changed from Ruby 2.7 -> 3.1 Performance regression in String#count to Ruby 3.0 -> 3.1 Performance regression in String#count
Thanks for your feedback, @nobu (Nobuyoshi Nakada)! It does look compiler-dependent. I used x86_64-linux.
Could you please help me find the root cause? (I'll do my own research, just asking for starting points.)
Updated by byroot (Jean Boussier) about 1 year ago
Could you please help me find the root cause?
A start would be to decompile rb_str_count on two versions.
@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.
Decompiling could confirm that, however it's unclear how it could be fixed.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
byroot (Jean Boussier) wrote in #note-4:
Could you please help me find the root cause?
A start would be to decompile rb_str_count on two versions.
Tried objdump --disassemble=rb_str_count on Ubuntu 22.04, but the shared library is stripped and has no symbols.
@alanwu (Alan Wu) guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.
Decompiling could confirm that, however it's unclear how it could be fixed.
Use ssize_t instead?
Or, it may depend on the variable size, not on the signedness.
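For reference, the two loop shapes under discussion can be reduced to a standalone sketch (this is illustrative, not the actual rb_str_count source; the counter declarations are the only difference between the two functions):

```c
#include <assert.h>
#include <stddef.h>

/* Counter is a 64-bit signed long, roughly the post-#4001 shape. */
static long count_with_long(const unsigned char *s, const unsigned char *send,
                            unsigned char c)
{
    long n = 0;
    while (s < send) {
        if (*s++ == c) n++;
    }
    return n;
}

/* Counter is a 32-bit int; some compiler versions auto-vectorize this
   variant but not the one above. */
static long count_with_int(const unsigned char *s, const unsigned char *send,
                           unsigned char c)
{
    int i = 0;
    while (s < send) {
        if (*s++ == c) i++;
    }
    return i;
}
```

Comparing the generated assembly of these two (e.g. on godbolt) for the compilers in question would isolate whether the counter's width or its signedness is what inhibits vectorization.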
Updated by Freaky (Thomas Hurst) about 1 year ago
- File rb_str_len.fast rb_str_len.fast added
- File rb_str_len.slow rb_str_len.slow added
- File revert-4001.patch revert-4001.patch added
Annotated disassemblies attached.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-6:
Annotated disassemblies attached.
Thank you.
These are all generated by the same compiler?
The "slow" version scans per WORD (= 2 bytes), while the "fast" version uses DWORD (= 4 bytes).
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- File rb_str_count.S rb_str_count.S added
Attaching the code generated by gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04).
It seems to vectorize 16 bytes at a time.
movdqu (%rdi), %xmm2 # MEM <vector(16) unsigned char> [(unsigned char *)_95], tmp383
addq $16, %rdi #, ivtmp.4747
Updated by Freaky (Thomas Hurst) about 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-7:
These are all generated by the same compiler?
Yes - FreeBSD clang version 14.0.5. I also tried 15.0.7, to no effect.
gcc 12.2.0 performs much better, though it too has a (more modest) performance regression from #4001:
'./ruby.gcc12.revert test.rb' ran
1.28 ± 0.01 times faster than './ruby.clang.revert test.rb'
1.61 ± 0.01 times faster than './ruby.gcc12.master test.rb'
4.38 ± 0.04 times faster than './ruby.clang.master test.rb'
The obvious solution works for both:
diff --git string.c string.c
index deeed4a12a..70644b2338 100644
--- string.c
+++ string.c
@@ -8455,6 +8455,13 @@ rb_str_count(int argc, VALUE *argv, VALUE str)
     s = RSTRING_PTR(str);
     if (!s || RSTRING_LEN(str) == 0) return INT2FIX(0);
     send = RSTRING_END(str);
+    if (RSTRING_LEN(str) < INT_MAX) {
+        i = 0;
+        while (s < send) {
+            if (*(unsigned char*)s++ == c) i++;
+        }
+        return INT2NUM(i);
+    }
     while (s < send) {
         if (*(unsigned char*)s++ == c) n++;
     }
'./ruby.gcc12.fix test.rb' ran
1.04 ± 0.01 times faster than './ruby.gcc12.revert test.rb'
1.32 ± 0.01 times faster than './ruby.clang.fix test.rb'
1.33 ± 0.02 times faster than './ruby.revert test.rb'
1.68 ± 0.01 times faster than './ruby.gcc12.master test.rb'
4.57 ± 0.03 times faster than './ruby.master test.rb'
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
'./ruby.bytecount test.rb' ran
3.73 ± 0.08 times faster than './ruby.gcc12.fix test.rb'
3.87 ± 0.08 times faster than './ruby.gcc12.revert test.rb'
4.94 ± 0.10 times faster than './ruby.clang.fix test.rb'
4.96 ± 0.11 times faster than './ruby.revert test.rb'
6.27 ± 0.13 times faster than './ruby.gcc12.master test.rb'
17.05 ± 0.36 times faster than './ruby.master test.rb'
Who needs a compiler to vectorise for you when you can just ... copy what someone smarter than you did.
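For anyone curious, the core idea is expressible even without intrinsics. Below is a hedged sketch of a portable SWAR byte counter in the spirit of bytecount's fallback path (illustrative only, not the code from the attached files; `swar_count_byte` is a made-up name, and `__builtin_popcountll` assumes GCC or clang):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count occurrences of byte c in s[0..len), 8 bytes at a time. */
static size_t swar_count_byte(const unsigned char *s, size_t len,
                              unsigned char c)
{
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    const uint64_t pat  = ones * c;      /* c broadcast into all 8 bytes */
    size_t n = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w, x, zeros;
        memcpy(&w, s + i, 8);            /* unaligned-safe load */
        x = w ^ pat;                     /* matching bytes become 0x00 */
        /* Set bit 7 of exactly those bytes of x that are zero; masking to
           7 bits first keeps carries from crossing byte boundaries. */
        zeros = ~(((x & low7) + low7) | x | low7);
        n += (size_t)__builtin_popcountll(zeros);
    }
    for (; i < len; i++)                 /* scalar tail */
        n += (s[i] == c);
    return n;
}
```

This only processes one machine word per iteration, so it will not match the SSE/AVX versions, but it vectorizes "by hand" on any 64-bit target without compiler cooperation.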
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Status changed from Feedback to Open
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount
But there seems to be no significant difference on x86_64-darwin.
Updated by Freaky (Thomas Hurst) about 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-11:
Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That one time I ported Rust's bytecount crate to C.
Thank you, tried with it.
https://github.com/nobu/ruby/tree/mm_bytecount
Thanks for trying it!
But there seems to be no significant difference on x86_64-darwin.
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.
There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.
You may want to discuss it with compiler porting teams.
Updated by Freaky (Thomas Hurst) about 1 year ago
- File bytecount.c bytecount.c added
nobu (Nobuyoshi Nakada) wrote in #note-13:
Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"?
It feels like a compiler's (or optimizer's) job.
If only!
I've added an AVX2 path and worked on getting runtime dispatch working on both clang and gcc. I've combined the result into a single file suitable to drop straight over the top of your missing/bytecount.c, which yields this on my Zen 3:
'./ruby test.rb' ran
22.37 ± 1.03 times faster than './ruby.master test.rb'
And this - using exactly the same binary - on an old K8 machine which lacks AVX2 and SSE4:
./ruby test.rb ran
2.97 ± 0.02 times faster than ./ruby.master test.rb
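The general shape of such a single-binary, runtime-dispatched counter looks roughly like the following. This is a hedged sketch of the technique (target attribute plus `__builtin_cpu_supports`), not the attached bytecount.c; `count_byte` and `count_avx2` are hypothetical names, and the GCC/clang-specific parts are guarded so the scalar path builds everywhere:

```c
#include <stddef.h>
#include <stdint.h>

static size_t count_scalar(const unsigned char *s, size_t len, unsigned char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == c);
    return n;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* The target attribute enables AVX2 for this one function, so the rest of
   the translation unit needs no -mavx2. */
__attribute__((target("avx2")))
static size_t count_avx2(const unsigned char *s, size_t len, unsigned char c)
{
    const __m256i pat = _mm256_set1_epi8((char)c);
    size_t n = 0, i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
        uint32_t eq = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, pat));
        n += (size_t)__builtin_popcount(eq);  /* one mask bit per match */
    }
    return n + count_scalar(s + i, len - i, c); /* tail */
}
#endif

/* Choose an implementation for the running CPU; the same binary works on
   machines with and without AVX2. */
size_t count_byte(const unsigned char *s, size_t len, unsigned char c)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return count_avx2(s, len, c);
#endif
    return count_scalar(s, len, c);
}
```

A real version would cache the dispatch decision (or use an ifunc resolver) rather than re-checking CPU features on every call.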
Updated by ahorek (Pavel Rosický) about 1 year ago
@Freaky (Thomas Hurst) here's an alternative approach: https://godbolt.org/z/zWhTYv8x5
Your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it, and MSVC can't support it at all.
Updated by Freaky (Thomas Hurst) about 1 year ago
ahorek (Pavel Rosický) wrote in #note-15:
@Freaky (Thomas Hurst) here's an alternative approach: https://godbolt.org/z/zWhTYv8x5
Your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it, and MSVC can't support it at all.
I think it's too much of a maintenance burden to reimplement this stuff from first principles when there's compiler support on most systems right there. It can just fall back to scalar versions where it isn't supported.
Maybe it's also doable to support AVX2/SSE2 in such cases.
Updated by Freaky (Thomas Hurst) about 1 year ago
I've updated fast-bytecount with autotools support, which detects the needed compiler features and exposes flags to disable bits of it.
Perhaps there are other places in MRI that would benefit from some of this, which would help amortize the cost somewhat.
Updated by ahorek (Pavel Rosický) about 1 year ago
I think if the platform support is sufficient, there are multiple places where this feature could be beneficial. Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions. You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.
Could you prepare a proof-of-concept PR for the Ruby repo?
Also, see the previous discussion in #16487; there are already other existing examples where this feature could help.
Updated by Freaky (Thomas Hurst) about 1 year ago
ahorek (Pavel Rosický) wrote in #note-18:
I think if the platform support is sufficient, there are multiple places where this feature could be beneficial.
I think the platform support covers the major bases. I'm testing a PHP build with ifunc and target support enabled, after it was disabled for FreeBSD in 2018 and nobody thought to try turning it back on.
Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions.
Yes - __attribute__((target_clones(...))) makes this particularly easy to do -- the difficulty lies in using it appropriately.
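For readers unfamiliar with the attribute, a minimal sketch of target_clones usage follows (hypothetical function name; the feature-test guard is an assumption so the code degrades to a plain function where the attribute or ifunc support is missing):

```c
#include <stddef.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* On x86-64 ELF targets with compiler support, this emits a baseline clone,
   an AVX2 clone, and an ifunc resolver that picks one at load time.
   Elsewhere it expands to nothing and a single plain function is built. */
#if defined(__x86_64__) && defined(__ELF__) && __has_attribute(target_clones)
#define COUNT_CLONES __attribute__((target_clones("default", "avx2")))
#else
#define COUNT_CLONES
#endif

COUNT_CLONES
size_t count_byte(const unsigned char *s, size_t len, unsigned char c)
{
    /* Deliberately plain loop: the compiler auto-vectorizes the AVX2
       clone; no intrinsics or hand dispatch are needed. */
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == c);
    return n;
}
```

The appeal is exactly what's described above: the scalar source stays readable, and binary compatibility with older CPUs is preserved because clone selection happens at program load.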
You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.
It would be interesting to see some benchmarks comparing Ruby (and other things) targeting the baseline x86-64 profile against the -v2, -v3, and -v4 profiles.
Could you prepare a proof-of-concept PR for the Ruby repo?
This is what I'm working towards, yes.
Also, see the previous discussion in #16487; there are already other existing examples where this feature could help.
Thanks.