Project

General

Profile

Actions

Bug #18572

closed

Performance regression when invoking refined methods

Added by palkan (Vladimir Dementyev) almost 3 years ago. Updated over 1 year ago.


Description

Since Ruby 3.0, defining a refinement for a method slows down its execution even if we do not activate the refinement:

require "benchmark_driver"

source = <<~RUBY
class Hash
  def symbolize_keys
    transform_keys { |key| key.to_sym rescue key }
  end
  def refined_symbolize_keys
    transform_keys { |key| key.to_sym rescue key }
  end
end
module HashRefinements
  refine Hash do
    def refined_symbolize_keys
      raise "never called"
    end
  end
end
HASH = {foo: 1, bar: 2, baz: 3}
class Foo
  def original
  end
  def refined
  end
end
module FooRefinements
  refine Foo do
    def refined
      raise "never called"
    end
  end
end
FOO = Foo.new
RUBY

Benchmark.driver do |x|
  x.prelude %Q{
    #{source}
  }
  x.report "#symbolize_keys original", %{ HASH.symbolize_keys }
  x.report "#symbolize_keys refined", %{ HASH.refined_symbolize_keys }
end

Benchmark.driver do |x|
  x.prelude %Q{
    #{source}
  }
  x.report "no-op original", %{ FOO.original }
  x.report "no-op refined", %{ FOO.refined }
end

The results for Ruby 3.1:

...

Comparison:
#symbolize_keys original:   2372420.1 i/s 
 #symbolize_keys refined:   1941019.0 i/s - 1.22x  slower

...

Comparison:
      no-op original:  51790974.2 i/s 
       no-op refined:  14456518.9 i/s - 3.58x  slower

For Ruby 2.6 and 2.7:

Comparison:
#symbolize_keys original:   2278339.7 i/s 
 #symbolize_keys refined:   2264153.1 i/s - 1.01x  slower

...

Comparison:
       no-op refined:  64178338.5 i/s 
      no-op original:  63357980.1 i/s - 1.01x  slower

You can find the full code and more results in this gist.

P.S. The problem was originally noticed by @byroot (Jean Boussier), see https://github.com/ruby-i18n/i18n/pull/573


Related issues 1 (0 open1 closed)

Related to Ruby master - Bug #14083: Refinement in block calling incorrect methodRejectedActions

Updated by shugo (Shugo Maeda) almost 3 years ago

  • Status changed from Open to Assigned
  • Assignee set to ko1 (Koichi Sasada)

It seems that the performance regression was introduced by https://github.com/ruby/ruby/commit/b9007b6c548f91e88fd3f2ffa23de740431fa969

$ cat test.rb
class Foo
  def original
  end
  def refined
  end
end
module FooRefinements
  refine Foo do
    def refined
      raise "never called"
    end
  end
end
FOO = Foo.new

t = Time.now
100000.times do
  FOO.refined
end
if Time.now - t > 0.007
  puts "slow"
  exit 1
else
  puts "fast"
  exit 0
end
$ rubyfarm-bisect -g 537a1cd5a97a8c5e93b64851abaab42812506f66 -b 546730b76b41b142240891cd1bbd7df7990d5239 -t
(snip)
b9007b6c548f91e88fd3f2ffa23de740431fa969 is the first bad commit
commit b9007b6c548f91e88fd3f2ffa23de740431fa969
Author: Koichi Sasada <ko1@atdot.net>
Date:   Wed Jan 8 16:14:01 2020 +0900

    Introduce disposable call-cache.

    This patch contains several ideas:

    (1) Disposable inline method cache (IMC) for race-free inline method cache
        * Making call-cache (CC) as a RVALUE (GC target object) and allocate new
          CC on cache miss.
        * This technique allows race-free access from parallel processing
          elements like RCU.
    (2) Introduce per-Class method cache (pCMC)
        * Instead of fixed-size global method cache (GMC), pCMC allows flexible
          cache size.
        * Caching CCs reduces CC allocation and allow sharing CC's fast-path
          between same call-info (CI) call-sites.
    (3) Invalidate an inline method cache by invalidating corresponding method
        entries (MEs)
        * Instead of using class serials, we set "invalidated" flag for method
          entry itself to represent cache invalidation.
        * Compare with using class serials, the impact of method modification
          (add/overwrite/delete) is small.
        * Updating class serials invalidate all method caches of the class and
          sub-classes.
        * Proposed approach only invalidate the method cache of only one ME.

    See [Feature #16614] for more details.

 class.c                                   |  45 +-
 common.mk                                 |   1 +
 compile.c                                 |  38 +-
 debug_counter.h                           |  92 ++--
 eval.c                                    |   2 +-
 ext/objspace/objspace.c                   |   1 +
 gc.c                                      | 204 +++++++-
 id_table.c                                |   2 +-
 insns.def                                 |  13 +-
 internal/class.h                          |   2 +
 internal/imemo.h                          |   4 +
 internal/vm.h                             |  41 +-
 iseq.c                                    |  17 +
 method.h                                  |  11 +-
 mjit.c                                    |  19 +-
 mjit.h                                    |  29 ++
 mjit_compile.c                            |  42 +-
 mjit_worker.c                             |  30 +-
 test/-ext-/tracepoint/test_tracepoint.rb  |  12 +-
 test/ruby/test_gc.rb                      |   3 +
 test/ruby/test_inlinecache.rb             |  64 +++
 tool/mk_call_iseq_optimized.rb            |   2 +-
 tool/ruby_vm/views/_mjit_compile_send.erb |  23 +-
 tool/ruby_vm/views/mjit_compile.inc.erb   |   2 +-
 vm.c                                      |  26 +-
 vm_callinfo.h                             | 235 ++++++++-
 vm_core.h                                 |   3 +-
 vm_dump.c                                 |   4 +-
 vm_eval.c                                 |  50 +-
 vm_insnhelper.c                           | 814 ++++++++++++++++--------------
 vm_insnhelper.h                           |  15 +-
 vm_method.c                               | 630 ++++++++++++++---------
 32 files changed, 1606 insertions(+), 870 deletions(-)
 create mode 100644 test/ruby/test_inlinecache.rb
bisect run success

Updated by palkan (Vladimir Dementyev) over 1 year ago

Checked Ruby 3.2—better than Ruby 3.1 but still noticeable:

Comparison:
    #symbolize_keys original:   4362026.0 i/s 
     #symbolize_keys refined:   3999755.6 i/s - 1.09x  slower
 

Comparison:
      no-op original:  80631273.5 i/s 
       no-op refined:  29181553.6 i/s - 2.76x  slower

Updated by Eregon (Benoit Daloze) over 1 year ago

Interesting, maybe this is partly the cause for the 13 seconds mentioned in https://shopify.engineering/the-case-against-monkey-patching.

Maybe we should deprecate refinements: my main concern about them is it's a ton of complexity in Ruby implementations (in method lookup and even worse in super method lookup) and it's not really used much or gains much, so it does not seem worth it. It does make every uncached/cache-filling method lookup slower.

It shouldn't make cached method lookup slower though when no refinements are active in that scope, at least it does not currently in TruffleRuby.

For context-independent JITed code there is no clear solution when using refinements, i.e., it might just be uncached lookups (i.e., several hashmap lookups) for calls affected by refinements (calls after using), which is terrible performance-wise. Maybe there is a complex solution to fix it but I'm not sure it's worth the complexity and it of course increases further the maintenance cost of refinements.

Updated by palkan (Vladimir Dementyev) over 1 year ago

Eregon (Benoit Daloze) wrote in #note-3:

Interesting, maybe this is partly the cause for the 13 seconds mentioned in https://shopify.engineering/the-case-against-monkey-patching.

As far as I understand, this line is responsible for "the 13 seconds" boot time overhead: https://github.com/ruby/ruby/blob/master/eval.c#L1342 (I was able to achieve similar results by adding tons of using to the source code). The line was added in the same commit as mentioned above, so the root cause is the same for sure.

Maybe we should deprecate refinements: my main concern about them is it's a ton of complexity in Ruby implementations (in method lookup and even worse in super method lookup) and it's not really used much or gains much, so it does not seem worth it. It does make every uncached/cache-filling method lookup slower.
It shouldn't make cached method lookup slower though when no refinements are active in that scope, at least it does not currently in TruffleRuby.

But that wasn't the case in Ruby <3; refined methods were fast (b/c cached), and had no effect on the original method calls, like today. I believe, something just got lost along the introduction of disposable call-caches.

The problem with "it's not really used much or gains much" is that it took many years for refinements to reach some sort of stability (2.6 and 2.7 releases were pretty good) and finally developers were convinced not to be afraid of the "experimentality" of the feature, and with Ruby 3 we made a huge step back. Especially now, when Shopify engineers claim that "refinements are really slow!" (which is only partially true), we, for sure, should expect the community to turn backs on refinements, again.

It's a two-edged sword 🤷‍♂️

Updated by Eregon (Benoit Daloze) over 1 year ago

palkan (Vladimir Dementyev) wrote in #note-4:

As far as I understand, this line is responsible for "the 13 seconds" boot time overhead: https://github.com/ruby/ruby/blob/master/eval.c#L1342 (I was able to achieve similar results by adding tons of using to the source code). The line was added in the same commit as mentioned above, so the root cause is the same for sure.

Oh yeah, good find, clearing all method lookup caches on every call to using with ObjectSpace.each_object is terrible for performance. It likely throws away all JITed code too.
A bit like extend used to blow up every method lookup inline cache a long time ago.
I guess we need to wait for @ko1 (Koichi Sasada) to fix that.

This is probably fixable but might require some changes to the overall method lookup design, which is then quite involved (from looking at similar things in TruffleRuby).
My concern above is mainly the whole complexity of refinements in Ruby implementations and some parts of the semantics are/were unsound (e.g. super in refinements can cause infinite loops, this is being addressed by preventing include/prepend in a refinement and replacing by import_methods, a welcome simplification).

Updated by alanwu (Alan Wu) over 1 year ago

Yes, the perf regression is unfortunate.

The current per-call-site cache (CC) in the interpreter is keyed on
[recever_class, call_target_method_entry_validity] -- the lexical scope that
is used to resolve refinement definition for each call site is not cached. The
way the system resolves calls to refinement definitions is through a special
type of method entry (VM_METHOD_TYPE_REFINED) that triggers a secondary
uncached lookup which takes into account the lexical scope of the call site. Any
refinement definition for a module/class M, even ones that are never
activated through Module#using, makes all calls to that method name in M
do this uncached secondary lookup. This is the issue the OP showed in the
no-op {original,refined} benchmark pair.

The fact that there is a full heap walk during every call to Module#using is
a separate issue. If you believe what I said about the secondary lookup always
happening for refined method, and believe that the secondary lookup is always
uncached, then this heap walk to invalidate all CCs should be unnecessary.
This was what I noticed when I sent https://github.com/ruby/ruby/pull/4323 in
2021. The change itself is simple and we can merge it if we can get consensus
that my assertions are correct.

Reworking the cache system to fix the regression for the benchmark in the OP
would be much more invovled.

Maybe we should deprecate refinements: my main concern about them is it's a
ton of complexity in Ruby implementations ...

Their semantics definitely make the minimum required implementation complexity fairly
high, and make cache design tricky. Even the current relatively simple cache
design had bugs (e.g. #17806) when it first landed. I don't think they should
be deprecated, though, since they've been around long enough that removing them
would make many users unhappy. However, I would advise against introducing any new
additional complications to their semantics (e.g. #16461, #19277). We've already seen
how this kind of complexity can make it more challenging for new developments like
Ractors and YJIT.

Actions #8

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Related to Bug #14083: Refinement in block calling incorrect method added

Updated by Eregon (Benoit Daloze) over 1 year ago

From https://bugs.ruby-lang.org/issues/14083#note-3 it seems part of the problem at least is CRuby currently implements what I would call invalid usages of refinements (different set of activates refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method which was refined once, even if there are no refinements active in the current scope.

Updated by shugo (Shugo Maeda) over 1 year ago

Eregon (Benoit Daloze) wrote in #note-9:

From https://bugs.ruby-lang.org/issues/14083#note-3 it seems part of the problem at least is CRuby currently implements what I would call invalid usages of refinements (different set of activates refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method which was refined once, even if there are no refinements active in the current scope.

If it's needed to fix the performance issue, it may be acceptable to raise exceptions in some cases (but I don't know the exact condition of such cases).
@ko1 (Koichi Sasada), @matz (Yukihiro Matsumoto) What do you think?

It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.

Updated by Eregon (Benoit Daloze) over 1 year ago

shugo (Shugo Maeda) wrote in #note-10:

It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.

It would still be possible to use using outside the module_eval (e.g. in a module body or at top-level), which seems more correct anyway since refinements should be per constant scope/cref, they are not per block.

Updated by shugo (Shugo Maeda) over 1 year ago

shugo (Shugo Maeda) wrote in #note-10:

Eregon (Benoit Daloze) wrote in #note-9:

From https://bugs.ruby-lang.org/issues/14083#note-3 it seems part of the problem at least is CRuby currently implements what I would call invalid usages of refinements (different set of activates refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method which was refined once, even if there are no refinements active in the current scope.

If it's needed to fix the performance issue, it may be acceptable to raise exceptions in some cases (but I don't know the exact condition of such cases).
@ko1 (Koichi Sasada), @matz (Yukihiro Matsumoto) What do you think?

It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.

I and @ko1 (Koichi Sasada) discussed it at today's developer meeting, and concluded that there is no need to change the behavior to fix the performance regression.
@Eregon (Benoit Daloze) Could you create another issue if you would like to change the behavior?

Actions #13

Updated by ko1 (Koichi Sasada) over 1 year ago

  • Status changed from Assigned to Closed

Applied in changeset git|cfd7729ce7a31c8b6ec5dd0e99c67b2932de4732.


use inline cache for refinements

From Ruby 3.0, refined method invocations are slow because
resolved methods are not cached by inline cache because of
conservertive strategy. However, using clears all caches
so that it seems safe to cache resolved method entries.

This patch caches resolved method entries in inline cache
and clear all of inline method caches when using is called.

fix [Bug #18572]

 # without refinements

class C
  def foo = :C
end

N = 1_000_000

obj = C.new
require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}

_END__
              user     system      total        real
master    0.362859   0.002544   0.365403 (  0.365424)
modified  0.357251   0.000000   0.357251 (  0.357258)
 # with refinment but without using

class C
  def foo = :C
end

module R
  refine C do
    def foo = :R
  end
end

N = 1_000_000

obj = C.new
require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}
__END__
               user     system      total        real
master     0.957182   0.000000   0.957182 (  0.957212)
modified   0.359228   0.000000   0.359228 (  0.359238)
 # with using

class C
  def foo = :C
end

module R
  refine C do
    def foo = :R
  end
end

N = 1_000_000

using R

obj = C.new
require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0