Feature #11768


Add a polymorphic inline cache

Added by tenderlovemaking (Aaron Patterson) over 8 years ago. Updated about 7 years ago.

Status:
Closed
Target version:
-
[ruby-core:71815]

Description

Hi,

I've attached a patch that adds a PIC to the existing Mono IC struct.

I haven't run every benchmark that's checked in, but this patch speeds up the polymorphic call benchmark by about 20%. Here is the benchmark before my patch:

[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m3.244s
user	0m3.154s
sys	0m0.044s
[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m3.158s
user	0m3.090s
sys	0m0.042s
[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m3.162s
user	0m3.099s
sys	0m0.039s

Here it is with my patch applied:

[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m2.522s
user	0m2.455s
sys	0m0.044s
[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m2.515s
user	0m2.458s
sys	0m0.035s
[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_poly_method.rb 

real	0m2.637s
user	0m2.545s
sys	0m0.045s

Monomorphic call sites maintain the same performance:

Before:

[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.416s
user	0m1.371s
sys	0m0.032s
[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.456s
user	0m1.402s
sys	0m0.032s
[aaron@TC ruby (trunk)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.420s
user	0m1.372s
sys	0m0.032s

After:

[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.451s
user	0m1.399s
sys	0m0.033s
[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.494s
user	0m1.438s
sys	0m0.033s
[aaron@TC ruby (pic2)]$ time ./ruby benchmark/bm_vm2_method.rb 

real	0m1.466s
user	0m1.416s
sys	0m0.032s

The downside of this patch is that it increases memory usage, because the call cache struct gets larger even when the call site is monomorphic. I think we could make the cache expand and contract, but I'm not sure it's worthwhile. The other downside is that it will probably slow down calls if the global method state changes, but I don't think that is a situation we should optimize for.

I've actually attached two patches: one adds the PIC, and the other adds a tracepoint so that I could log cache hit/miss rates.


Files

0001-add-PIC.patch (2.9 KB) — tenderlovemaking (Aaron Patterson), 12/03/2015 07:25 PM
0002-add-a-tracepoint-for-PIC-hit-miss.patch (4.33 KB) — tenderlovemaking (Aaron Patterson), 12/03/2015 07:25 PM

Updated by tenderlovemaking (Aaron Patterson) over 8 years ago

  • Tracker changed from Bug to Feature

Updated by normalperson (Eric Wong) over 8 years ago

tenderlovemaking (Aaron Patterson) wrote:

I haven't run every benchmark that's checked in, but this patch speeds
up the polymorphic call benchmark by about 20%.

Nice!

The down side of this patch is that it increases memory usage because
the size of the call cache struct gets larger, even if the call site
is monomorphic. I think we could make the code expand and contract,
but I'm not sure if it's worthwhile.

Perhaps you can simplify the code so the existing monomorphic cache is
just: POLYMORPHIC_CACHE_SIZE == 1

By the way, how did you arrive at the value of 6?

Memory usage is a bigger problem for Ruby than speed, in my view;
so maybe a smaller value is a good compromise.

The other downside is that it will probably slow down calls if the
global method state changes, but I don't think that is a situation we
should optimize for.

Agreed. Did you measure startup performance?

Updated by tenderlovemaking (Aaron Patterson) over 8 years ago

On Thu, Dec 03, 2015 at 10:51:08PM +0000, Eric Wong wrote:

wrote:

I haven't run every benchmark that's checked in, but this patch speeds
up the polymorphic call benchmark by about 20%.

Nice!

The down side of this patch is that it increases memory usage because
the size of the call cache struct gets larger, even if the call site
is monomorphic. I think we could make the code expand and contract,
but I'm not sure if it's worthwhile.

Perhaps you can simplify the code so the existing monomorphic cache is
just: POLYMORPHIC_CACHE_SIZE == 1

Yes, I'd love to do that. The reason I didn't is because it looks like
there are a bunch of other places that allocate an rb_call_cache
directly and directly set the cached members.

I wanted to keep this patch as small as possible to a) figure out if we
want to do it (check benchmarks etc), and b) try to figure out what the
actual size should be.

If we do want to do this, then I'll go through and refactor the other
places so that we don't have the strange "one off" case.

By the way, how did you arrive at the value of 6?

I started with 3 because I read somewhere (I think it was here:
http://chrisseaton.com/phd/specialising-ruby.pdf ) that 3 is good. But
Charles Nutter told me they default the PIC on JRuby to 6, so I just
took it from them.

I'm not 100% confident in the best size, which is why I added the
tracepoint for logging. With a size of 3, nearly all callsites had a
99% hit rate in the Psych tests.

Memory usage is a bigger problem of Ruby than speed to me;
so maybe a smaller value is a compromise.

I think a smaller value would be fine. I can get some stats from
running the Rails tests and running our test suite at work. Maybe that
would give us a better idea, though I think maybe 3 would be good
enough?

The other downside is that it will probably slow down calls if the
global method state changes, but I don't think that is a situation we
should optimize for.

Agreed. Did you measure startup performance?

No, I haven't measured startup performance. I've run tests suites from
a few different projects and didn't notice any startup difference, but I
didn't test the time specifically. Do you have any suggestions for how
to best do that?

--
Aaron Patterson
http://tenderlovemaking.com/

Updated by normalperson (Eric Wong) over 8 years ago

Aaron Patterson wrote:

On Thu, Dec 03, 2015 at 10:51:08PM +0000, Eric Wong wrote:

Perhaps you can simplify the code so the existing monomorphic cache is
just: POLYMORPHIC_CACHE_SIZE == 1

Yes, I'd love to do that. The reason I didn't is because it looks like
there are a bunch of other places that allocate an rb_call_cache
directly and directly set the cached members.

I wanted to keep this patch as small as possible to a) figure out if we
want to do it (check benchmarks etc), and b) try to figure out what the
actual size should be.

Benchmarks will likely vary with the old cache in the way,
especially when accessing some fields requires pulling in extra
CPU cache lines.

No, I haven't measured startup performance. I've run tests suites from
a few different projects and didn't notice any startup difference, but I
didn't test the time specifically. Do you have any suggestions for how
to best do that?

I was recently annoyed at how long 'sequel' was taking to load
in scripts I was running. Even "sequel --version" takes 143ms on my
system. It went up to 146ms with your patch, so there's a tiny
regression there, but no deal breaker.

There's bigger wins to be gained in startup performance (AOT
compilation), using fstatat where available to reduce string
lengths and reduce in-kernel path resolution overhead, etc...

Updated by tenderlovemaking (Aaron Patterson) over 8 years ago

On Fri, Dec 04, 2015 at 02:28:57AM +0000, Eric Wong wrote:

Aaron Patterson wrote:

On Thu, Dec 03, 2015 at 10:51:08PM +0000, Eric Wong wrote:

Perhaps you can simplify the code so the existing monomorphic cache is
just: POLYMORPHIC_CACHE_SIZE == 1

Yes, I'd love to do that. The reason I didn't is because it looks like
there are a bunch of other places that allocate an rb_call_cache
directly and directly set the cached members.

I wanted to keep this patch as small as possible to a) figure out if we
want to do it (check benchmarks etc), and b) try to figure out what the
actual size should be.

Benchmarks will likely vary with the old cache in the way,
especially when accessing some fields requires pulling in extra
CPU cache lines.

No, I haven't measured startup performance. I've run tests suites from
a few different projects and didn't notice any startup difference, but I
didn't test the time specifically. Do you have any suggestions for how
to best do that?

I was recently annoyed at how long 'sequel' was taking a while to load
in scripts I was running. Even "sequel --version" takes 143ms on my
system. It went up to 146ms with your patch, so there's a tiny
regression there, but no deal breaker.

There's bigger wins to be gained in startup performance (AOT
compilation), using fstatat where available to reduce string
lengths and reduce in-kernel path resolution overhead, etc...

Probably OT for this list, but do you have sequel installed as a gem?
If so, can you try running against rubygems master? I've put some
patches on RubyGems master that should speed up loading gems. Maybe
it will help. :)

--
Aaron Patterson
http://tenderlovemaking.com/

Updated by ko1 (Koichi Sasada) over 8 years ago

Thank you for your patch.

Comments:

Implementation:

  • Did you run a big application with your patch? It seems to have a bug: cc->aux also depends on cc->call, so you need to duplicate the aux field as well.
  • Sliding (a FIFO strategy) is one good idea for a few entries. Did you try a ring buffer with a counter? LRU is another idea (though not as easy).

Evaluation:

  • Ideally, the results of vm2_method and vm2_poly_method should be the same. What accounts for the difference?

  • Please check the following counts:

    • the number of mono-calls and poly-calls in some applications, so we can understand the ratio. If the ratio is high, the patch will be valuable. If it is not so high, we need to consider a variable number of PIC entries (growing from 1 entry).
    • the ideal number of cache entries for each poly-call: list the method bodies for each call. With these counts, we can estimate the most valuable number of cache entries.
  • How does memory consumption grow?

    • In my measurements of a small Rails application with valgrind/massif, 1/3 of memory is used by iseq-related data, and 1/3 of that iseq data is consumed by call caches. If iseqs consume 30MB, then 10MB is consumed by call caches; with this patch, up to 60MB could be consumed by PIC entries.

Updated by normalperson (Eric Wong) over 8 years ago

Aaron Patterson wrote:

Probably OT for this list, but do you have sequel installed as a gem?
If so, can you try running against rubygems master? I've put some
patches on RubyGems master that should speed up loading gems. Maybe
it will help. :)

I don't consider it OT. I have sequel (4.29) installed as a gem; yes.

Since ruby-trunk r52666 ("Update to RubyGems 2.5.0+ HEAD(c6b4946)")
the only performance-related commit to rubygems master seems to be
commit a9c1aaffec0095081fd405ce78cfd689e937c1d7
("consolidate contains_requirable_file? cache")

I cloned git://github.com/rubygems/rubygems and ran "ruby setup.rb"
and did not notice a difference with "sequel --version".

The Linux perf tool says ~19% of the time is still in ruby_yyparse,
6% in malloc, and other hotspots (<5% each) are in various other
places (iseq/GC/free).

Updated by ko1 (Koichi Sasada) about 7 years ago

  • Status changed from Open to Closed

I am closing this issue. Please file another ticket if you find improvements in practical applications.
