Bug #9262

global_method_cache should be configurable or grow automatically

Added by Aman Gupta 4 months ago. Updated about 1 month ago.

[ruby-core:59198]
Status: Open
Priority: Normal
Assignee: Koichi Sasada
Category: -
Target version: next minor
ruby -v: -
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN

Description

The global_method_cache is currently fixed at 2048 entries. This is not configurable, and can be inadequate for large applications with thousands of classes and methods.

In our app, increasing the size of the cache to 32768 entries reduced time spent in vm_search_method and overall pressure on st_lookup:

before
  420  14.0%  st_lookup
  182   6.1%  vm_search_method (inline)

after
  265   9.5%  st_lookup
  125   4.5%  vm_search_method (inline)

It's possible the VM could grow global_method_cache automatically, using some heuristic based on the number of long-lived classes or method_entries that are defined.
However, this would break the hashing in the current implementation.

History

#1 Updated by Aman Gupta 4 months ago

funny-falcon proposed removing the global_method_cache altogether, in favor of a per-klass method cache:

https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache
https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache.diff

#2 Updated by Sam Saffron 4 months ago

Before measuring the changes proposed by funny-falcon, I wanted to present the sad state of affairs.

I applied this patch to gather method cache statistics: https://github.com/SamSaffron/ruby/commit/e6ecf6d8390b2e4036b91778268a7544f364b8f1

I then wrote some middleware to measure the impact of the method cache while browsing around Discourse in production mode:

class CacheStatMiddleware
  def initialize(app, config = {})
    @app = app
  end

  def call(env)
    start  = Time.new
    hits   = Kernel.cache_hits
    misses = Kernel.cache_misses
    time   = Kernel.cache_miss_time

    r = @app.call(env)
    lost  = Kernel.cache_miss_time - time
    total = ((Time.new - start) * 1000 * 1000).to_i

    pct = (lost.to_f / total.to_f) * 100

    puts "hits #{Kernel.cache_hits - hits} " <<
         "misses #{Kernel.cache_misses - misses} " <<
         "microsecs lost #{lost} #{pct.round(2)}% " <<
         "cache usage #{Kernel.cache_entries} / #{Kernel.cache_size} " <<
         "req duration #{total}"

    r
  end
end

config.middleware.insert 0, CacheStatMiddleware


Here are some results:

hits 74383 misses 5095 microsecs lost 4620 7.75% cache usage 2048 / 2048 req duration 59613

hits 74033 misses 5093 microsecs lost 4750 8.04% cache usage 2048 / 2048 req duration 59070

hits 73962 misses 5106 microsecs lost 4509 7.74% cache usage 2048 / 2048 req duration 58253

hits 677 misses 229 microsecs lost 393 6.23% cache usage 2048 / 2048 req duration 6305

hits 523 misses 35 microsecs lost 211 3.68% cache usage 2048 / 2048 req duration 5738

hits 549 misses 31 microsecs lost 279 3.58% cache usage 2048 / 2048 req duration 7804

hits 4412 misses 143 microsecs lost 97 3.14% cache usage 2048 / 2048 req duration 3088

hits 47 misses 18 microsecs lost 21 8.97% cache usage 2048 / 2048 req duration 234

hits 3405 misses 130 microsecs lost 152 4.88% cache usage 2048 / 2048 req duration 3115

hits 519 misses 158 microsecs lost 214 8.1% cache usage 2048 / 2048 req duration 2641

hits 114133 misses 14584 microsecs lost 14338 9.31% cache usage 2048 / 2048 req duration 154000

hits 44835 misses 2754 microsecs lost 2850 8.16% cache usage 2048 / 2048 req duration 34914

hits 75084 misses 8601 microsecs lost 8048 10.95% cache usage 2048 / 2048 req duration 73475


Conclusion:

For an app like Discourse, 3-10% of request time is spent looking up methods, due to cache inefficiency.

The out-of-the-box cache is simply too small to fit all the entries required, so it's thrashing.


If we raise the method cache to 16x its size we see:

hits 720 misses 5 microsecs lost 11 0.44% cache usage 17045 / 32768 req duration 2509

hits 40 misses 0 microsecs lost 0 0.0% cache usage 17061 / 32768 req duration 242

hits 53076 misses 266 microsecs lost 685 1.67% cache usage 17090 / 32768 req duration 41001

hits 47234 misses 238 microsecs lost 315 1.24% cache usage 17117 / 32768 req duration 25418

hits 53012 misses 281 microsecs lost 850 2.03% cache usage 17145 / 32768 req duration 41803

hits 720 misses 5 microsecs lost 15 0.65% cache usage 17163 / 32768 req duration 2308

hits 52353 misses 261 microsecs lost 605 1.63% cache usage 17190 / 32768 req duration 37031

hits 46785 misses 190 microsecs lost 429 1.32% cache usage 17207 / 32768 req duration 32392

hits 52893 misses 273 microsecs lost 610 1.71% cache usage 17240 / 32768 req duration 35571

hits 720 misses 5 microsecs lost 7 0.45% cache usage 17258 / 32768 req duration 1551

hits 52295 misses 261 microsecs lost 709 1.75% cache usage 17287 / 32768 req duration 40552

hits 46784 misses 191 microsecs lost 393 1.35% cache usage 17308 / 32768 req duration 29146

hits 86409 misses 2013 microsecs lost 2655 3.4% cache usage 17829 / 32768 req duration 78033

hits 644 misses 7 microsecs lost 22 0.64% cache usage 17843 / 32768 req duration 3458

hits 46729 misses 295 microsecs lost 423 1.73% cache usage 17864 / 32768 req duration 24385

hits 82777 misses 893 microsecs lost 1807 2.57% cache usage 17972 / 32768 req duration 70341

hits 46819 misses 191 microsecs lost 400 1.18% cache usage 17989 / 32768 req duration 33862

hits 82412 misses 865 microsecs lost 1426 2.12% cache usage 18099 / 32768 req duration 67163


Even when the cache has room, too many collisions are happening, forcing unneeded cache evictions and leaving about 2% of the performance on the table; still, this is significantly better than 2-10%.

This could possibly be fixed by introducing quadratic probing or perhaps a cuckoo-hashing-like algorithm.

The global method cache has a couple of advantages over a per-class cache, and a few disadvantages.

Advantages:

  1. Size is cleanly capped
  2. Very easy to gather stats and info about it
  3. Expiry semantics are fairly clean, can force a method_state increment when utilization is too high

Disadvantages:

  1. Extra hashing required to look up elements, you need the klass hash
  2. Extra storage required (klass_seq stored per element)

#3 Updated by Aman Gupta 4 months ago

I ran some benchmarks with funny-falcon's patch. Memory usage is increased by 5-10mb RSS in our app, and response time is improved by 4-8%.

before:

500 requests to https://github.com/dashboard as tmm1
peak memory: 349.4 MB RSS
cpu time: 7,171ms total (14ms avg/req, 13ms - 21ms)
500 requests to https://github.com/github/github/pull/17695 as tmm1
peak memory: 404.6 MB RSS
cpu time: 98,713ms total (197ms avg/req, 184ms - 285ms)

after:

500 requests to https://github.com/dashboard as tmm1
peak memory: 359.2 MB RSS
cpu time: 6,806ms total (13ms avg/req, 12ms - 20ms)
500 requests to https://github.com/github/github/pull/17695 as tmm1
peak memory: 410.3 MB RSS
cpu time: 95,559ms total (191ms avg/req, 177ms - 280ms)

#4 Updated by Sam Saffron 4 months ago

There was a concern that gettimeofday is expensive and adds too much time to my results.

I measured:

static VALUE
rb_timing_overhead_per_100k(VALUE klass)
{
    int i, iterations = 100000;
    struct timeval temp, tv1, tv2, res;

    gettimeofday(&tv1, NULL);
    for (i = 0; i < iterations; i++) {
        gettimeofday(&temp, NULL);
    }
    gettimeofday(&tv2, NULL);

    timersub(&tv2, &tv1, &res);
    /* convert seconds to microseconds before adding tv_usec */
    return INT2FIX(res.tv_sec * 1000000 + res.tv_usec);
}

It turns out that it's about 2000 microseconds per 100k calls, so the additional time due to measuring is negligible and the numbers can be trusted.

#5 Updated by Sam Saffron 4 months ago

Discourse Bench:

Disabled Method Cache vs Current cache vs 16x larger method cache vs Funny Falcon

Disabled Method cache:

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 43
75: 44
90: 48
99: 67
topic_page:
50: 14
75: 15
90: 16
99: 39
home_page_admin:
50: 57
75: 60
90: 64
99: 89
topic_page_admin:
50: 23
75: 24
90: 26
99: 35
timings:
load_rails: 2733
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Current Method Cache:


home_page:
50: 37
75: 39
90: 43
99: 60
topic_page:
50: 12
75: 13
90: 14
99: 40
home_page_admin:
50: 48
75: 50
90: 55
99: 77
topic_page_admin:
50: 20
75: 21
90: 23
99: 32
timings:
load_rails: 2721
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

16x larger method cache

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 35
75: 38
90: 41
99: 67
topic_page:
50: 11
75: 11
90: 13
99: 34
home_page_admin:
50: 45
75: 47
90: 52
99: 79
topic_page_admin:
50: 18
75: 19
90: 20
99: 29
timings:
load_rails: 2761
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Funny Falcon

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 34
75: 35
90: 38
99: 53
topic_page:
50: 10
75: 11
90: 12
99: 37
home_page_admin:
50: 43
75: 45
90: 49
99: 68
topic_page_admin:
50: 18
75: 19
90: 20
99: 32
timings:
load_rails: 2771
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz


Observations:

The Funny Falcon patch is clearly fastest; the 16x cache size increase also makes a very big difference. The method caching impact is about 20%, give or take.

#6 Updated by Sam Saffron 4 months ago

Note: for comparison, here are the results on an unpatched 2.0.0-p353:


home_page:
50: 35
75: 40
90: 107
99: 119
topic_page:
50: 12
75: 13
90: 17
99: 150
home_page_admin:
50: 45
75: 55
90: 118
99: 127
topic_page_admin:
50: 19
75: 21
90: 76
99: 100
timings:
load_rails: 3644
ruby-version: 2.0.0-p353
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Either the 16x size increase or the funny-falcon implementation means Ruby 2.1 no longer regresses performance.

#7 Updated by Yura Sokolov 4 months ago

Added a couple of fixes to the patch:
1. Fix rb_mcache_resize: it didn't copy method_state and class_serial on resize, so the cache was invalidated on the next check.
2. Add "garbage collection" of "undefined" cache entries: do not copy them on resize, and shrink the cache size if it is too sparse after the resize.

https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache
https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache.diff

#8 Updated by Sam Saffron 3 months ago

FYI it seems perf of method lookups has regressed in 2.1:

https://gist.github.com/SamSaffron/8232978

This makes this change particularly important

#9 Updated by Koichi Sasada 3 months ago

Hi,

The method cache technique is important now, and we need more measurement and better techniques.

##

From Ruby 2.0, we use the inline/global method cache aggressively, so a method cache miss has a huge performance impact (compared with Ruby 1.8 and 1.9), as you have already shown.

funny_falcon's approach is very impressive for solving this problem.
However, I'm not sure how much it increases memory usage.

tmm1 measured the memory impact on
https://bugs.ruby-lang.org/issues/9262#change-43840 . This survey is
very impressive. I think it is better if we measure other cases too. For
example, if some class calls many methods at once, it will be a memory
issue. I think the simple limitation (cap) approach on top of
funny_falcon's patch works fine.

##

Over the new year's holiday (Japanese take holidays in the new year's week), I was thinking about this issue. Some ideas:

(1) inline cache
(1-1) polymorphic inline cache
(1-2) per class inline cache other than call site
(1-3) per-method cache invalidation
(2) global cache
(2-1) per class method cache w/ cap

The above ideas are not implemented/verified yet, and huge effort is
needed (because we need to change the method entry data structure).
Before trying them, I need to know why and how the method cache misses.

##

Basically I am not against introducing and backporting some of the
proposed patches (with measurement, if we can). In my opinion, a simple
configurable global cache entry size is fine for backport.

I will also try the above ideas for Ruby 2.2. The current patches are a
good starting point, I think.

--
// SASADA Koichi at atdot dot net

#10 Updated by Eric Wong 3 months ago

SASADA Koichi ko1@atdot.net wrote:

From Ruby 2.0, we use the inline/global method cache aggressively, so a method cache miss has a huge performance impact (compared with Ruby 1.8 and 1.9), as you have already shown.

Not a serious patch or benchmark, but I don't think it's too bad without
the global method cache (with the inline cache enabled).

Previously, OPT_GLOBAL_METHOD_CACHE did not disable writes to the method
cache, only reads:

http://bogomips.org/ruby.git/patch/?id=a899e54e6abc13b8816e6c5f69ff560918db4be1

git://80x24.org/ruby no_method_cache

I'll rerun the test later tonight when my machine is quieter.

#11 Updated by Eric Wong 3 months ago

It looks like the performance regressions w/o the global method cache are
because rb_funcall and friends do not have call info, so they don't
hit the inline cache. So perhaps we should just add call info-aware
versions of the rb_funcall-like functions so we can just use the inline
cache everywhere.

I'm pretty sure ko1 already knows that, but I just discovered it :x
tmm1: what do you think?

#12 Updated by Eric Wong 3 months ago

Eric Wong normalperson@yhbt.net wrote:

So perhaps we should just add a call info-aware
version of rb_funcall-like functions so we can just use inline cache
everywhere.

I should add: a cheap way to do this might be to just use a do/while
macro wrapping rb_funcall, with a static __thread rb_call_info_t
variable in its scope.

__thread works on gcc and clang, and maybe other compilers too,
but other compilers may be stuck with the slow version (or a non-MT-safe one).
(I prefer we use __thread in case we get rid of the GVL.)

#13 Updated by Koichi Sasada 2 months ago

  • ruby -v changed from trunk to -

(2014/01/30 4:17), Eric Wong wrote:

It looks like the performance regressions w/o the global method cache are
because rb_funcall and friends do not have call info, so they don't
hit the inline cache. So perhaps we should just add call info-aware
versions of the rb_funcall-like functions so we can just use the inline
cache everywhere.

I'm pretty sure ko1 already knows that, but I just discovered it :x
tmm1: what do you think?

charliesome has proposed a similar API (sorry, I forget the URL).
He uses only a static variable (not thread-local) and it seems to work well.
However, I think it may have pitfalls (recursive calls) so I can't decide
to introduce it.

--
// SASADA Koichi at atdot dot net

#14 Updated by Yura Sokolov 2 months ago

I tried to do this once with static variables, but had a performance regression. Perhaps I missed something.

#15 Updated by Eric Wong about 1 month ago

normalperson@yhbt.net wrote:

http://bogomips.org/ruby.git/patch/?id=a899e54e6abc13b8816e6c5f69ff560918db4be1

Btw, I'd like to commit just the vm_method.c change for this
to avoid writing to method cache if disabled.

It also replaces CPP #if with C if for readability.

Ugh, the commit message was long; just the vm_method.c change:
http://bogomips.org/ruby.git/diff/vm_method.c?id=a899e54e6abc1

#16 Updated by Nobuyoshi Nakada about 1 month ago

Eric Wong wrote:

Btw, I'd like to commit just the vm_method.c change for this
to avoid writing to method cache if disabled.

Agreed.

It also replaces CPP #if with C if for readability.

I don't think it is needed.

#17 Updated by Eric Wong about 1 month ago

nobu@ruby-lang.org wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? Could just do this:
http://80x24.org/gmc_disable.patch

#18 Updated by Eric Wong about 1 month ago

Eric Wong normalperson@yhbt.net wrote:

nobu@ruby-lang.org wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? Could just do this:
http://80x24.org/gmc_disable.patch

r45261. I used my original hunk in rb_method_entry_get_without_cache
since I got "make test" failures on my 32-bit machine without checking
UNDEFINED_METHOD_ENTRY_P(me) when bypassing GMC.

#19 Updated by Nobuyoshi Nakada about 1 month ago

Eric Wong wrote:

nobu@ruby-lang.org wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? Could just do this:
http://80x24.org/gmc_disable.patch

I meant "replace CPP #if with C".
