Project

General

Profile

Actions

Feature #16614

closed

New method cache mechanism for Guild

Added by ko1 (Koichi Sasada) about 1 year ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Target version:
[ruby-core:97084]

Description

I'm developing a brand new method caching mechanism for Guild and further improvements.

Important ideas:

  • (1) Disposable inline method cache (IMC) for race-free inline method cache
    • Making call-cache (CC) as a RVALUE (GC target object) and allocate new CC on cache miss
  • (2) Introduce per-Class method cache (pCMC)
    • Instead of fixed-size global method cache (GMC), pCMC allows flexible cache size
    • Caching CCs reduces CC allocation and allow sharing CC's fast-path between same call-info (CI) call-sites.
  • (3) Invalidate an inline method cache by invalidating corresponding method entries (MEs)
    • Instead of using class serials, we set "invalidated" flag for method entry itself to represent cache invalidation.
    • Compare with using class serials, the impact of method modification (add/overwrite/delete) is small.
    • Updating class serials invalidate all method caches of the class and sub-classes.
    • Proposed approach only invalidate the method cache of only one ME.

Acronym

  • IMC: Inline Method Cache
  • GMC: Global Method Cache (and its table)
  • pCMC: per-Class Method Cache
  • CC: Call Cache (struct rb_call_cache, renamed to rb_callcache)
  • CI: Call Info (struct rb_call_info, renamed to rb_callinfo)
  • ME: method entry (method body)

Background

Method caching mechanism is an important technique to improve performance of method invocation on OO language interpreters.
Without method caching, we need to traverse class-tree to find method definition for each method invocation.

class-tree_traversak

Recent Ruby uses two layer method caching: inline method cache (IMC) and global method cache (GMC).
IMC caches method entries (ME / method definition) and fast-path information (in some cases, skip checking code).
GMC caches method entries.
Method "cache" requires cache invalidate protocol and Ruby interpreter employs (basically) class-serial approach.

On a method call, filling call-cache (CC) by IMC and GMC, looking up class-tree and invoke it.

method_call_flow

The following subsections introduce Ruby's implementation in brief.

Inline method cache (IMC)

IMC is represented by rb_call_cache struct, which are allocated per call-site.
rb_call_cache contains keys and values:

  • Keys: method_state and class_serial(s).
  • Values: (v1) a method entry (ME) and (v2) info to invoke the method

In a narrow sense, (v1) is a main purpose of method cache.
IMC also contains (v2) to skip some checking processes using a result of last execution. We call it is fast-path.

When cache hits (a CC is not invalidated), we can use values without searching ME in class tree and can use fast-path.

Checking cache invalidation is done by compare with current serials.

imc_hit_miss_judge

per-Class serials are updated when defining a method and so on. Not only the modified class, but also sub-classes of modified classes. So that modified class and its sub-classes are updated, and all of method caches which points to the methods of these classes are invalidated.

ex1) we modify Object class, then all of method cache will be invalidated.
ex2) we add C#foo, then all of caches for methods of C and sub-classes are invalidated. If D < C, then a cache for D#bar is also invalidated, even if C#foo is not related to D#bar. Essentially, we don't need to invalidate IMCs for D#bar, but it is easy to invalidate.

imc_invalidation

Global method cache (GMC)

GMC is a fixed-size hash table: Keys (serials) -> Value (a method entry), only one table in a interpreter process.
Keys are same as IMC's.

Method lookup on method call using IMC and GMC

At first, it checks IMC. If it hits, then return ME and fast-path information to invoke the method.
If it misses, then check GMC to get corresponding ME.
If we can't find ME in GMC, then we search ME in class-tree.
After looking up a ME, setup CC for IMC and invoke a found method.

Problems

There are several issues on current method caching mechanism, especially for Guild: parallel execution system.

Atomic access

For IMC, keys and values are written randomly. If guilds (parallel processing elements) share IMC, it introduces data race.
To avoid the race, there are several traditional ways:

(1) access synchronously
(2) copy CCs for Guilds

However, Accessing IMC is so frequently in Ruby, so that (1) is unacceptable because of its overhead.
(2) is one way for performance, but it can consumes O(n) memory (n: the number of Guilds).

FYI: very simple Rails web app has about 10,000 call-site.

The size of Global method cache

GMC is a simple table, but the size is fixed. We can tune by RUBY_GLOBAL_METHOD_CACHE_SIZE envval at the process start time, but it is not easy to predict the optimal value.

Update references for GC.compact

GC.compact requires updating the references in IMC/GMC. However it is difficult to update references because of several implementation restrictions. Now GC.compact invalidate all of method cache. It works, but not so fast.

Invalidate extra methods

Assume that we cached C1#bar, C1#baz, ... (C1 < C0). When C0#foo is added, then C1#bar (and others) will be invalidated even if C1#bar and C0#foo is not related.

(Most of case, this is not a problem because such (re-)definitions are not applied in production mode, I guess)

Can't share the knowledge between call-sites.

We allocate IMC per call-site, so there is no relation between call-sites.
However, we can share the fast-path information between them.

For example, the following code calls require methods with same manner.

require 'x'
require 'y'
require 'z'

If we can share call_cache information between the 3 call-sites, we can call require with fast-path for latter 2 method calls (first call is for setting up fast-path).

Proposal

To solve above issues, I propose a new method cache mechanism. There are two layers.

(1) Disposable inline method cache

To assure atomic access, I use similar technique to RCU. Invalidate the IMC, then throw away this data structure instead of reusing it. Allocate an IMC as an GC target objects, thrown away IMC will be collected by GC.

(2) per-Class method cache (pCMC)

Instead of global method cache, we introduce per-Class method cache (pCMC).
pCMC is a map mid -> CCs. "CCs" is (a) cached method entry and (b) cached call-cache entries.
(a) is similar to the global method cache.
(b) is new idea. We can reuse call-cache entries between same type call-site.
In this context, the type is "call-info", a tuple of (argc, flags and keywords).
CCs are disposable, but we can share a CC

metho_call_flow_comparison

Disposable IMC

Each call-site points Call Info (CI) and Call Cache (CC). CI is static information which are decided at compile time. CC is dynamic information and master store cache information by mutating single CC per call-site.

Disposable IMC, we make a CC as a GC target object. ISEQ needs to mark cached CCs.
Allocate a new CC and point it from call-site if cache miss, and release missed-CC.

Cache hit/miss decision is very simple: compare the klass (receiver's class). The code is : CC->klass == klass. Very simple.
After that, we need to check validity of the cache, by checking cached ME's valid bit: CC->me->valid_bit.

imc_hit_miss_judge_cmp

If cache is not hit, then allocate or reuse another CC by asking per-Class method cache (pCMC) and store it in the call-site.
Old CC is released from cached call-site (ISEQ) and GC will collect it soon if the CC is not cached by pCMC.

per-Class method cache (pCMC)

pCMC is consists of tables which each classes have.

A pCMC is a table mid -> CCs. CCs is new data structure, based on two data: a cached method entry (ME) and CC entries corresponding to call-info (CI).

Cached ME is a same cached value of GMC. The difference is number of the tables.

CC entries is a set of CC objects for each CIs ([[ci1, cc1], [ci2, cc2], ...]).
On a same CI, I found that a same CC works well (or I fixed to work well).
It means that call-sites which use same CI can use CC's fast-path.

pCMC

Invalidate protocol

Two type invalidation protocol: clearing CC->klass = 0 and clear valid bit for ME (ME->valid_bit = 0).
It depends on how to modify a class by adding/overwriting/removing methods.

Modified class has no sub-classes

We only need to clear CCs of specified method id in class's pCMC table's entry.
Like that: modified_class.pCMC_table[mid].each_cc_entries{|cc| cc.klass = 0}
The computation time i O(n) (n: the number of CIs for cache). But n should be small numbers, I guess.

invalidate_nosubclasses.PNG

Modified class has sub-classes

We need to clear all sub-classes pCMC tables. However, the computation time is O(n) (n: number of sub classes).
Instead of clearing all corresponding caches in sub-classes , I decide to modify ME to tell invalidation.

if ME.cached_by_anyone?
  ME, defined_class = lookup(modified_class, mid)
  defined_class.define_method(mid, ME.dup)
  ME.valid_bit = 0
end

invalidate_with_subclasses.PNG

Computation time is O(n), n is class-tree height (modified_class.ancestors.size).
It allocates and copy the found ME, so it is not low cost operation.

Discussion

Solved issues

Atomic access

By using disposable IMC (CC), atomic access of IMC is guaranteed (strictly speaking, we need some more assumptions, but we can do that). Most of cases, IMC works well (hit frequently), so we need to tune this code, even if on the parallel computing.

Other logic requires appropriate locking mechanism. For example, to access pCMC we need to introduce synchronization mechanism (global lock, etc).

The size of Global method cache

We don't have GMC anymore.

Update references for GC.compact

CCs are objects, so we can update references by adding simple updating code.

Invalidate extra methods.

We only invalidate a specified method.
However there is a possibility to invalidate a non-related method yet.

Can't share the knowledge between call-sites.

We can share CCs between call-sites point to same CIs.
So we can utilize fast-path information more.

Introduced issues

Memory consumption predictability

pCMC can contain much memory because there is no limitation just now.
I don't think it is problem on most of cases, but we can introduce some limitation for it.

Implementation

https://github.com/ruby/ruby/pull/2888

Measurements

  • On small benchmark and optcarrot, almost same performance compare with master.
  • On simple rails application, I measured slows down a bit (discuss on next section).

TODO

Negative cache

We measured slows down on simple Rails application maybe because of lack of negative cache. Now we don't cache the information about "the specified method is not defined". So method looking by class-tree traversal is used frequently than master. We can introduce negative cache by introducing "undef" method entry, but it changes current code in many places and the patch will be bigger and bigger.

This is why I make a ticket now.

Polymorphic inline cache

Ruby 2.7 introduced limited version of polymorphic inline cache (class serials are not same, but points to the same method entry, they share the same method entry). Current proposed implementation does not have this feature.

Using value-ed CC it is easy to extend polymorphic inline cache from monolithic inline cache, I guess.
(it should be difficult to design how many entries are needed, LRU implementation and so on, though)

Callable method entry

To make explanation simple, I eliminate the description about method entries and callable method entries.
On master, GMC cached method entries and pCMC cached callable method entries. It can affect performance and we need to check it more.
This is a kind of trivial issue, but we need to make clear how to treat two kind of method entries.

MJIT support

Now it can compile on MJIT, but not completed (maybe it has GC bug).

CCs for rb_funcallv() and others

Now it is disabled but it is easy to implement it.

Conclusion

This ticket propose a new method cache mechanism, algorithm and implementation.
It is not completed, but I believe these techniques are needed for parallel execution.


Files

imc_hit_miss_judge.PNG (20.7 KB) imc_hit_miss_judge.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
class-tree_traversak.PNG (19.9 KB) class-tree_traversak.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
imc_invalidation.PNG (33.2 KB) imc_invalidation.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
imc_hit_miss_judge_cmp.PNG (27.7 KB) imc_hit_miss_judge_cmp.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
invalidate_nosubclasses.PNG (37.7 KB) invalidate_nosubclasses.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
invalidate_with_subclasses.PNG (65.7 KB) invalidate_with_subclasses.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
metho_call_flow_comparison.PNG (25.2 KB) metho_call_flow_comparison.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
method_call_flow.PNG (14.1 KB) method_call_flow.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM
pCMC.PNG (46 KB) pCMC.PNG ko1 (Koichi Sasada), 02/07/2020 08:37 AM

Related issues

Related to Ruby master - Bug #16658: `method__cache__clear` DTrace hook was dropped without replacementClosedko1 (Koichi Sasada)Actions
Related to Ruby master - Bug #17373: Ruby 3.0 is slower at Discourse bench than Ruby 2.7OpenActions
Actions

Also available in: Atom PDF