Feature #18885

End of boot advisory API for RubyVM

Added by byroot (Jean Boussier) over 2 years ago. Updated over 1 year ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:109081]

Description

Context

Many optimizations in the Ruby VM rely on lazily computed caches: String coderanges, constant caches, method caches, etc.
As such, even without a JIT, some operations need a bit of warm-up, and these caches might be flushed if new constants are defined, new code is loaded, or some objects are mutated.

Additionally, these lazily computed caches can increase memory usage for applications relying on Copy-on-Write memory.
Whenever one of these caches is updated post-fork, the entire memory page is invalidated. Precomputing these caches at the end of boot,
even if based on heuristics, could improve Copy-on-Write performance.

The classic example is object generations: young objects must be promoted to the old generation before forking, otherwise they'll get invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification when it needs to be done.

Proposal

If applications had an API to notify the virtual machine that they're done loading code and are about to start processing user input,
it would give the VM a good point in time to perform optimizations on the existing code and objects.

e.g. it could be something like RubyVM.prepare or RubyVM.ready.

It's somewhat similar to Matz's static barrier idea from RubyConf 2020, except that it wouldn't disable any feature.

Potential optimizations

nakayoshi_fork already does the following:

  • Do a major GC run to get rid of as many dangling objects as possible.
  • Promote all surviving objects to the highest generation.
  • Compact the heap.

But it would be much simpler to do this from inside the VM rather than doing cryptic things such as 4.times { GC.start } from the Ruby side.

It's also not good to do this on every fork: once you've forked the first long-lived child, you shouldn't run it again. So decorating fork is not a good hook point.
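For reference, a rough Ruby-side approximation of what nakayoshi_fork performs before forking (a sketch for illustration; the actual gem decorates Process.fork and does more bookkeeping):

```ruby
# Approximation of the nakayoshi_fork pre-fork sequence, from Ruby:
4.times { GC.start } # repeated major GC runs sweep dangling objects and age survivors
GC.compact           # defragment the heap so shared pages stay dense after fork
```

This is exactly the kind of "cryptic" sequence that would be simpler and more reliable if done inside the VM itself.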

Also after discussing with @jhawthorn (John Hawthorn), @tenderlovemaking (Aaron Patterson) and @alanwu (Alan Wu), we believe this would open the door to several other CoW optimizations:

Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation is inline caches. Most ISeqs are never invoked during initialization, so child processes are forked with mostly cold caches. As a result, the first time a method is executed in the child, many memory pages holding ISeqs are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant caches in particular should be resolvable statically; see https://github.com/ruby/ruby/pull/6187.

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.

Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. The Instagram engineering team introduced something like that in Python (ticket, PR).

That makes the GC aware of which objects live on a shared page. With this information the GC can decide not to free dangling objects living on these pages, not to compact these pages, etc.

Scan the coderange of all strings

Strings have a lazily computed coderange attribute in their flags. So if a string is allocated at boot but only used after fork, on first use its coderange may need to be computed, mutating the string.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an UNKNOWN coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.
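A user-land approximation of this idea (a sketch, not the proposed implementation, which would live inside the VM): walking all retained strings and forcing a validity scan caches the coderange before forking, so children don't mutate shared pages on first use.

```ruby
# Hypothetical user-land prewarm: force coderange computation for all
# retained strings. valid_encoding? scans the bytes and caches the
# coderange in the string's flags as a side effect.
ObjectSpace.each_object(String) do |s|
  s.valid_encoding?
end
```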

malloc_trim

This hook will also be a good point to release unused pages to the system with malloc_trim.


Related issues: 1 (0 open, 1 closed)

Related to Ruby master - Feature #11164: Garbage collector in Ruby 2.2 provokes unexpected CoW (Rejected; author: Nari (Narihiro Nakamura))
Actions #1

Updated by byroot (Jean Boussier) over 2 years ago

  • Related to Feature #11164: Garbage collector in Ruby 2.2 provokes unexpected CoW added

Updated by byroot (Jean Boussier) over 2 years ago

  • Description updated (diff)

Another possible optimization I just found:

Strings have a lazily computed coderange attribute in their flags. So if a string is allocated at boot, but only used after fork, its coderange may be computed and the string mutated.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an UNKNOWN coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.

This also makes me think that this API isn't only useful for forking setups. Even if you use only threads or fibers, you may want to tell the VM that you are done loading and that it's now time to perform optimizations. So the API may deserve a more generic name.

Updated by ioquatix (Samuel Williams) over 2 years ago

This is a really nice idea. My current implementation uses GC.compact during prefork stage, and it shows a big advantage. I'm happy to test any proposals with real world workloads.

Updated by byroot (Jean Boussier) over 2 years ago

Another optimization that could be invoked from this method is malloc_trim.

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

I think the state of Copy-on-Write is already pretty decent, but any improvement is of course very welcome. As to naming, since this is mainly for preforking servers, what about Process.prefork?

Updated by byroot (Jean Boussier) over 2 years ago

the state of Copy-on-Write is already pretty decent,

It depends on how you look at it. In the few apps where I optimized CoW as much as I could, only between 50% and 60% of the parent process memory is shared. That really isn't that good.

Updated by mame (Yusuke Endoh) over 2 years ago

We discussed this issue at the dev meeting. We did not reach any conclusion, but I'd like to share some comments.

What and how efficient is this proposal?

Some attendees wanted to confirm quantitative evaluation of the benefits this proposal would bring.
@ko1 (Koichi Sasada) said that he created nakayoshi_fork as a joke gem. He didn't expect people to use it seriously, and he didn't have serious quantitative measurements.

(I've heard people say that memory usage has been reduced by nakayoshi_fork, but it would be nice to properly confirm this advantage before introducing it.)

How is it integrated with Process._fork?

Process._fork has been introduced as a zero-argument API. This API is supposed to be overridden, so we cannot add an argument easily.
If we keep Process._fork as is, we would need to do some GC processing, like nakayoshi_fork does, before the Process._fork hook. Is that OK?

Are "short-lived" forks needed?

How much are "short-lived" forks used nowadays? The major use case, where Process.exec is called shortly after Process.fork, is covered by Process.spawn.
If there are few use cases for "short-lived" forks, we may change the default behavior to "long-lived".
However, we sometimes use fork in tests, to invoke a temporary web server, for example. Running GC on every fork might be too heavy.
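The fork+exec pattern mentioned above can be sketched as follows; Process.spawn combines both steps into one call, so no Ruby-level fork hook would be involved:

```ruby
require "rbconfig"

# fork+exec in a single call: spawn a child ruby that exits immediately.
# On platforms without fork, spawn uses the native process-creation API.
pid = Process.spawn(RbConfig.ruby, "-e", "exit 0")
Process.wait(pid)
```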

Is GC called whenever fork(long_lived: true) is called?

Here is a typical server code that uses fork:

loop do
  sock = servsock.accept
  if fork(long_lived: true)
    ...
  end
end

The parent process creates only a socket object in each iteration. It looks somewhat wasteful to run a full GC in the parent process every time fork(long_lived: true) is called. A more intelligent strategy may be preferable here.

Updated by byroot (Jean Boussier) over 2 years ago

@mame (Yusuke Endoh)

it would be nice to properly confirm this advantage before introduction.

https://bugs.ruby-lang.org/issues/11164 is an example of how bad things can go without nakayoshi_fork (or similar). I can get production data from some of our apps if you wish, but the effect is going to be very app-dependent, so I'm not sure it's very relevant. You can craft demo apps for which memory usage totally blows up if you don't promote objects to the old generation before forking.

How is it integrated with Process._fork?

Since I wrote this, I'm now convinced that it shouldn't be a fork argument but a distinct API on RubyVM, since if you fork multiple workers you don't want to run these things again, as that could invalidate CoW in previously forked workers.

So IMO RubyVM.prepare_for_long_lived_fork is the proper API (though I doubt that exact name is desirable).

Also please note that several of the proposed optimizations can only be done from inside Ruby, so decorating fork the way nakayoshi_fork does is not an option. That's the main reason for this proposal.

Updated by mame (Yusuke Endoh) over 2 years ago

So IMO RubyVM.prepare_for_long_lived_fork is the proper API (aside from the name which I doubt is desirable).

After calling this, all fork calls are treated as "long-lived". Is my understanding right?

Also please note that several of the proposed optimization can only be done from inside Ruby, so decorating like nakayoshi_fork does is not an option.

I know that, but it seemed hard to me to convince the committers to change the API first, for optimizations that have not been implemented yet and whose effectiveness we don't know. IMO, it is good to focus on the use case of nakayoshi_fork since it is already implemented and used by quite a few people. If there were a proper evaluation of the effect of nakayoshi_fork, it would be easier to persuade @matz (Yukihiro Matsumoto).

Updated by byroot (Jean Boussier) over 2 years ago

After calling this, all fork calls are treated as "long-lived". Is my understanding right?

Well, this wouldn't change anything in the Process.fork implementation. I think I need to rewrite the ticket description because it is now confusing; I'll do it in a minute.

Also, as said before, I don't even think this is specific to forking servers anymore; I think RubyVM.make_ready or something like that would be just fine. Even if you don't fork, optimizations such as precomputing inline caches could improve the performance of the first request.

it is good to focus on the use case of nakayoshi_fork

Ok, so here's a thread from when Puma added it as an option two years ago, https://github.com/puma/puma/issues/2258#issuecomment-630510423

After fixing the config bug in nakayoshi_fork, Codetriage is now showing about a 10% reduction in memory usage

Some other people report good numbers too, but generally they enabled other changes at the same time.

Updated by byroot (Jean Boussier) over 2 years ago

  • Subject changed from Long lived fork advisory API (potential Copy on Write optimizations) to End of boot advisory API for RubyVM
  • Description updated (diff)

OK, I updated the description. It's still very much focused on CoW, but hopefully it should now be clearer that it's not the only benefit.

Also, it now only asks for a method on RubyVM, which could perfectly well be marked as experimental, so API-change concerns should be minimal.

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

I think the terminology used here might cause some confusion in the discussion.

"End of boot" makes it sound like this API would be useful for non-forking servers once they have finished their "boot" sequence. But from what I understand this is still very much a fork-specific API. Is there any point to precompute inline caches if there is no fork?

"Long lived" children processes are not really the point I think? Imagine a (ridiculous) architecture where the parent keeps spawning children and each child serves a single request before dying. Despite being short-lived, these processes would benefit from this API. So it's not about preparing children for being long-lived, it's about preparing the parent for having any children.

Updated by byroot (Jean Boussier) over 2 years ago

Is there any point to precompute inline caches if there is no fork?

Yes, the first "request" (or whatever your unit of work is) won't have to do it. So you are moving some work to boot time, instead of user input processing time.

these processes would benefit from this API.

For the CoW parts no, not much. If the child isn't going to live for long, it's unlikely to invalidate that many pages.

Updated by matz (Yukihiro Matsumoto) over 2 years ago

I am OK with adding this feature, but I have some concerns with the place and the name.
RubyVM is not globally available (e.g., not for JRuby or TruffleRuby). And I don't think prepare or ready describes the whole functionality.

Matz.

Updated by byroot (Jean Boussier) over 2 years ago

Thank you Matz.

RubyVM is not globally available (e.g., not for JRuby or TruffleRuby).

Yes, that was on purpose, because the behavior would be very VM-specific; some VMs might not even have it. It's not meant to be a cross-implementation feature.

And I don't think prepare or ready describes the whole functionality.

I'll try to come up with other names.

Updated by Eregon (Benoit Daloze) over 2 years ago

An API to notify "end of boot" seems useful beyond just fork COW optimizations, as you say.
For instance a JIT might use that as a hint for what to compile/stop compiling/purge the queue during boot/reset compilation counters/etc.
So it shouldn't be under RubyVM, which would mean it's only available on CRuby (forever).

Maybe a Kernel class method?

Kernel.booted/Kernel.application_booted/Kernel.code_loaded/Kernel.startup_done maybe?

Updated by byroot (Jean Boussier) over 2 years ago

What about ObjectSpace?

Updated by byroot (Jean Boussier) over 2 years ago

So I wrote a reproduction script to showcase the effect of constant caches on Copy on Write performance:

class MemInfo
  def initialize(pid = "self")
    @info = parse(File.read("/proc/#{pid}/smaps_rollup"))
  end

  def pss
    @info[:Pss]
  end

  def rss
    @info[:Rss]
  end

  def shared_memory
    @info[:Shared_Clean] + @info[:Shared_Dirty]
  end

  def cow_efficiency
    shared_memory.to_f / MemInfo.new(Process.ppid).rss * 100.0
  end

  private

  def parse(rollup)
    fields = {}
    rollup.each_line do |line|
      if (matchdata = line.match(/(?<field>\w+)\:\s+(?<size>\d+) kB$/))
        fields[matchdata[:field].to_sym] = matchdata[:size].to_i
      end
    end
    fields
  end
end

CONST_NUM = Integer(ENV.fetch("NUM", 100_000))

module App
  CONST_NUM.times do |i|
    class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      Const#{i} = Module.new

      def self.lookup_#{i}
        Const#{i}
      end
    RUBY
  end

  class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
    def self.warmup
      #{CONST_NUM.times.map { |i| "lookup_#{i}"}.join("\n")}
    end
  RUBY
end

puts "=== fresh parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

def print_child_meminfo
  meminfo = MemInfo.new
  puts "PSS: #{meminfo.pss} kB"
  puts "Shared #{meminfo.shared_memory} kB"
  puts "CoW efficiency: #{meminfo.cow_efficiency.round(1)}%"
  puts
end

fork do
  puts "=== fresh fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait

App.warmup

puts "=== warmed parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

fork do
  puts "=== warmed fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait

Results:

$ docker run -v $PWD:/app -it ruby:3.1 ruby /app/app.rb
=== fresh parent stats ===
RSS: 236104 kB

=== fresh fork stats ===
PSS: 117198 kB
Shared 233828 kB
CoW efficiency: 99.0%

PSS: 199734 kB
Shared 72740 kB
CoW efficiency: 30.8%

=== warmed parent stats ===
RSS: 237128 kB

=== warmed fork stats ===
PSS: 117632 kB
Shared 234880 kB
CoW efficiency: 99.1%

PSS: 118318 kB
Shared 235444 kB
CoW efficiency: 99.3%

What this shows

When we first fork the process, the memory cost is close to 0. The parent process has ~230 MiB RSS, but 99% of that is shared with the first child, putting the actual cost of the fork at barely a couple of MiB.

However, as soon as we start executing code in the child that wasn't warmed up in the parent, the inline caches get filled, which invalidates the shared pages. After that, only a third of the parent memory is shared, putting the cost of the child at about 163 MiB.

The second part of the reproduction first warms up these caches in the parent before forking. As a result, the child doesn't invalidate shared memory when it executes the code, and the cost of the child remains totally negligible.

What it means for the real world

Of course this repro is specially crafted to show the impact of constant caches; there are other sources of invalidation, such as method caches, etc. But as mentioned, now that https://github.com/ruby/ruby/pull/6187 has been merged, it should be easy to prewarm the constant caches when the proposed API is called.

I guess all we need is a name. Maybe ObjectSpace.optimize?

Updated by ioquatix (Samuel Williams) about 2 years ago

This is awesome. Nice work.

I also like warmup as a name.

Updated by Dan0042 (Daniel DeLorme) about 2 years ago

+1 for Process.warmup

Updated by matz (Yukihiro Matsumoto) about 2 years ago

Process.warmup sounds better than other candidates. My only concern is that the target of warming up might not be Process in the future (e.g. when Ractor local GC is introduced).

Matz.

Updated by byroot (Jean Boussier) about 2 years ago

Thank you Matz!

My only concern is that the target of warming up might not be Process in the future

Given the type of optimizations we have in mind right now, I think they'll still be global even on a Ractor heavy context. The main semantic of this signal is "I'm done loading my code" which doesn't change even with heavy Ractor use.

Actions #23

Updated by byroot (Jean Boussier) over 1 year ago

  • Status changed from Open to Closed

Applied in changeset git|ba6ccd871442f55080bffd53e33678c0726787d2.


Implement Process.warmup

[Feature #18885]

For now, the optimizations performed are:

  • Run a major GC
  • Compact the heap
  • Promote all surviving objects to oldgen

Other optimizations may follow.
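The commit message above translates into a simple usage pattern for preforking servers (an illustrative sketch; run_worker is a placeholder for an application's request loop, and the respond_to? guard is only there for portability to Rubies without the method):

```ruby
# Placeholder for the application's request loop.
def run_worker
  # ... serve requests ...
end

# Once all application code is loaded, notify the VM (Ruby 3.3+):
# runs a major GC, compacts the heap, and promotes survivors to oldgen.
Process.warmup if Process.respond_to?(:warmup)

pid = fork do
  run_worker # child starts on warmed, promoted, compacted pages
end
Process.wait(pid)
```

Note that warmup is called once in the parent, before the first fork, matching the earlier point that running it on every fork would invalidate CoW pages in previously forked workers.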

Actions #24

Updated by byroot (Jean Boussier) over 1 year ago

  • Status changed from Closed to Open

Updated by ioquatix (Samuel Williams) over 1 year ago

Looking forward to using this.

Actions #26

Updated by byroot (Jean Boussier) over 1 year ago

  • Status changed from Open to Closed

Applied in changeset git|fa30b99c34291cde7b17cc709552bc5681729a12.


Implement Process.warmup

[Feature #18885]

For now, the optimizations performed are:

  • Run a major GC
  • Compact the heap
  • Promote all surviving objects to oldgen

Other optimizations may follow.
