Feature #18885: End of boot advisory API for RubyVM - Ruby - Ruby Issue Tracking System

Updated by byroot (Jean Boussier) almost 4 years ago
#1

Related to Feature #11164: Garbage collector in Ruby 2.2 provokes unexpected CoW added

Updated by byroot (Jean Boussier) almost 4 years ago
#2 [ruby-core:109098]

Description updated (diff)

Another possible optimization I just found:

Strings have a lazily computed coderange attribute in their flags. So if a string is allocated at boot, but only used after fork, its coderange may be computed and the string mutated.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an UNKNOWN coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.
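As a rough userland approximation of that idea, the sketch below walks every live string and calls `String#valid_encoding?` on it. The method name `scan_all_coderanges` is made up for illustration, and the sketch assumes that, in CRuby, `valid_encoding?` computes and caches the coderange as a side effect. A real implementation would live inside the VM.

```ruby
# Hypothetical helper, not part of any proposed API.
# Assumption: in CRuby, String#valid_encoding? scans the string and caches
# its coderange in the object flags, so later calls don't write to the page.
def scan_all_coderanges
  scanned = 0
  ObjectSpace.each_object(String) do |str|
    str.valid_encoding? # forces the coderange scan now, before any fork
    scanned += 1
  end
  scanned
end

puts "scanned #{scan_all_coderanges} strings"
```

This would be called once in the parent at the end of boot, so that children don't dirty shared pages the first time they inspect those strings.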

This also makes me think that this API isn't only useful for forking setup. Even if you use only threads or fibers, you may want to tell the VM that you are done loading and that it's now time to perform optimizations. So the API may use a more generic name.

Updated by ioquatix (Samuel Williams) almost 4 years ago
#3 [ruby-core:109227]

This is a really nice idea. My current implementation uses GC.compact during prefork stage, and it shows a big advantage. I'm happy to test any proposals with real world workloads.
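A minimal sketch of what such a prefork step could look like (this is not the actual implementation referred to above; `compact_before_fork` is a hypothetical name):

```ruby
# Hypothetical prefork hook: compact the heap once in the parent so that
# live objects are packed onto fewer pages before workers are forked.
def compact_before_fork
  # GC.compact is not available on every platform or Ruby version.
  return false unless GC.respond_to?(:compact)

  GC.compact
  true
rescue NotImplementedError
  # Raised on builds where compaction is not supported.
  false
end

compacted = compact_before_fork
puts "heap compacted: #{compacted}"
```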

Updated by byroot (Jean Boussier) almost 4 years ago
#4 [ruby-core:109339]

Another optimization that could be invoked from this method is malloc_trim.

Updated by Dan0042 (Daniel DeLorme) almost 4 years ago
#5 [ruby-core:109380]

I think the state of Copy-on-Write is already pretty decent, but any improvement is of course very welcome. As to naming, since this is mainly for preforking servers, what about Process.prefork?

Updated by byroot (Jean Boussier) almost 4 years ago
#6 [ruby-core:109381]

> the state of Copy-on-Write is already pretty decent,

It depends how you look at it. In the few apps on which I optimized CoW as much as I could, only between 50% and 60% of the parent process memory is shared. That really isn't that good.

Updated by mame (Yusuke Endoh) almost 4 years ago
#7 [ruby-core:109409]

We discussed this issue at the dev meeting. We did not reach any conclusion, but I'd like to share some comments.

What and how efficient is this proposal?

Some attendees wanted to see a quantitative evaluation of the benefits this proposal would bring.
@ko1 (Koichi Sasada) said that he created nakayoshi_fork as a joke gem. He didn't expect people to use it seriously, and he didn't have serious quantitative measurements.

(I've heard people say that memory usage has been reduced by nakayoshi_fork, but it would be nice to properly confirm this advantage before introduction.)

How is it integrated with `Process._fork`?

Process._fork has been introduced as a zero-argument API. This API is supposed to be overridden, so we cannot add an argument easily.
If we keep Process._fork as is, we need to do some GC processing, as nakayoshi_fork does, before the hook of Process._fork. Is that OK?

Are "short-lived" forks needed?

How much are "short-lived" forks used nowadays? The major use case where Process.exec is called shortly after Process.fork, is covered by Process.spawn.
If there are few use cases for "short-lived" forks, we may change the default behavior to "long-lived".
However, we sometimes use fork in tests, to invoke a temporary web server, for example. Calling GC whenever calling fork might be too heavy.

Is GC called whenever `fork(long_lived: true)` is called?

Here is a typical server code that uses fork:

```ruby
loop do
  sock = servsock.accept
  if fork(long_lived: true)
    ...
  end
end
```

The parent process creates only a socket object for each iteration. It looks somewhat useless to call full GC in the parent process every time fork(long_lived: true) is called. A more intelligent strategy may be preferable here.
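One possible shape for such a strategy, purely as an illustration and not something proposed in this thread: skip the GC pass unless the parent has allocated a meaningful number of objects since the last one. `ForkPreparer` and its `threshold` parameter are hypothetical names; `GC.stat(:total_allocated_objects)` is the only VM facility used.

```ruby
# Hypothetical helper: only pay for a full GC before fork when the parent
# has allocated enough new objects since the previous preparation.
class ForkPreparer
  def initialize(threshold: 100_000)
    @threshold = threshold
    @last_allocated = nil
  end

  # Returns true when a GC pass was run, false when it was skipped.
  def prepare
    allocated = GC.stat(:total_allocated_objects)
    if @last_allocated && allocated - @last_allocated < @threshold
      return false
    end

    GC.start
    @last_allocated = GC.stat(:total_allocated_objects)
    true
  end
end

preparer = ForkPreparer.new
preparer.prepare # runs a full GC the first time
preparer.prepare # skipped: almost nothing was allocated in between
```

In an accept loop like the one above, `preparer.prepare` would be called just before each `fork`, so that a parent which only creates one socket per iteration rarely triggers a collection.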

Updated by byroot (Jean Boussier) almost 4 years ago
#8 [ruby-core:109410]

@mame (Yusuke Endoh)

> it would be nice to properly confirm this advantage before introduction.

https://bugs.ruby-lang.org/issues/11164 is an example of how bad things can go without nakayoshi_fork (or similar). I can get production data from some of our apps if you wish, but the effect is going to be very app dependent, so I'm not sure how relevant it would be. You can craft demo apps for which memory usage totally blows up if you don't promote objects to the old generation before forking.
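For readers unfamiliar with the technique, here is a minimal sketch in the spirit of nakayoshi_fork (not its actual source). It assumes CRuby's generational GC, where an object is promoted to the old generation after surviving a few collections; `age_heap_before_fork` and the `cycles` default are illustrative choices.

```ruby
# Hypothetical helper: run several minor GCs in the parent so that objects
# retained from boot age and get promoted to the old generation before fork.
# Old objects are not touched by later minor GCs, which keeps their pages
# shared between parent and children for longer.
def age_heap_before_fork(cycles: 4)
  cycles.times { GC.start(full_mark: false) } # full_mark: false requests a minor GC
  GC.stat(:old_objects)
end

old_count = age_heap_before_fork
puts "old generation objects: #{old_count}"
```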

> How is it integrated with Process._fork?

Since I wrote this, I'm now convinced that it shouldn't be a fork argument, but a distinct API on RubyVM, since if you fork multiple workers you don't want to run these things again, as it could invalidate CoW in previous workers.

So IMO RubyVM.prepare_for_long_lived_fork is the proper API (aside from the name which I doubt is desirable).

Also please note that several of the proposed optimizations can only be done from inside Ruby, so decorating fork like nakayoshi_fork does is not an option. That is the main reason for this proposal.
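To illustrate the "prepare once, fork many" point, here is a small sketch. The `Prefork` module, its method names, and the use of `GC.start` as the preparation step are all assumptions made for the example; the real preparation work would be whatever the VM-level API ends up doing.

```ruby
# Hypothetical prefork helper: the preparation step runs at most once in
# the parent, no matter how many workers are forked afterwards.
module Prefork
  @prepared = false

  # Returns true the first time (work was done), false on later calls.
  def self.prepare
    return false if @prepared

    GC.start # stand-in for the VM-level optimizations discussed here
    @prepared = true
  end

  # Forks `count` workers, each running the given block, and returns their pids.
  def self.spawn_workers(count, &work)
    prepare
    Array.new(count) { fork(&work) }
  end
end

pids = Prefork.spawn_workers(2) { exit!(0) } # workers exit immediately in this demo
pids.each { |pid| Process.wait(pid) }
```

Running the preparation again between the first and second worker could move or rewrite objects in pages the first worker already shares, which is the invalidation described above.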

Updated by mame (Yusuke Endoh) almost 4 years ago
#9 [ruby-core:109417]

> So IMO RubyVM.prepare_for_long_lived_fork is the proper API (aside from the name which I doubt is desirable).

After calling this, all fork calls are treated as "long-lived". Is my understanding right?

> Also please note that several of the proposed optimizations can only be done from inside Ruby, so decorating like nakayoshi_fork does is not an option.

I know that, but it seemed hard to me to convince the committers to change the API first for optimizations that have not been implemented yet and whose effectiveness we don't know. IMO, it is good to focus on the use case of nakayoshi_fork since it is already implemented and used by quite a few people. If there is a proper evaluation of the effect of nakayoshi_fork, it would be easier to persuade @matz (Yukihiro Matsumoto).

Updated by byroot (Jean Boussier) almost 4 years ago
#10 [ruby-core:109420]

> After calling this, all fork calls are treated as "long-lived". Is my understanding right?

Well, this wouldn't change anything in the Process.fork implementation. I think I need to rewrite the ticket description because it is now confusing. I'll do it in a minute.

Also, as said before, I don't even think this is specific to forking servers anymore. I think RubyVM.make_ready or something like that would be just fine. Even if you don't fork, optimizations such as precomputing inline caches could improve performance of the first request.

> it is good to focus on the use case of nakayoshi_fork

Ok, so here's a thread from when Puma added it as an option two years ago, https://github.com/puma/puma/issues/2258#issuecomment-630510423

> After fixing the config bug in nakayoshi_fork, Codetriage is now showing about a 10% reduction in memory usage

Some other people report good numbers too, but generally they enabled other changes at the same time.

Updated by byroot (Jean Boussier) almost 4 years ago
#11 [ruby-core:109421]

Subject changed from Long lived fork advisory API (potential Copy on Write optimizations) to End of boot advisory API for RubyVM
Description updated (diff)

OK, I updated the description. It's still very much focused on CoW, but hopefully it should now be clearer that it's not the only benefit.

Also, it now only asks for a method on RubyVM, which could perfectly well be marked as experimental, so API change concerns should be minimal.

Updated by Dan0042 (Daniel DeLorme) almost 4 years ago
#12 [ruby-core:109469]

I think the terminology used here might cause some confusion in the discussion.

"End of boot" makes it sound like this API would be useful for non-forking servers once they have finished their "boot" sequence. But from what I understand this is still very much a fork-specific API. Is there any point to precompute inline caches if there is no fork?

"Long lived" child processes are not really the point, I think? Imagine a (ridiculous) architecture where the parent keeps spawning children and each child serves a single request before dying. Despite being short-lived, these processes would benefit from this API. So it's not about preparing children for being long-lived, it's about preparing the parent for having any children.

Updated by byroot (Jean Boussier) almost 4 years ago
#13 [ruby-core:109470]

> Is there any point to precompute inline caches if there is no fork?

Yes, the first "request" (or whatever your unit of work is) won't have to do it. So you are moving some work to boot time, instead of user input processing time.
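Without any VM support, the closest userland equivalent is to exercise the hot code paths once at the end of boot, which fills the same caches as a side effect. A minimal sketch, assuming a Rack-style application; `warm_up`, the sample paths, and the stand-in `app` are all hypothetical:

```ruby
# Hypothetical warm-up helper for a Rack-style app (anything responding to
# #call(env) and returning [status, headers, body]).
def warm_up(app, paths)
  paths.map do |path|
    status, _headers, _body = app.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => path })
    status
  end
end

# Stand-in application, used here only for illustration.
app = ->(env) { [200, { "content-type" => "text/plain" }, ["hello from #{env["PATH_INFO"]}"]] }

# Run at the end of boot, before accepting real traffic.
warm_up(app, ["/", "/health"])
```

The cost of the first execution is paid at boot instead of during the first real request.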

> these processes would benefit from this API.

For the CoW parts no, not much. If the child isn't going to live for long, it's unlikely to invalidate that many pages.

Updated by matz (Yukihiro Matsumoto) almost 4 years ago
#14 [ruby-core:109528]

I am OK with adding this feature, but I have some concerns with the place and the name.
RubyVM is not globally available (e.g., not for JRuby or TruffleRuby). And I don't think prepare or ready describes the whole functionality.

Matz.

Updated by byroot (Jean Boussier) almost 4 years ago
#15 [ruby-core:109529]

Thank you Matz.

> RubyVM is not globally available (e.g., not for JRuby or TruffleRuby).

Yes, that was on purpose because the behavior would be very VM specific; some VMs might not even have it. It's not meant to be a cross-implementation feature.

> And I don't think prepare or ready describes the whole functionality.

I'll try to come up with other names.

Updated by Eregon (Benoit Daloze) almost 4 years ago
#16 [ruby-core:109531]

An API to notify "end of boot" seems useful beyond just fork COW optimizations, as you say.
For instance a JIT might use that as a hint for what to compile/stop compiling/purge the queue during boot/reset compilation counters/etc.
So it shouldn't be under RubyVM, which would mean it is only available on CRuby (forever).

Maybe a Kernel class method?

Kernel.booted/Kernel.application_booted/Kernel.code_loaded/Kernel.startup_done maybe?

Updated by byroot (Jean Boussier) almost 4 years ago
#17 [ruby-core:109533]

What about ObjectSpace?

Updated by byroot (Jean Boussier) over 3 years ago
#18 [ruby-core:109901]

So I wrote a reproduction script to showcase the effect of constant caches on Copy on Write performance:

```ruby
# Reads Linux memory accounting for a process from /proc/<pid>/smaps_rollup.
# All sizes are in kB.
class MemInfo
  def initialize(pid = "self")
    @info = parse(File.read("/proc/#{pid}/smaps_rollup"))
  end

  # Proportional set size: shared pages are divided among the sharers.
  def pss
    @info[:Pss]
  end

  # Resident set size.
  def rss
    @info[:Rss]
  end

  def shared_memory
    @info[:Shared_Clean] + @info[:Shared_Dirty]
  end

  # Share of the parent's RSS that this process still shares, in percent.
  def cow_efficiency
    shared_memory.to_f / MemInfo.new(Process.ppid).rss * 100.0
  end

  private

  # Turns "Field:   1234 kB" lines into { Field: 1234 }.
  def parse(rollup)
    fields = {}
    rollup.each_line do |line|
      if (matchdata = line.match(/(?<field>\w+)\:\s+(?<size>\d+) kB$/))
        fields[matchdata[:field].to_sym] = matchdata[:size].to_i
      end
    end
    fields
  end
end

CONST_NUM = Integer(ENV.fetch("NUM", 100_000))

module App
  # Define CONST_NUM constants, each with a method that references it,
  # so that every method carries its own constant cache.
  CONST_NUM.times do |i|
    class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      Const#{i} = Module.new

      def self.lookup_#{i}
        Const#{i}
      end
    RUBY
  end

  # Calling App.warmup runs every lookup once, filling all the caches.
  class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
    def self.warmup
      #{CONST_NUM.times.map { |i| "lookup_#{i}"}.join("\n")}
    end
  RUBY
end

puts "=== fresh parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

def print_child_meminfo
  meminfo = MemInfo.new
  puts "PSS: #{meminfo.pss} kB"
  puts "Shared #{meminfo.shared_memory} kB"
  puts "CoW efficiency: #{meminfo.cow_efficiency.round(1)}%"
  puts
end

# First child: forked before the parent has filled its caches.
fork do
  puts "=== fresh fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait

# Fill the caches in the parent, then fork again.
App.warmup

puts "=== warmed parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

fork do
  puts "=== warmed fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait
```

Results:

```
$ docker run -v $PWD:/app -it ruby:3.1 ruby /app/app.rb
=== fresh parent stats ===
RSS: 236104 kB

=== fresh fork stats ===
PSS: 117198 kB
Shared 233828 kB
CoW efficiency: 99.0%

PSS: 199734 kB
Shared 72740 kB
CoW efficiency: 30.8%

=== warmed parent stats ===
RSS: 237128 kB

=== warmed fork stats ===
PSS: 117632 kB
Shared 234880 kB
CoW efficiency: 99.1%

PSS: 118318 kB
Shared 235444 kB
CoW efficiency: 99.3%
```

What this shows

When we first fork the process, the memory cost is close to 0. The parent process has ~230MiB RSS, but 99% of that is shared with the first child, putting the actual cost of the fork at barely a couple MiB.

However as soon as we start executing code in the child that wasn't warmed up in the parent, the inline caches are being filled, which invalidates the shared pages. After that only a third of the parent memory is shared, putting the cost of the child at about 163MiB.

The second part of the reproduction first warms up these caches in the parent before forking. As a result the child doesn't invalidate shared memory when it executes the code, and the cost of the child remains totally negligible.
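The arithmetic behind those figures can be checked directly from the reported numbers. In this sketch the helper names are made up, and the child's cost is approximated as the parent's RSS minus the memory still shared, which assumes the child's resident set is about the same size as the parent's:

```ruby
# Hypothetical helpers; all sizes are in kB, as reported by smaps_rollup.

# Share of the parent's RSS still shared with the child, in percent.
def cow_efficiency(parent_rss_kb, shared_kb)
  (shared_kb.to_f / parent_rss_kb * 100).round(1)
end

# Approximate memory the child no longer shares with the parent.
def child_cost_kb(parent_rss_kb, shared_kb)
  parent_rss_kb - shared_kb
end

parent_rss = 236_104 # "fresh parent" RSS from the results

puts cow_efficiency(parent_rss, 233_828) # 99.0, right after fork
puts child_cost_kb(parent_rss, 233_828)  # 2276 kB, "barely a couple MiB"

puts cow_efficiency(parent_rss, 72_740)  # 30.8, after the child runs App.warmup
puts child_cost_kb(parent_rss, 72_740)   # 163364 kB
```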

What it means for the real world

Of course this repro is specially crafted to show the impact of constant caches; there are other sources of invalidation such as method caches, etc. But as mentioned, now that https://github.com/ruby/ruby/pull/6187 has been merged, it should be easy to prewarm the constant caches when the proposed API is called.

I guess all we need is a name. Maybe ObjectSpace.optimize?

Updated by ioquatix (Samuel Williams) over 3 years ago
#19 [ruby-core:109989]

This is awesome. Nice work.

I also like warmup as a name.

Updated by Dan0042 (Daniel DeLorme) over 3 years ago
#20 [ruby-core:110045]

+1 for Process.warmup

Updated by matz (Yukihiro Matsumoto) over 3 years ago
#21 [ruby-core:110231]

Process.warmup sounds better than other candidates. My only concern is that the target of warming up might not be Process in the future (e.g. when Ractor local GC is introduced).

Matz.

Updated by byroot (Jean Boussier) over 3 years ago
#22 [ruby-core:110232]

Thank you Matz!

> My only concern is that the target of warming up might not be Process in the future

Given the type of optimizations we have in mind right now, I think they'll still be global even in a Ractor-heavy context. The main semantic of this signal is "I'm done loading my code", which doesn't change even with heavy Ractor use.

Updated by byroot (Jean Boussier) about 3 years ago
#23

Status changed from Open to Closed

Applied in changeset git|ba6ccd871442f55080bffd53e33678c0726787d2.

Implement Process.warmup

[Feature #18885]

For now, the optimizations performed are:

- Run a major GC
- Compact the heap
- Promote all surviving objects to oldgen

Other optimizations may follow.

Updated by byroot (Jean Boussier) about 3 years ago
#24

Status changed from Closed to Open

Updated by ioquatix (Samuel Williams) about 3 years ago
#25 [ruby-core:113213]

Looking forward to using this.

Updated by byroot (Jean Boussier) almost 3 years ago
#26

Status changed from Open to Closed

Applied in changeset git|fa30b99c34291cde7b17cc709552bc5681729a12.

Implement Process.warmup

[Feature #18885]

For now, the optimizations performed are:

- Run a major GC
- Compact the heap
- Promote all surviving objects to oldgen

Other optimizations may follow.
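
A minimal usage sketch, assuming a Ruby that ships this changeset (Process.warmup is available from Ruby 3.3). The wrapper name `warmup_after_boot` is hypothetical; the guard lets the same boot script run on older Rubies and on other implementations.

```ruby
# Hypothetical wrapper: signal "end of boot" to the VM when it supports it.
# Returns true when Process.warmup was called, false otherwise.
def warmup_after_boot
  return false unless Process.respond_to?(:warmup)

  Process.warmup
  true
end

# Typical placement: after the application code is loaded,
# before forking workers or accepting traffic.
warmed = warmup_after_boot
puts "Process.warmup called: #{warmed}"
```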
