Feature #21950: Add a built-in CPU-time profiler
Added by osyoyu (Daisuke Aritomo) 1 day ago. Updated about 3 hours ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:124971]

Description

Modern CRuby workloads can consume CPU time concurrently across multiple native threads, especially with multiple Ractors and C extensions that release the GVL. I'd like to propose integrating a built-in CPU-time profiler into CRuby to enable more accurate and stable profiling in such situations.

Motivation & Background

CPU profilers indicate how much CPU time was consumed by different methods. Most CPU-time profilers rely on the kernel to track consumed CPU time: setitimer(2) and timer_create(2) are the APIs used to configure kernel timers. The process receives a profiling signal (SIGPROF) after each fixed amount of CPU time is consumed (e.g. every 10 ms).
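
As an aside (not part of the proposal), the process-wide notion of CPU time that these timers count against is already observable from plain Ruby via Process.clock_gettime. This sketch contrasts it with wall-clock time:

```ruby
# Process-wide CPU time vs. wall-clock time. CLOCK_PROCESS_CPUTIME_ID
# sums the CPU time consumed by all threads of the process -- the same
# quantity a setitimer()-based profiling timer counts down against.
cpu_before  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)

100_000.times { |i| i * i } # CPU-bound work: advances both clocks
sleep 0.05                  # idle: advances only the wall clock

cpu  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_before
wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_before
# cpu stays well below wall here, because the sleep consumed no CPU
```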

In general, a profiler needs to know which thread consumed how much CPU time to attribute work to the correct thread. This wasn't a real requirement for Ruby in the pre-Ractor age, since only one native thread could consume CPU time at any given moment due to GVL limitations. Using process-wide CPU-time timers provided by setitimer() effectively did the job. It was safe to assume that the active thread was doing some work using the CPU when a profiling signal arrived.

Of course, this assumption does not stand in all situations. One is the case where C extensions release the GVL. Another is the multi-Ractor situation. In these cases, multiple native threads may simultaneously consume CPU time.

Linux 2.6.12+ provides per-thread CPU-time timers to address this. Profilers such as Pf2 and dd-trace-rb use this feature to keep track of CPU time. However, utilizing this API requires information that CRuby does not necessarily expose: both profilers carry copies of CRuby headers in order to access internal structs such as rb_thread_t. This is quite fragile and possibly unsustainable as CRuby evolves towards Ractors and M:N threading.
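
For illustration, the per-thread CPU clock underlying this API is visible from Ruby on supporting platforms (Linux among them) as Process::CLOCK_THREAD_CPUTIME_ID; pthread_getcpuclockid(3) exposes the same clock to C code. A busy thread and an idle thread report very different values over the same wall-clock span:

```ruby
# Each thread reads its own per-thread CPU clock. Over the same
# wall-clock span, the busy thread accumulates real CPU time while
# the sleeping thread accumulates almost none.
busy_time = idle_time = nil

busy = Thread.new do
  500_000.times { |i| i * i } # CPU-bound work
  busy_time = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
end
idle = Thread.new do
  sleep 0.05                  # no CPU consumed while sleeping
  idle_time = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
end
[busy, idle].each(&:join)
# busy_time reflects the loop's work; idle_time stays near zero
```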

Proposal

Implement a built-in CPU-time profiler as ext/profile, and add some information extraction points to the VM built exclusively for it.

require 'profile'
RubyVM::Profile.start
# your code here
RubyVM::Profile.stop #=> results

ext/profile will take care of the following in coordination with some VM helpers:

  • Tracking creation and destruction of native threads (NTs)
  • Management of kernel timers for those threads
    • i.e. calling pthread_getcpuclockid(3) and timer_create(2)
    • This will require iterating over all Ractors and shared NTs on profiler init
  • Handling of SIGPROF signals which those timers will generate
  • Walking the stack (calling rb_profile_frames())
    • This won't be part of this ticket, but I'm thinking we can make rb_profile_frames() even more granular by following VM instructions, which is probably something we don't want to expose as a public API

We would need to add some APIs to the VM for ext/profile:

  • An API returning all alive NTs
  • Event hooks notifying creation and destruction of NTs
  • Event hooks notifying assign/unassign of RTs to NTs

Since only Linux provides the full set of required kernel features, the initial implementation will be limited to Linux systems. I can expand support to other POSIX systems by employing setitimer() later, but they will receive limited granularity (process-level CPU-time timers).

Output interface

One thing to consider is the output format. I think we have a few practical choices here:

  • Adopt pprof's profile.proto format.
    • This is a widely accepted format across tools including visualizers.
    • The format itself is pretty flexible, so it shouldn't be hard to add custom fields.
  • Just return a Hash containing profile results.
    • We'd need to design some good format.

Things out of scope

  • Visualization
  • Interaction with external visualizers / trackers / etc

These can be better left to normal RubyGems.

Why not an external gem?

Through maintaining Pf2, a CPU-time profiler library, I have encountered many difficulties obtaining the information required for accurate profiling. The rule of thumb is that the more internal information a profiler can access, the more accurate it can be. However, from the CRuby maintenance perspective, I suppose exposing many APIs that leak implementation details is not desirable.

Locating a profiler under ext/ looks like a nice middle ground. Placing the core profiling logic (sampler scheduling, sampling itself) in CRuby and abstracting it as RubyVM::Profile should work cleanly.

It should be noted that existing profilers have their own unique features, such as markers, unwinding of C stacks, and integration with external APMs. I don't want to make this a tradeoff between accuracy and features; instead, I'd like to design an API where both could live.

Study on other languages

A handful of VM-based languages carry a profiler implementation in their runtime.

Among these, OpenJDK is a notable entry. JVM profilers have traditionally used AsyncGetCallTrace(), which is just like rb_profile_frames(), to obtain stack traces from signal handlers. The signal originates from kernel timers installed by the profilers, configured to fire at a specified interval of CPU time (e.g. every 10 ms).

Even though AsyncGetCallTrace() and async-profiler (its primary user) are very sophisticated and battle-tested, JFR folks have decided to control sampling timing within the runtime to improve accuracy and stability.

For more information on JVM, see:

Updated by headius (Charles Nutter) about 24 hours ago Actions #1 [ruby-core:124972]

JFR folks have decided to control sampling timing within the runtime to improve accuracy and stability

The other issue with real CPU timing is the heavy responsibility for accurately representing individual methods' CPU times without influencing those results and returning results in a reasonable amount of time. That's a big reason why sampling profilers are used more often for JVM applications.

Any CPU timing in Ruby must also consider JIT improvements over time, and be able to decode things like inlined method calls so that the reported execution time is associated with the correct body of code. Many of the CPU profiling tools for JVM either disable aggressive optimizations (giving a poor view of optimized code CPU time) or indirectly break those optimizations (by triggering more safepoints and more stack traces that influence inlining heuristics).

I'd also caution against making this easily enabled by plain old Ruby code, because:

  1. People are definitely going to accidentally commit that stuff.
  2. JVM-based Ruby implementations generally won't support this, since for security and integrity reasons the low-level JVM CPU profilers usually have to be specified at boot time (or via service-loading mechanisms that are sometimes restricted).

Could the old dtrace hooks help this at all? That's an approach that would avoid users stumbling into deoptimization accidentally while also letting the kernel do most of the heavy lifting.

Updated by headius (Charles Nutter) about 24 hours ago Actions #2 [ruby-core:124973]

Also worth pointing out the ongoing efforts by the JIT team to eliminate artificial stack frames for leaf methods and some core method calls. As more artificial Ruby frames get elided, it will become harder to reconstruct the stack in a profiler. You can of course force those frames to be emitted, but then we're back to poisoning the profile.

Updated by osyoyu (Daisuke Aritomo) about 11 hours ago Actions #3 [ruby-core:124977]

Any CPU timing in Ruby must also consider JIT improvements over time, and be able to decode things like inlined method calls so that the reported execution time is associated with the correct body of code.

Yes, ZJIT probably would invalidate assumptions supporting the design of rb_profile_frames(). At the moment I'm unsure if some degree of deoptimization has to happen for profiling JITed sites, but even if so, I believe that kind of logic should live inside the runtime, not in external libraries.

People are definitely going to accidentally commit that stuff.

Are you mentioning APIs like RubyVM::Profile.enable? This kind of API is what existing Ruby profilers have provided and what people are used to, so I don't think there is a real risk here. Still, I am OK with providing an option like --profile.

Updated by Eregon (Benoit Daloze) about 10 hours ago · Edited Actions #4 [ruby-core:124979]

I think this would be good, and I agree this would avoid several gems reimplementing this in brittle and non-portable (e.g. CRuby-only) ways.

See https://github.com/truffleruby/truffleruby/issues/2044#issuecomment-654848324 for discussion of how to integrate StackProf and TruffleRuby, such an API would help there.

The API should not be under RubyVM: that namespace is CRuby-specific and cannot be supported on other Rubies, since gems do if defined?(RubyVM) and then assume other things like RubyVM::InstructionSequence exist. It's generally a bad idea to put anything there.
It could be Ruby::Profiler, that seems the most obvious fit. Could be Ruby::SamplingProfiler to be extra clear.

I think a high-level API like start/stop + being able to specify the sampling frequency (which would still be limited to a maximum frequency to avoid too much overhead) would be good.
It's also similar to the StackProf API, which is widely used.

And then a way to retrieve the results.
Could be Ruby objects representing backtraces, maybe simply an Array of Array of Thread::Backtrace::Location?
As a note, the rb_profile_frames() API very much returns things similar to Thread::Backtrace::Location: https://docs.ruby-lang.org/capi/en/master/db/d16/debug_8h.html
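
To make the shape of such a result concrete, here is a minimal wall-clock sampling sketch built only on existing public APIs (Thread#backtrace_locations). A real CPU-time profiler would be driven by per-thread kernel timers rather than sleep, but the data per sample is the same kind of Location array:

```ruby
# Sample one busy thread's stack ~100 times per second for a tenth
# of a second; each sample is an Array of Thread::Backtrace::Location.
target = Thread.new do
  loop { 1_000.times { |i| i * i } } # keep the thread busy
end

samples = []
10.times do
  locations = target.backtrace_locations
  samples << locations if locations  # nil once the thread is dead
  sleep 0.01                         # ~100 Hz wall-clock sampling
end
target.kill

# Each Location responds to #path, #lineno and #label, just like the
# entries returned by caller_locations.
```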

Some already-serialized format like pprof is also possible, but then it requires every usage of this new API to be able to parse that, so it seems less convenient.
It could be more efficient/compact though, due to fewer allocations and a more efficient representation for the same method appearing in many samples.

headius (Charles Nutter) wrote in #note-1:

Any CPU timing in Ruby must also consider JIT improvements over time, and be able to decode things like inlined method calls so that the reported execution time is associated with the correct body of code. Many of the CPU profiling tools for JVM either disable aggressive optimizations (giving a poor view of optimized code CPU time) or indirectly break those optimizations (by triggering more safepoints and more stack traces that influence inlining heuristics).
Also worth pointing out the ongoing efforts by the JIT team to eliminate artificial stack frames for leaf methods and some core method calls. As more artificial Ruby frames get elided, it will become harder to reconstruct the stack in a profiler. You can of course force those frames to be emitted, but then we're back to poisoning the profile.

This proposal is about Ruby-level profiling, not Java/C-level profiling, conceptually it's like collecting data equivalent to caller_locations efficiently every X milliseconds.
caller_locations must be correct with or without JIT, so I think this is not a concern; this needs to be figured out by the JITs regardless of this API.

headius (Charles Nutter) wrote in #note-1:

JVM-based Ruby implementations generally won't support this

This is not a concern at least for TruffleRuby, because this is Ruby-level profiling, not Java-level profiling.
In fact Truffle CPUSampler already provides such an API, it's just not exposed as a Ruby API yet.

Updated by osyoyu (Daisuke Aritomo) about 6 hours ago Actions #5 [ruby-core:124986]

Thank you for your comments and input!

Eregon (Benoit Daloze) wrote in #note-4:

The API should not be under RubyVM, that's CRuby-specific

RubyVM being CRuby-specific is actually the exact reason I chose it as the place to locate the Profile feature. I am rather doubtful that CRuby/JRuby/TruffleRuby can share any profiler implementation at all. The runtimes are too different, and they can emit very different information. For example, a useful CRuby profiler would want to emit GVL contention status, which is a CRuby-specific concept.

Given this, my idea was that each Ruby implementation would have its own profiling API (especially for output). Perhaps they could share a high-level Ruby::Profiler.start/stop API, but I'm not sure how useful that would be.

headius (Charles Nutter) wrote in #note-1:

Could the old dtrace hooks help this at all? That's an approach that would avoid users stumbling into deoptimization accidentally while also letting the kernel do most of the heavy lifting.

For notifying when to collect samples, perhaps yes. For avoiding deoptimization, I think that is more of how sample collection is implemented within the VM, not the notification path.
(I'm not sure if I have understood your question correctly)

Updated by Eregon (Benoit Daloze) about 6 hours ago Actions #6 [ruby-core:124987]

osyoyu (Daisuke Aritomo) wrote in #note-5:

I am rather doubtful that CRuby/JRuby/TruffleRuby can share any profiler implementation at all.

I think the opposite: it's actually easy to implement such an API on TruffleRuby, and I see no reason not to share it.

They are too different runtimes, and they can emit very different information.

Conceptually sampling profilers are mostly gathering backtraces of all threads at some frequency, no?
This is the same concept and very similar data for all Ruby implementations.

For example, a useful CRuby profiler would want to emit GVL contention status, which is a CRuby-specific concept.

TruffleRuby has a "GVL of sorts" for C extensions which are not marked as Ractor/Thread-safe (called the "C extensions lock"), so it could make sense there too.
In general contention/blocking is a useful concept to expose, as it also happens for Ruby-level locks like Mutex or more generally blocking for Queue/IO/etc.

Given this, my idea was that each Ruby implementation have its own profiling API (especially output). Perhaps they can share a high-level Ruby::Profiler.start/stop API, but I'm not sure how useful that would be.

What do you think of my suggestion to actually return Ruby objects rather than some pprof output?
An Array of Array of Thread::Backtrace::Location is totally doable across Ruby implementations.
And then some extra data for contention/blocking?

Updated by Eregon (Benoit Daloze) about 6 hours ago Actions #7 [ruby-core:124988]

osyoyu (Daisuke Aritomo) wrote in #note-5:

I am rather doubtful that CRuby/JRuby/TruffleRuby can share any profiler implementation at all.

They can't share the implementation of gathering backtraces (that's implementation-specific), sure, but the API should be shared, so whatever gems/visualizations are built on top can be shared too.

Updated by osyoyu (Daisuke Aritomo) about 5 hours ago Actions #8 [ruby-core:124989]

Eregon (Benoit Daloze) wrote in #note-6:

I am rather doubtful that CRuby/JRuby/TruffleRuby can share any profiler implementation at all.

I think the opposite, I think it's actually easy to implement such an API on TruffleRuby and I see no reason to not share it.

"Profiler implementation" meant (1) the sampling backend and (2) what information would be sampled. I see that TruffleRuby would also want to emit blocking time profiles, but that's only one case. I don't want the API/output format to restrict what a profiler could emit. I wouldn't say such a design is not doable, but careful consideration would be needed.

What do you think of my suggestion to actually return Ruby objects rather than some pprof output?
An Array of Array of Thread::Backtrace::Location is totally doable across Ruby implementations.
And then some extra data for contention/blocking?

I'm OK with the idea of returning Ruby objects in general, but I don't think an Array of Array of Thread::Backtrace::Location itself is practical.
If a 50-frame-deep backtrace for 8 threads gets collected 100 times a second, profiling for 5 minutes would produce roughly 12M Location objects (50 × 8 × 100 × 300), which is a pretty hard push on memory. Some sort of dedup must be done.

pprof/profile.proto is a format proven to be flexible and efficient enough for serializing practical Go application profiles in many modes, including CPU-time, heap, allocations, goroutines, mutex contention. It is also accepted by a number of existing polished visualizers (Perfetto, Grafana Pyroscope, Google Cloud Profiler to name a few).
Designing a Hash or Data structure based on profile.proto may be a good idea.
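
To sketch what a Hash modeled on profile.proto could look like (all names below are made up for illustration, not a proposed API), the key idea is interning frames once and collapsing identical stacks:

```ruby
# A hypothetical deduplicated profile shape inspired by profile.proto:
# frames are interned into a table, and each sample stores only small
# integer indices plus a count, so repeated stacks cost almost nothing.
def build_profile(raw_samples)
  frame_table = {}        # frame => index, in insertion order
  samples = Hash.new(0)   # array of frame indices => occurrence count

  raw_samples.each do |stack|
    ids = stack.map { |frame| frame_table[frame] ||= frame_table.size }
    samples[ids] += 1     # identical stacks collapse into one entry
  end

  {
    frames:  frame_table.keys,  # index => frame
    samples: samples.map { |ids, n| { stack: ids, count: n } }
  }
end

raw = [
  ["main", "work", "inner"],
  ["main", "work", "inner"],
  ["main", "other"]
]
profile = build_profile(raw)
# profile[:frames]  => ["main", "work", "inner", "other"]
# profile[:samples] => two entries instead of three raw stacks
```

As in profile.proto's string and location tables, memory then grows with the number of distinct frames and distinct stacks, not with the raw sample count.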

Updated by headius (Charles Nutter) about 3 hours ago · Edited Actions #9 [ruby-core:124990]

conceptually it's like collecting data equivalent to caller_locations efficiently every X milliseconds.

In fact Truffle CPUSampler

Unless I've read this issue completely wrong, it is not about adding a sampling profiler, it is about adding a CPU time profiler that does not sample.

A CPU time profiler is expected to show actual CPU time spent per method. That means accurately tracking execution time of every invocation, not periodically sampling to see what methods are on the stack. JRuby does ship such a profiler but it is of limited use due to the high overhead of tracking every method call and return.

Of course, if this is not actually a true CPU time profiler and simply a built-in sampling profiler, that's a much easier (albeit less accurate) feature to implement.
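
The cost difference between the two styles is easy to demonstrate: an instrumenting (non-sampling) CPU-time profiler must do work on every call and return. Here is a toy version using TracePoint and the per-thread CPU clock (the method name `burn` is illustrative):

```ruby
# A toy instrumenting profiler: read the thread's CPU clock on every
# :call/:return event. This is the "track every invocation" style;
# the per-event overhead is why such profilers are expensive compared
# to sampling.
cpu_by_method = Hash.new(0.0)
starts = []

tp_call = TracePoint.new(:call) do |tp|
  starts << [tp.method_id, Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)]
end
tp_return = TracePoint.new(:return) do |tp|
  name, t0 = starts.pop
  cpu_by_method[name] += Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - t0 if name
end

def burn
  200_000.times { |i| i * i }
end

tp_call.enable
tp_return.enable
burn
tp_call.disable
tp_return.disable
# cpu_by_method[:burn] now holds the (inclusive) CPU time spent in burn
```

Every traced call pays for two TracePoint dispatches plus two clock reads, which is exactly the overhead that makes sampling the more common choice.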

Updated by ivoanjo (Ivo Anjo) about 3 hours ago Actions #10 [ruby-core:124991]

The opentelemetry community project includes people from a number of companies building profilers (datadog, google, elastic, polar signals) and it's proposing a new format that's inspired by pprof but tries to solve for some of its shortcomings -- https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/profiles/v1development/profiles.proto . I'm one of the people at Datadog that's part of that group.

So this format might be an alternative to pprof; that said, using pprof is not a problem either, since the new format was built to allow easy conversion to and from pprof.

If anyone's curious, https://docs.google.com/document/d/1t1dz6ebaPONVrWCWy09j0f1n_Tgle7ZCrA8yqugY250/edit?usp=sharing is a blog post we're preparing with more details -- we plan to start stabilizing this soon.
