Feature #21950: Add a built-in CPU-time profiler

Added by osyoyu (Daisuke Aritomo) 1 day ago. Updated about 5 hours ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:124971]

Description

Modern CRuby workloads can consume CPU concurrently across multiple native threads, especially with multiple Ractors and C extensions that release the GVL. I'd like to propose integrating a built-in CPU-time profiler into CRuby to enable more accurate and stable profiling in such situations.

Motivation & Background

CPU profilers indicate how much CPU time was consumed by different methods. Most CPU-time profilers rely on the kernel to track consumed CPU time. setitimer(2) and timer_create(2) are APIs to configure kernel timers. The process receives a profiling signal (SIGPROF) each time a given amount of CPU time (e.g. 10 ms) has been consumed.

In general, a profiler needs to know which thread consumed how much CPU time to attribute work to the correct thread. This wasn't a real requirement for Ruby in the pre-Ractor age, since only one native thread could consume CPU time at any given moment due to GVL limitations. Using process-wide CPU-time timers provided by setitimer() effectively did the job. It was safe to assume that the active thread was doing some work using the CPU when a profiling signal arrived.

Of course, this assumption does not hold in all situations. One is the case where C extensions release the GVL. Another is the multi-Ractor situation. In these cases, multiple native threads may simultaneously consume CPU time.

Linux 2.6.12+ provides a per-thread timer for this purpose. Profilers such as Pf2 and dd-trace-rb use this feature to keep track of CPU time. However, utilizing this API requires information that CRuby does not necessarily expose. Both profilers carry copies of CRuby headers in order to access internal structs such as rb_thread_t. This is quite fragile and possibly unsustainable as CRuby evolves towards Ractors and M:N threading.
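The difference between process-wide and per-thread CPU time can be observed from plain Ruby. The following is a minimal sketch, not part of the proposal: it assumes the platform defines Process::CLOCK_THREAD_CPUTIME_ID (Linux does) and that each Ruby thread stays on its own native thread while it is measured, which may not hold under M:N scheduling. The helper names thread_cpu_seconds and compare_cpu_clocks are made up for illustration.

```ruby
# Sketch: compare process-wide CPU time with per-thread CPU time.
# THREAD_CLOCK is nil on platforms without a per-thread CPU clock.
THREAD_CLOCK =
  Process.const_defined?(:CLOCK_THREAD_CPUTIME_ID) ? Process::CLOCK_THREAD_CPUTIME_ID : nil

# CPU seconds consumed by the calling thread while the block runs.
def thread_cpu_seconds
  started = Process.clock_gettime(THREAD_CLOCK)
  yield
  Process.clock_gettime(THREAD_CLOCK) - started
end

# Runs one CPU-bound thread and one sleeping thread, then reports the CPU
# time each thread consumed next to the CPU time of the whole process.
def compare_cpu_clocks
  return nil unless THREAD_CLOCK

  process_started = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  busy = Thread.new { thread_cpu_seconds { 3_000_000.times { |i| i * i } } }
  idle = Thread.new { thread_cpu_seconds { sleep 0.05 } }
  result = { busy: busy.value, idle: idle.value }
  result[:process] = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - process_started
  result
end

report = compare_cpu_clocks
puts report.inspect
```

The process-wide figure lumps both threads together, while the per-thread figures show that only the busy thread did real work. A process-wide timer cannot make that distinction when several native threads run at once.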

Proposal

Implement a built-in CPU-time profiler as ext/profile, and add some information extraction points to the VM built exclusively for it.

require 'profile'
RubyVM::Profile.start
# your code here
RubyVM::Profile.stop #=> results

ext/profile will take care of the following in coordination with some VM helpers:

  • Tracking creation and destruction of native threads (NTs)
  • Management of kernel timers for those threads
    • i.e. calling pthread_getcpuclockid(3) and timer_create(2)
    • This will require iterating over all Ractors and shared NTs on profiler init
  • Handling of SIGPROF signals which those timers will generate
  • Walking the stack (calling rb_profile_frames())
    • I'm not going to make this part of this ticket, but I'm thinking we can make rb_profile_frames() even more granular by following VM instructions, which is probably something we don't want to expose as an API
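To make the sample-and-walk-the-stack step concrete, here is a rough pure-Ruby sketch. It is not how ext/profile would work: it samples on wall-clock time from the calling thread with sleep, instead of from a SIGPROF handler driven by a CPU-time timer, and it uses Thread#backtrace_locations as a stand-in for rb_profile_frames(). The names hot_loop and sample_top_frames are hypothetical.

```ruby
# Spin in a recognizable method until asked to stop.
def hot_loop(flag)
  counter = 0
  counter += 1 until flag[:stop]
  counter
end

# Take `samples` snapshots of `thread`, pausing `interval` seconds before
# each one, and count which method was on top of its Ruby stack.
def sample_top_frames(thread, samples:, interval:)
  counts = Hash.new(0)
  taken = 0
  while taken < samples
    sleep interval
    locations = thread.backtrace_locations
    break if locations.nil?    # the thread has already finished
    next if locations.empty?   # no Ruby frames to report yet
    counts[locations.first.base_label.to_s] += 1
    taken += 1
  end
  counts
end

flag = { stop: false }
worker = Thread.new { hot_loop(flag) }
counts = sample_top_frames(worker, samples: 3, interval: 0.01)
flag[:stop] = true
worker.join
p counts
```

Because the sampler has to wait for the GVL, the samples land at the interpreter's thread-switch points rather than at evenly spaced moments of CPU time. That bias is one of the accuracy problems that kernel timers and VM cooperation are meant to remove.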

We would need to add some APIs to the VM for ext/profile:

  • An API returning all alive NTs
  • Event hooks notifying creation and destruction of NTs
  • Event hooks notifying assignment/unassignment of Ruby threads (RTs) to NTs

Since only Linux provides the full set of required kernel features, the initial implementation will be limited to Linux systems. I can expand support to other POSIX systems by employing setitimer() later, but they will receive limited granularity (process-level CPU-time timers).

Output interface

One thing to consider is the output format. I think we have a few practical choices here:

  • Adopt pprof's profile.proto format.
    • This is a widely accepted format across tools including visualizers.
    • The format itself is pretty flexible, so it shouldn't be hard to add custom fields.
  • Just return a Hash containing profile results.
    • We'd need to design some good format.
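As a starting point for discussing the Hash option, here is one possible shape, built from raw samples. Everything in it is an assumption made for illustration: the RawSample struct, the build_profile helper, and the keys :interval_ms, :threads, :frames, :count and :cpu_ms do not come from the proposal.

```ruby
# One raw sample: which thread was interrupted and its stack, outermost frame first.
RawSample = Struct.new(:thread_id, :frames)

# Fold raw samples into a Hash keyed by thread. Identical stacks are merged,
# and each sample is assumed to stand for `interval_ms` of consumed CPU time.
def build_profile(raw_samples, interval_ms:)
  per_thread = Hash.new { |hash, thread_id| hash[thread_id] = Hash.new(0) }
  raw_samples.each { |sample| per_thread[sample.thread_id][sample.frames] += 1 }

  threads = per_thread.transform_values do |stacks|
    stacks.map do |frames, count|
      { frames: frames, count: count, cpu_ms: count * interval_ms }
    end
  end

  { interval_ms: interval_ms, threads: threads }
end

samples = [
  RawSample.new(1, ["main", "work"]),
  RawSample.new(1, ["main", "work"]),
  RawSample.new(2, ["main", "io"]),
]
profile = build_profile(samples, interval_ms: 10)
p profile
```

A flat structure like this is easy to inspect in irb and to convert to other formats in a gem, but unlike profile.proto it has no existing tooling that consumes it directly.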

Things out of scope

  • Visualization
  • Interaction with external visualizers / trackers / etc

These can be better left to normal RubyGems.

Why not an external gem?

Through maintaining Pf2, a CPU-time profiler library, I have encountered many difficulties obtaining the information required for accurate profiling. The rule of thumb is that the more internal information a profiler can access, the more accuracy it can achieve. However, from the CRuby maintenance perspective, I suppose that APIs exposing many implementation details are not wanted.

Locating a profiler under ext/ looks like a nice middle ground. Placing core 'profiling' logic (sampler scheduling, sampling itself) in CRuby and abstracting it as 'RubyVM::Profile' should work cleanly.

It should be noted that existing profilers have their own unique features, such as markers, unwinding of C stacks and integration with external APMs. I don't want to make this a tradeoff between accuracy and features; instead, I'd like to design an API where both can coexist.

Study on other languages

A handful of VM-based languages carry a profiler implementation in their runtime.

Among these, OpenJDK is a notable entry. JVM profilers have used AsyncGetCallTrace(), which is just like rb_profile_frames(), to obtain stack traces from signal handlers. The signal originates from kernel timers installed by the profilers, configured to fire every specified interval of CPU time (e.g. 10 ms).

Even though AsyncGetCallTrace() and async-profiler (its primary user) are very sophisticated and battle-tested, JFR folks have decided to control sampling timing within the runtime to improve accuracy and stability.

For more information on JVM, see:
