Feature #19119

Add an interface for out-of-process profiling tools to access Ruby information

Added by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago. Updated over 1 year ago.

Status: Open
Assignee: -
Target version: -
[ruby-core:110689]

Description

Warning: this is a very long proposal :( I don't really know what the most productive way to engage with the Ruby community about this is - if there's a better place to take this, please let me know!

I also want to acknowledge @ivoanjo (Ivo Anjo) for contributing to this proposal - I bounced many ideas about profiling off him and the improved-backtrace stuff is all based on his backtracie gem.

Motivation

When dealing with large and old Ruby apps, there is often both a need to improve the performance of some operations, as well as a lack of understanding as to how, precisely, the code works, and what aspect of it is actually slow. Furthermore, the performance issues in question often only appear in a live, production environment, where there is some traffic pattern or user data causing the problem. In situations like this, profiling tools can prove invaluable in helping to understand application performance in-situ in production.

The things which an engineer might use a profiler to measure include:

  • Wall clock time; how long do particular method calls take?
  • CPU time; of the time a method takes, how long is spent actually running on the CPU, and how much is spent blocked waiting for e.g. a network response?
  • GVL contention; is an application using lots of threads just waiting for the GVL to become available?
  • GC time; is a particular piece of code actually slow, or is its slowness just a result of other code causing the GC to be invoked?
  • Allocations; how many objects is a particular piece of code allocating, and what impact is this having on GC?
  • Heap; of the objects a particular piece of code allocated, how many of them are still live? Is there some kind of leak?

Generally, in order to collect this information from production in a minimally-impactful way, a sampling approach is used. A profiling tool periodically takes a snapshot of the current call-stack of a Ruby program; depending on what is being measured, this periodicity might be e.g. simply every N ms (for wall-clock profiling), every N allocations (for allocation profiling), etc. Then, the data will generally be combined across many instances of a program to get an aggregate view (through e.g. a flame graph).

In addition to this fleet-wide aggregation, it's sometimes useful to profile a particular process which has suddenly started displaying some pathology of interest. For example, if you had a webserver process that had suddenly become stuck, you might want to attach a profiler to find out why. In scenarios like this, it's often impossible to make some adjustment to the program to enable profiling and restart it; the act of restarting it would make the problem go away (for now!).

Current state of the art

This represents my best understanding of the ways in which most Ruby profiling tools work today. I'm sure I've missed some important details and other tools, for which I'm deeply sorry.

Signal-based profiling

Signal-based profilers can be used to measure CPU and wall clock time. In this scheme, a profiler (in the form of an extension gem) sets up a signal handler for SIGPROF in C, and then arranges for SIGPROF to be delivered to the process repeatedly in some kind of loop. The signal handler then captures a stack trace using the rb_profile_frames API from the C extension and records this as a single sample.
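To make that concrete, here is a minimal sketch of the mechanism, assuming a C extension context; the buffer size and sampling period are arbitrary, and a real profiler would append each sample to a buffer rather than discard it:

#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

#define MAX_FRAMES 256
static VALUE frame_buf[MAX_FRAMES];
static int line_buf[MAX_FRAMES];

/* Runs each time SIGPROF is delivered; captures the Ruby call stack. */
static void
sigprof_handler(int sig, siginfo_t *info, void *ucontext)
{
    int depth = rb_profile_frames(0, MAX_FRAMES, frame_buf, line_buf);
    (void)depth; /* a real profiler would record this sample here */
}

static void
install_profiler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* Deliver SIGPROF every 10ms of consumed CPU time. */
    struct itimerval timer = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &timer, NULL);
}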

This is the mechanism used for CPU/wall clock profiling in Stackprof, the Datadog continuous profiler, and I'm sure others. It's also a well-known mechanism used in other non-Ruby tools like gperftools.

One benefit of this mechanism is that it should be quite fast. Triggering the signal causes a context switch into the kernel & back into the handler, but execution stays inside the current program; there's no need for the execution to switch to another program (and consequently change address space).

The downside of this approach is that it requires some level of cooperation from the program being profiled. Something inside the process must install the SIGPROF handler; if you have some existing, running Ruby process for which a profiler is not loaded, you cannot begin profiling it without restarting it (and using some mechanism like a -rmy_profiler/autoload flag).

Another problem with this approach is the effect these SIGPROF signals have on the application. Normally, the profiler installs the signal handler with the SA_RESTART flag; this should mean that any in-progress system call the application is doing when the SIGPROF arrives gets transparently restarted when the handler finishes collecting its sample. However, on Linux at least, there are some system calls which are NOT restarted when SA_RESTART happens - notably some network related ones like poll, epoll, and send/recv when a socket timeout is set. These system calls will instead fail with EINTR; well behaved programs and libraries SHOULD retry this.
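For illustration, this is the retry idiom that well-behaved code needs to wrap around those non-restartable calls (a fully correct version would also recompute the remaining timeout before retrying):

#include <errno.h>
#include <poll.h>

/* poll(2) is one of the calls Linux will NOT restart under SA_RESTART;
 * it fails with EINTR every time a SIGPROF interrupts it. */
int
poll_retrying(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int rc;
    do {
        rc = poll(fds, nfds, timeout_ms);
    } while (rc == -1 && errno == EINTR);
    return rc;
}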

Sadly, however, the repeated delivery of SIGPROF tens or hundreds of times per second does have a way of shaking out corner cases in libraries where this isn't done properly. For example, I have personally discovered the hard way that libmysqlclient version 5.7 does this wrong in one place, so enabling a SIGPROF-based profiler can cause libmysqlclient to erroneously conclude that database connections were closed. This is undoubtedly a bug in libmysqlclient (which, by the way, seems to be fixed in version 8), and one that could be triggered by any other signal. There's no denying, though, that SIGPROF-based profiling is a real torture test for EINTR handling, and there undoubtedly are buggy libraries out there in the wild that people are using in their applications. It would be nice if Ruby profilers could work around this problem.

RUBY_INTERNAL_EVENT_NEWOBJ profiling

Collection of stacks from SIGPROF works for CPU/wall time profiling, but a different approach is needed for in-process profilers to collect allocation profiles. This is normally done by registering a tracepoint handler for RUBY_INTERNAL_EVENT_NEWOBJ in the profiler extension gem; this is called every time a Ruby object is allocated. From this handler, the profiler can use rb_profile_frames again to collect a Ruby backtrace of the code responsible for allocating this object.
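A minimal sketch of such a hook, using the documented tracepoint API (a real profiler would typically sample, e.g. every Nth allocation, to keep the overhead bounded):

#include <ruby/ruby.h>
#include <ruby/debug.h>

#define MAX_FRAMES 256
static VALUE frame_buf[MAX_FRAMES];
static int line_buf[MAX_FRAMES];

/* Called on every Ruby object allocation. */
static void
newobj_hook(VALUE tpval, void *data)
{
    /* Capture the Ruby stack responsible for this allocation. */
    int depth = rb_profile_frames(0, MAX_FRAMES, frame_buf, line_buf);
    (void)depth; /* a real profiler would record the sample here */
}

void
install_allocation_hook(void)
{
    VALUE tp = rb_tracepoint_new(Qnil, RUBY_INTERNAL_EVENT_NEWOBJ,
                                 newobj_hook, NULL);
    rb_tracepoint_enable(tp);
}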

Ptrace-based profiling

Another approach that has been used is to control the Ruby program to be profiled with another, separate process, using an appropriate platform API (e.g. ptrace on Linux). Periodically, the profiler process can use ptrace to suspend execution of the target Ruby program, and collect a stack trace of it. This is the approach used by the rbspy profiler, which is itself actually a Rust program.

The main strength of this approach is that it allows one to begin tracing a Ruby program which is already running, without restarting it; rbspy can ptrace-attach to a running process and begin collecting samples straight away. This allows one to gain insight into intermittent problems once they start happening!

Unfortunately, "collect a stack trace" is quite tricky to do from outside the process; we want the Ruby level stack trace, not (or at least, as well as) the machine stack trace (which could be collected e.g. via DWARF unwinding or frame-pointer chasing). To get the Ruby stack trace requires rummaging around inside the memory of the target process (e.g. with /proc/{pid}/mem on Linux), to find the relevant rb_control_frame_t structures. Rbspy contains a huge list of structure definitions for various versions of Ruby to make this possible, but any time a new version of Ruby is released, a new version of Rbspy must also be released to account for any changes to the layout of Ruby's internal structures.
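For reference, this is roughly how an attached tool rummages through the target's memory on Linux (error handling mostly elided):

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

/* Read len bytes at remote_addr inside the target process. This only
 * succeeds while we're ptrace-attached (or otherwise permitted by the
 * kernel's ptrace access rules). */
ssize_t
read_target_memory(pid_t pid, uintptr_t remote_addr, void *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    ssize_t n = pread(fd, buf, len, (off_t)remote_addr);
    close(fd);
    return n;
}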

In theory, the ptrace approach is also worse from a performance perspective; to collect a sample, execution must switch from the process being profiled to the profiler process, which requires switching memory mappings. Recall that with in-process SIGPROF profiling, the profiler work is done in the same process with the same memory mappings. However, this concern seems to be largely theoretical; I have not actually seen or heard of anybody saying that this makes rbspy too slow for their use case.

Finally, although ptrace-based profiling gets us close to the ideal of not requiring any foresight to begin profiling a running program, there is one small requirement which must be satisfied ahead of time: the Ruby interpreter must not have had its symbols stripped, so that rbspy can find the ruby_current_vm_ptr global. I believe rbspy makes an attempt to find it without the symbol, but it does not claim to be reliable.

Allocation sampling from ptrace-based profiling

To perform allocation profiling from outside the Ruby process, we would need some analogue to RUBY_INTERNAL_EVENT_NEWOBJ. I believe this could be achieved by compiling Ruby with dtrace/systemtap/USDT probes enabled, and having the profiler process write a trap instruction over the top of the probe (which is how e.g. debuggers work). I don't believe rbspy or any other profiler that I found actually does this though.
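For what it's worth, planting such a trap is the same trick debuggers use to set breakpoints. A sketch for x86-64, assuming the probe address has already been recovered from the ELF notes that USDT probes emit:

#include <errno.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Overwrite the first byte at probe_addr with int3 (0xCC), so the
 * target traps into the (attached) profiler whenever it is hit. */
int
plant_trap(pid_t pid, uintptr_t probe_addr)
{
    errno = 0;
    long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)probe_addr, NULL);
    if (word == -1 && errno) return -1;

    long patched = (word & ~0xffL) | 0xCC;
    return (int)ptrace(PTRACE_POKETEXT, pid, (void *)probe_addr,
                       (void *)patched);
}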

Perf-based profiling

Linux has a dedicated subsystem called perf which (amongst many other things) can do low-overhead profiling of arbitrary processes. The sampling is done inside the kernel, so like SIGPROF (and unlike ptrace) collecting a sample requires no switch of address space. However, perf only understands the machine stack, not the Ruby stack, so using perf to profile a Ruby program is currently not very useful (unless you're profiling the Ruby VM itself or a C extension).

Perf actually has a mechanism to help give meaningful stack traces for languages with JITs; the application is expected to write out a mapping of function address -> name to a file /tmp/perf-$PID.map; perf will use this file to symbolicate stacks. Python recently added support for this mechanism as well; however, because CPython does not have a JIT, it works by generating a native trampoline for each function (with a unique address), and calling the appropriate trampolines during Python method execution to make the machine stack mirror the VM one. This mechanism is not implemented in CRuby currently.
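For reference, the map file format itself is very simple - one "START SIZE symbolname" line per function, with START and SIZE in hex. A sketch of emitting entries:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

FILE *
perf_map_open(void)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
    return fopen(path, "a");
}

/* One entry maps a code address range to a human-readable name. */
void
perf_map_write(FILE *map, uintptr_t start, size_t size, const char *name)
{
    fprintf(map, "%lx %zx %s\n", (unsigned long)start, size, name);
    fflush(map);
}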

The proposal: provide an "external debug section"

We ideally would like to have a way to profile Ruby applications which combines the benefits of the SIGPROF mechanism (easy, documented access to Ruby call stacks), the ptrace mechanism (the ability to attach to running processes with a minimum of foresight required), and the perf mechanism (low overhead sample collection without requiring a process address space switch). This proposal aims to provide exactly that.

The Ruby interpreter would have a new global variable rb_debug_ext_section, defined with __attribute__((section("rb_debug_ext"))) so that it appears in a separate section of the ELF binary. This variable would be kept constantly populated with up-to-date information about the call-stacks of all threads/fibers in the program. An external debugging tool could easily ("easily") ptrace a running Ruby program and find the address of this section in its running memory, even if symbols have been stripped. Once an external tool has found this variable, it needs to make sense of it. Ruby would provide a new header file include/ruby/debug_external.h which describes the layout of this structure.
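To make this concrete, here's a sketch of what such a section-placed global might look like. The field names are purely hypothetical - they are not the layout from the draft PR:

#include <stdint.h>

typedef struct rb_debug_ext_section_struct {
    /* Version/flag fields would let external tools detect layout
     * changes without chasing internal VM structures. */
    uint32_t abi_version;
    uint32_t flags;
    /* Pointer to per-thread call-stack information, plus whatever
     * counts a reader needs to walk it. */
    void    *threads;
    uint32_t nthreads;
} rb_debug_ext_section_t;

/* Placing the variable in a dedicated ELF section means a debugger can
 * locate it via the section headers even in a stripped binary. */
__attribute__((section("rb_debug_ext")))
rb_debug_ext_section_t rb_debug_ext_section;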

The simple, slow version of an external profiler using this interface would be a lot like rbspy; it would use ptrace to periodically stop the Ruby process, and then use the contents of rb_debug_ext_section to collect Ruby backtraces for CPU/wall-clock profiling. The benefit, of course, is that it's using a documented interface that can be versioned and kept backwards compatible, rather than having to constantly chase breaking changes to internal Ruby VM structures.

The real power of this approach, however, is that we can do better! The perf subsystem on Linux is able to invoke an eBPF program to actually collect samples (using, I think, PERF_EVENT_IOC_SET_BPF, but I'm not 100% sure) and write them to a ring buffer. The eBPF program can use the debug_external.h header file to figure out the current Ruby stack trace(s) itself, without needing to switch back into the profiler's address space! The profiler program then only needs to run periodically to drain the ring buffer and write samples to disk.
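For the curious, attaching a BPF program to a perf sampling event does go through that ioctl. A sketch, where bpf_prog_fd is assumed to come from an earlier bpf(BPF_PROG_LOAD, ...) call:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Sample the target ~99 times/sec; the kernel runs the BPF program on
 * each sample rather than only recording a machine-stack sample. */
int
attach_bpf_sampler(int bpf_prog_fd, pid_t target_pid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.size = sizeof(attr);
    attr.freq = 1;
    attr.sample_freq = 99;

    int fd = (int)syscall(SYS_perf_event_open, &attr, target_pid,
                          -1 /* any CPU */, -1, 0);
    if (fd < 0) return -1;

    if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) < 0) return -1;
    return fd;
}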

The rider: better stack trace output

So far, we've mentioned that the profiler process will extract a stack trace out of rb_debug_ext_section. Let's turn our attention to what that stack trace would actually be.

Traditional profiling tools for non-dynamic languages usually work by capturing memory addresses for functions, and then deferring the translation of addresses -> names until later (possibly even on a different machine). Ruby, however, is I believe much too dynamic for a similar approach. Methods and classes can be defined and undefined dynamically over the lifetime of the program. Instruction sequences can be created and GC'd as a result of using things like eval. A memory address occupied by one class or iseq might later be occupied by a different one - especially because we now have GC compaction.

Thus, I think the only sensible thing to do is to store strings in the rb_debug_ext_section structure. These strings need to be valid, essentially, whilst the called method/iseq is still on the stack - there's no requirement to retain them until later. The external tool would need to copy these strings into its own process.

What should those strings actually contain? A traditional Ruby backtrace contains a file, a line number, a method name (if available), and some other details about the iseq if present (e.g. whether it's a block or an eval, whether it's <main>, etc).

irb#1(main):003:0> def foo; raise "oh no"; end;
=> :foo
irb#1(main):004:0> foo
(irb#1):3:in `foo': oh no (RuntimeError)
	from (irb#1):4:in `<main>'

I do not believe this is the most useful way to present this information from a profiling perspective, however. When dealing with a large monolithic application, profiles will be collected process-wide and aggregated on some kind of central dashboard; there could be many methods called the same thing (think def create in a Rails controller, for example). The filename can help disambiguate in circumstances like this, but C functions don't even have a filename, so backtraces just print the filename of the thing that called them:

irb#1(main):005:0> 1.times { raise "oh no" }
(irb#1):5:in `block in <main>': oh no (RuntimeError)
	from (irb#1):5:in `times' # I did not define Integer#times in IRB!
	from (irb#1):5:in `<main>'

All of this can be generally figured out by looking at the code, but often when we're looking at e.g. a profiling dashboard, having to keep looking between the flame graph and the code can be rather annoying.

It would be nice if instead, our profiles contained the name of the class on which a method is defined; something like FooClass#the_method. Ruby has a method rb_profile_frame_qualified_method_name which can give useful output like this in many cases. However, in many common cases, e.g. anonymous classes, it will present the name of the class as something like "#<Class:0x0000000000db48e0>". This is worse than useless for profiling: that address will differ between the processes being profiled, so multiple profiles can't be aggregated together. In fact, the address isn't even guaranteed to be constant within the same process anymore, because of compaction.

I would propose that we have a new way of naming methods which produces useful strings in all cases, without any addresses in them. The rules I have come up with so far are:

  • "#<refinement Foo of Bar>" for a refinement module adding methods to
    Bar
  • "#<instance of Foo>" for a particular instance of Foo
  • "#<singleton of Foo>" for Foo's singleton class.
  • "#<anonymous subclass of Foo>" for an anonymous subclass
  • The usual classpath string otherwise

Note that nesting is OK too - so a method defined on the singleton class of a particular Foo object would be "#<singleton of #<instance of Foo>>#the_method".

So, my proposal is that this is the format of the method name that appears in the rb_debug_ext_section structure.

Where have I got to so far?

OK, so I have a draft PR up which should hopefully help make some of this clearer: https://github.com/ruby/ruby/pull/6706. It is of course in no way even close to merge ready, but I'm hoping it helps illustrate some of my ideas here:

  • The debug_external.h file that defines the externally-visible structures
  • Where threads are added and removed from the rb_debug_ext_section structure
  • Where frames are pushed and popped from the rb_debug_ext_section structure
  • The tests illustrate what the new qualified method name output looks like.

What do I want feedback on?

This whole project is obviously still about 20% cooked, so, anything and everything. But more specifically, these are the questions I would really like some help with :)

  • Does this approach seem sensible? Is the basic idea of "expose some documented data structure to other processes that are able to read my memory" a sound one?
  • What should I do with the qualified-method-name-without-addresses work? At the moment I've exposed that logic behind new methods Method#debug_name and Thread::Backtrace::External#debug_label. Is there some appetite to open a proposal to have that in CRuby?
  • Should we even consider changing the default output of thread backtraces to print these qualified names?
  • There are some 'infrastructural' commits on that branch of mine - one to find some extra space in rb_method_entry_t and one to add some new atomic helpers. Is this the sort of thing that would make sense to try and land now or only when this is all closer to done?
  • I haven't yet actually implemented anything to try and read this external debug section. Should it be:
    1. An external repo, in a non-Ruby language, e.g. Rust or something?
    2. A Ruby gem, with a C extension to handle all the ptrace/bpf'y bits?
    3. That gem, but inside the CRuby source tree?
    I'm leaning towards 2, but not really sure.

Thank you, to anybody who actually got to the bottom of this wall of text, and thanks doubly to anybody who provides feedback!

Actions #1

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago

  • Description updated (diff)

Updated by mame (Yusuke Endoh) over 1 year ago

Thanks for the motivation and proposal. I think it's a good idea for Ruby to provide a mechanism for monitoring.

I understood this ticket as follows.


What we want to do: get Ruby stack trace from an external process in a fast and stable way

Current options (and problems):

  • Using rb_profile_frames (hard to call it from outside of the interpreter process)
  • Analyzing rb_control_frame_t, the internal representation of the stack frames (hard to support multiple Ruby versions)

Proposal: manage clean data structure representing the stack frames separately from rb_control_frame_t, and read it from an external process.


I think the proposal has two major concerns.

One is overhead. It seems very wasteful to always manage two data structures representing the stack frames.

The other is data synchronization. An external process may get corrupted data if it attempts to read during a method call (i.e., when the data of the stack frames is being updated). To prevent this, the Ruby interpreter should be suspended at a safe place before reading. However, such a synchronization may increase the overhead even more.

So, I understand the motivation, but it seems like a better API design is needed (@ko1 (Koichi Sasada) seems to have some ideas). It might be a good idea to investigate whether and how other interpreters have solved this problem.

Updated by byroot (Jean Boussier) over 1 year ago

the Ruby interpreter should be suspended at a safe place before reading.

That is what stackprof has been doing historically; the problem with this is that it tends to skew the profiling results, attributing more weight to methods implemented in C than to regular Ruby methods.

Since about a year ago however, when on Ruby 3.0+ (and when YJIT is not enabled), stackprof does collect the stacktrace directly from the signal handler (see https://github.com/tmm1/stackprof/pull/150 cc @jhawthorn (John Hawthorn) & @tenderlovemaking (Aaron Patterson)).

Updated by mame (Yusuke Endoh) over 1 year ago

byroot (Jean Boussier) wrote in #note-3:

Since about a year ago however, when on Ruby 3.0+ (and when YJIT is not enabled), stackprof does collect the stacktrace directly from the signal handler (see https://github.com/tmm1/stackprof/pull/150 cc @jhawthorn (John Hawthorn) & @tenderlovemaking (Aaron Patterson)).

I did not know about this change. I wonder if the signal handler can be executed while the stack is in an incomplete state. Anyway, I think the execution of the interpreter (at least, of the main thread) is stopped while the signal handler is executed.

I think the problem for this ticket's proposal is harder, because the interpreter keeps running in parallel even while the monitor process attempts to peek at the stack trace.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago

Thank you for your feedback!

mame (Yusuke Endoh) wrote in #note-2:

I understood this ticket as follows.


Yes, your summary is exactly right.

One is overhead. It seems very wasteful to always manage two data structures representing the stack frames.

Yes, overhead is the thing I'm most concerned about, especially as the externally-accessible data structure would need to be kept up-to-date all the time in order for "attach to a running process" to work.

I've been using the yjit-bench tooling to measure the impact on my proof-of-concept branch. The results from this naive implementation didn't seem immediately disqualifying, but there's way more work to do and I haven't run it in a totally scientific, systematic way for sharing yet. I think yjit-bench would form a good basis for a yes/no decision on merging an implementation of this proposal? (I would ideally like to run it in a real-world Rails app, but sadly Zendesk (where I work) is still on Ruby 2.7/Rails 5.1... :/ ).

My current plan for doing this in a way that minimises overhead is to:

  • Create all the strings/data structures that are supposed to be externally visible at method-definition time, and store them on the rb_method_entry_t structure (& something similar at compile time for iseq's without methods, I suppose).
  • Embed that extra data structure inside rb_control_frame_t; essentially, when pushing a new rb_control_frame_t onto the VM stack, we also set its "ext_debug" field to contain the new data structure.
  • Then, for external code to inspect the Ruby stack, it just needs to walk through the VM stack frames at some known offset (where the "ext_debug" field lives). This offset can be discovered at runtime by the external tool, because Ruby will store sizeof(rb_control_frame_t) and offsetof(rb_control_frame_t, ext_debug) in the rb_debug_ext_section ELF section.

i.e. the data structure in the new rb_debug_ext_section section will contain pointers to other parts of Ruby's memory. An external tool would need to be able to read any of Ruby's memory (with e.g. ptrace). The new section just gives the external tool a starting point and the knowledge of where to look for e.g. the VM stack frames. I had kind of figured it might be nice to have everything copied into the new section, but that would be prohibitively slow.
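To sketch what that could look like (every name here is hypothetical, and a dummy struct stands in for rb_control_frame_t so the sketch is self-contained):

#include <stddef.h>
#include <stdint.h>

/* Stand-in for rb_control_frame_t with the proposed extra field. */
typedef struct {
    void *pc;
    void *sp;
    void *iseq;
    void *ext_debug;
} fake_control_frame_t;

struct debug_ext_layout {
    uint32_t cfp_sizeof;            /* stride when walking the VM stack */
    uint32_t cfp_ext_debug_offset;  /* where ext_debug lives in a frame */
};

/* Filled in once at VM boot; after this, an external reader can walk
 * the stack array with no compiled-in knowledge of the frame layout. */
void
debug_ext_layout_init(struct debug_ext_layout *l)
{
    l->cfp_sizeof = (uint32_t)sizeof(fake_control_frame_t);
    l->cfp_ext_debug_offset =
        (uint32_t)offsetof(fake_control_frame_t, ext_debug);
}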

All this said, I very much accept that a possible future here is that people think this proposal is a good idea in theory, so I go away and cook up a production quality implementation of it, and it turns out to just be too slow and can't be merged.

The other is data synchronization. An external process may get corrupted data if it attempts to read during a method call (i.e., when the data of the stack frames is being updated). To prevent this, the Ruby interpreter should be suspended at a safe place before reading. However, such a synchronization may increase the overhead even more.

I think this actually won't be a problem (although I definitely need to prove this with an implementation of an external profiler using this interface). My idea is that after pushing a stack frame to the VM stack, we would then change the "current stack frame" pointer in the external debug section to point at the new cfp->ext_debug field. So long as that update is done with an atomic, it should be safe to suspend the Ruby process at literally any CPU instruction and read a consistent version of the new stack representation.

That's what's happening on this line of my sketch.
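In rough C, the publication idea looks like this (names hypothetical): initialize the frame's contents first, then publish it with a single atomic release store, so a reader that suspends the process at any instruction sees either the old frame or the new one, never a half-written one.

#include <stdatomic.h>

struct ext_debug_frame { const char *method_name; /* ... */ };

static _Atomic(struct ext_debug_frame *) current_ext_debug_frame;

void
push_frame(struct ext_debug_frame *frame)
{
    /* frame's fields must be fully written before this store... */
    atomic_store_explicit(&current_ext_debug_frame, frame,
                          memory_order_release);
}

void
pop_frame(struct ext_debug_frame *parent)
{
    /* ...and popping is just publishing the parent frame again. */
    atomic_store_explicit(&current_ext_debug_frame, parent,
                          memory_order_release);
}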

byroot (Jean Boussier) wrote in #note-3:

the problem with this is that it tends to skew the profiling results, attributing more weight to methods implemented in C than to regular Ruby methods.

My hope is I can avoid this problem too by making it OK for the profiler to suspend Ruby at any arbitrary CPU instruction - that's most likely true at the method level, but I haven't 100% figured out how to get line-level information.

Actions #6

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

  • Tracker changed from Bug to Feature
  • Backport deleted (2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN)

Updated by Eregon (Benoit Daloze) over 1 year ago

kjtsanaktsidis (KJ Tsanaktsidis) wrote:

This variable would be constantly kept populated with up-to-date information about all the call-stacks of all threads/fibers in the program.

This sounds like a huge overhead to me.
And also this approach is completely unrealistic for JRuby/TruffleRuby/etc, where keeping this info in native memory all the time would be even more overhead.
Also it's really heavy because this is overhead on every call, not even only during the sample every N milliseconds.
That's also going to cause a lot of bias in the results, because small methods (e.g. Integer#+) will spend a significant fraction of their time just maintaining this information.

The approach using perf as in CPython seems much simpler, but is a little bit hacky and not portable (only works on Linux).

For TruffleRuby I have written some thoughts on how to profile efficiently and provide a Ruby API for it in https://github.com/oracle/truffleruby/issues/2044#issuecomment-654848324.
I think the most interesting idea there is:

Or we could even let CPUSampler capture every millisecond for some duration and then get all the data out at once, that would be a lot more efficient and avoid affecting (and potentially bias) the application with the extra logic to capture at a frequent interval.

So basically there would be some core API to start profiling (and it would sample every N milliseconds), stop profiling, and retrieve the results.
That's agnostic of Ruby VM details, and can use the most efficient way to profile in-process.

To interact with that from another process we could do multiple things, one would be to use a signal to start profiling, and then use the signal again to stop profiling and store the result in some well-known native global variable in a well-known format. We'd need a way to wait until that result is fully stored as signals don't wait of course.
Or each Ruby process could set up some file or socket or so to interact with it from the outside, like the JVM does with JMX/jvmstat/etc.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago

Eregon (Benoit Daloze) wrote in #note-7:

So basically there would be some core API to start profiling (and it would sample every N milliseconds), stop profiling, and retrieve the results. That's agnostic of Ruby VM details, and can use the most efficient way to profile in-process.

OK - it sounds like, rather than pushing the actual implementation of the profiling off to a separate out-of-tree tool, you think the Ruby VM should just take responsibility for doing the profiling itself (and potentially expose the results via both in-process APIs and out-of-process channels like sockets etc).

For TruffleRuby I have written some thoughts on how to profile efficiently and provide a Ruby API for it in https://github.com/oracle/truffleruby/issues/2044#issuecomment-654848324.

Thank you for pointing this out to me. I hadn't considered non-CRuby implementations at all, but yes: if the high-level interface is "dump a CPU sample (or a whole profile, as you suggested) in a standard format", that can be implemented in whatever way is appropriate for a given Ruby implementation.


What I'm hearing is: rather than exposing CRuby internals in a way that can be reliably consumed by external profiling tools, it would be better to implement most of the profiling logic inside the Ruby interpreter itself. This would avoid adding a new, special, brittle API based on reading process memory, while also ensuring that other Rubies can provide sensible implementations. The downside is of course that it results in more code in-tree (including platform-specific code - a random example: perhaps using eBPF on Linux, if it's available and supported by the host system) - but that seems like a sensible trade-off.

Sounds like I should go back to the drawing board, have a shot at implementing this kind of in-tree profiler, and see if I can do so whilst overcoming some of the noted shortcomings of the stackprof/rbspy approaches from this thread. I'll also use what already exists in TruffleRuby as a basis for the API. Then, if doing that looks feasible, I can open a new issue proposing adding it to CRuby?

Thanks again everyone for your feedback here - I'm obviously new to the world of hacking on Ruby so I appreciate you all taking the time to read through this and suggest some alternate approaches.


Small, mostly irrelevant sidebar about perf:

The approach using perf as in CPython seems much simpler, but is a little bit hacky and not portable (only works on Linux).

I'm not really sure this is true - the CPython/perf approach pushes a frame onto the native stack (i.e. the native trampoline for that Ruby function) mirroring every frame pushed onto the VM stack. It seems to me like this would be strictly more overhead than just pushing a few extra bytes onto the VM stack whilst we're already writing the rest of rb_control_frame_t there. I can certainly try this out though.
