Feature #19541

closed

Proposal: Generate frame unwinding info for YJIT code

Added by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago. Updated 6 months ago.

Status:
Feedback
Assignee:
Target version:
-
[ruby-core:112944]

Description

What is being proposed?

Currently, when Ruby crashes with yjit generated code on the stack, rb_print_backtrace() is unable to actually show any frames underneath the yjit code. For example, if you send SIGSEGV to a Ruby process running yjit, this is what you see:

/ruby/miniruby(rb_print_backtrace+0xc) [0xaaaad0276884] /ruby/vm_dump.c:785
/ruby/miniruby(rb_vm_bugreport) /ruby/vm_dump.c:1093
/ruby/miniruby(rb_bug_for_fatal_signal+0xd0) [0xaaaad0075580] /ruby/error.c:813
/ruby/miniruby(sigsegv+0x5c) [0xaaaad01bedac] /ruby/signal.c:919
linux-vdso.so.1(__kernel_rt_sigreturn+0x0) [0xffff91a3e8bc]
/ruby/miniruby(map<(usize, yjit::backend::ir::Insn), (usize, yjit::backend::ir::Insn), yjit::backend::ir::{impl#17}::next_mapped::{closure_env#0}>+0x8c) [0xaaaad03b8b00] /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:929
/ruby/miniruby(next_mapped+0x3c) [0xaaaad0291dc0] src/backend/ir.rs:1225
/ruby/miniruby(arm64_split+0x114) [0xaaaad0287744] src/backend/arm64/mod.rs:359
/ruby/miniruby(compile_with_regs+0x80) [0xaaaad028bf84] src/backend/arm64/mod.rs:1106
/ruby/miniruby(compile+0xc4) [0xaaaad0291ae0] src/backend/ir.rs:1158
/ruby/miniruby(gen_single_block+0xe44) [0xaaaad02b1f88] src/codegen.rs:854
/ruby/miniruby(gen_block_series_body+0x9c) [0xaaaad03b0250] src/core.rs:1698
/ruby/miniruby(gen_block_series+0x50) [0xaaaad03b0100] src/core.rs:1676
/ruby/miniruby(branch_stub_hit_body+0x80c) [0xaaaad03b1f68] src/core.rs:2021
/ruby/miniruby({closure#0}+0x28) [0xaaaad02eb86c] src/core.rs:1924
/ruby/miniruby(do_call<yjit::core::branch_stub_hit::{closure_env#0}, *const u8>+0x98) [0xaaaad035ba3c] /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492
[0xaaaad035c9b4]

(n.b. - I compiled Ruby with -fasynchronous-unwind-tables -rdynamic -g in cflags to make sure gcc generates appropriate unwind info & keeps the symbol tables).

Likewise, if you attach gdb to a Ruby process with yjit enabled, gdb can't show thread backtraces through yjit-generated code either.

My proposal is that YJIT generate sufficient unwinding and debug information on all platforms to allow both rb_print_backtrace() and the platform's debugger (gdb/lldb/WinDbg) to show:

  • Full stack traces all the way back to main. That is, it should be possible to see frames underneath [0xaaaad035c9b4] from the backtrace above.
  • Names for the dynamically generated yjit blocks (e.g. instead of [0xaaaad035c9b4], we should see something like yjit$$name_of_ruby_method, where name_of_ruby_method is the label for the iseq this is JIT'd code for).

Motivation

I have a few motivations for wanting this. Firstly, I feel this functionality is independently useful. When Ruby crashes, the more information we can get, the more likely we are to find the root cause. Likewise, the same principle applies to debugging with gdb - you can get a fuller understanding of what the process is doing if you see the whole stack.

I have often found attaching gdb to the Ruby interpreter helps in understanding problems in Ruby code or C extensions and is something I do relatively frequently; yjit breaking that will definitely be inconvenient for me!

Implementation

I have a draft implementation here: https://github.com/ruby/ruby/pull/7567. It's currently missing tests & platform support (it only works on Linux aarch64). Also, it implements unwind info generation, so unwinding can work through yjit code, but it does not currently emit symbols to give names to those yjit frames.

My PR contains a document which explains how the Linux interfaces for registering unwind info for JIT'd code work, so I won't duplicate that information here.

The biggest implementation question I had is around the use of Rust crates. Currently, I prototyped my implementation using the gimli & object crates, for generating DWARF info and ELF binaries. However, the yjit build purposefully does not use cargo & external crates for release builds. There are a few different ways we could go here:

  • Don't use the gimli & object crates; instead, re-implement all debug info & object file generation code in yjit.
  • Don't use the crates; instead, link against C libraries to provide this functionality & call them from Rust (perhaps some combination of libelf, libdw, libbfd, or llvm might do what we need)
  • Use cargo after all for the release build & download the crates at build-time
  • Use cargo for the release build, but vendor everything, so the build doesn't need to download anything
  • Only make unwind info generation available in dev mode where cargo is used, and so mark the gimli/object dependencies as optional in Cargo.toml.

We'd need to decide on one of these approaches for this proposal to work. I don't really have a strong sense of the pros/cons of each.

(Side note - my PR actually depends on a fork of gimli - I've been discussing adding the needed interfaces upstream here: https://github.com/gimli-rs/gimli/issues/648).

Benchmarks

I ran the yjit-bench suite on my branch and compared it to Ruby master:

This is a (very simple) comparison:

-------------- ------------ ------------ ---------------
bench          yjit (ms)    branch (ms)  branch/yjit (%)
activerecord   97.5         98.5         101.03%
hexapdf        2415.3       2458.2       101.78%
liquid-c       61.9         63.1         101.94%
liquid-render  135.3        135.0        99.78%
mail           104.6        105.5        100.86%
psych-load     1887.1       1922.0       101.85%
railsbench     1544.4       1556.0       100.75%
ruby-lsp       88.4         89.5         101.24%
sequel         147.5        151.1        102.44%
binarytrees    303          305.6        100.86%
chunky_png     1075.8       1079.4       100.33%
erubi          392.9        392.3        99.85%
erubi_rails    14.7         14.7         100.00%
etanni         792.3        791.4        99.89%
fannkuchredux  3815.9       3813.6       99.94%
lee            1030.2       1039.2       100.87%
nbody          49.2         49.3         100.20%
optcarrot      4142         4143.3       100.03%
ruby-json      2860.7       2874.0       100.46%
rubykon        7906.6       7904.2       99.97%
30k_ifelse     348.7        345.4        99.05%
30k_methods    828.6        831.8        100.39%
cfunc_itself   28.8         28.9         100.35%
fib            34.4         34.5         100.29%
getivar        115.5        109.7        94.98%
keyword_args   37.7         38.0         100.80%
respond_to     26           26.1         100.38%
setivar        33.8         33.5         99.11%
setivar_object 208.7        194.3        93.10%
str_concat     52.6         52.2         99.24%
throw          23.8         24.1         101.26%
-------------- ------------ ------------ ---------------

It seems like the performance impact of generating and registering the debug info is marginal.
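
One way to condense the table above into a single figure is the geometric mean of the per-benchmark branch/yjit ratios. The following is a minimal sketch that uses the timings exactly as listed in the table; the function names are illustrative only and are not part of yjit-bench.

```python
import math

# (benchmark, yjit ms, branch ms), copied from the table above.
RESULTS = [
    ("activerecord", 97.5, 98.5), ("hexapdf", 2415.3, 2458.2),
    ("liquid-c", 61.9, 63.1), ("liquid-render", 135.3, 135.0),
    ("mail", 104.6, 105.5), ("psych-load", 1887.1, 1922.0),
    ("railsbench", 1544.4, 1556.0), ("ruby-lsp", 88.4, 89.5),
    ("sequel", 147.5, 151.1), ("binarytrees", 303.0, 305.6),
    ("chunky_png", 1075.8, 1079.4), ("erubi", 392.9, 392.3),
    ("erubi_rails", 14.7, 14.7), ("etanni", 792.3, 791.4),
    ("fannkuchredux", 3815.9, 3813.6), ("lee", 1030.2, 1039.2),
    ("nbody", 49.2, 49.3), ("optcarrot", 4142.0, 4143.3),
    ("ruby-json", 2860.7, 2874.0), ("rubykon", 7906.6, 7904.2),
    ("30k_ifelse", 348.7, 345.4), ("30k_methods", 828.6, 831.8),
    ("cfunc_itself", 28.8, 28.9), ("fib", 34.4, 34.5),
    ("getivar", 115.5, 109.7), ("keyword_args", 37.7, 38.0),
    ("respond_to", 26.0, 26.1), ("setivar", 33.8, 33.5),
    ("setivar_object", 208.7, 194.3), ("str_concat", 52.6, 52.2),
    ("throw", 23.8, 24.1),
]


def geomean_ratio(rows):
    """Geometric mean of branch/yjit over all benchmarks (1.0 means no change)."""
    logs = [math.log(branch / base) for _, base, branch in rows]
    return math.exp(sum(logs) / len(logs))


def largest_slowdown(rows):
    """Row with the highest branch/yjit ratio."""
    return max(rows, key=lambda row: row[2] / row[1])


if __name__ == "__main__":
    print(f"geomean branch/yjit: {geomean_ratio(RESULTS):.4f}")
    name, base, branch = largest_slowdown(RESULTS)
    print(f"largest slowdown: {name} ({branch / base:.2%})")
```

A geometric mean is used here because the values being averaged are ratios; a single run per benchmark still says nothing about run-to-run noise.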

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago

  • Description updated (diff)

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago

  • Description updated (diff)

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago

A thought crossed my mind - I wonder if this should actually be implemented in the C parts of Ruby, rather than in Rust, so it can be shared with RJIT? Or is debug object generation something each JIT should do for itself?

Updated by hsbt (Hiroshi SHIBATA) about 1 year ago

  • Status changed from Open to Assigned
  • Assignee set to yjit

Updated by k0kubun (Takashi Kokubun) about 1 year ago

I wonder if this should actually be implemented in the C parts of ruby, rather than in rust.

RJIT's goal is to help YJIT. We shouldn't consider writing something in C instead of Rust just for RJIT. We should choose what's the best for YJIT, and RJIT could separately maintain it if necessary.

Updated by alanwu (Alan Wu) about 1 year ago

Thank you for looking at this. You clearly put in a lot of effort. However,
this proposal conflates too many concerns. While the goals are related, the
solutions for each one have different constraints. I suggest sending
smaller proposals in the future.

I'll respond to just the unwinding concern here, because that has an
implemented proof-of-concept. At first blush, for solving a debugging concern,
the added complexity from depending on the massive gimli crate feels bad.
Also, the need to generate ELF objects in-memory is antithetical to YJIT's goal
of keeping memory consumption low.

For the goal of providing unwindability in release builds, generating DWARF and
ELF objects in memory is more complex than it needs to be. DWARF unwind is very
expressive, way more powerful than what we need to unwind through YJIT
generated frames. The unused complexity shows up as extra memory consumption.
The Linux kernel has its own unwinding format partly because DWARF is more
complex than what they need. What YJIT needs is even simpler than what Linux
needs. Since you mentioned WinDbg, unwindability is technically an ABI
requirement on Windows. The interface there doesn't require pre-registration
for each piece of code; it simply calls back when unwinding needs to happen.
That interface, combined with a designed prologue, should allow for unwinding
through generated frames without any extra metadata. This is ideal memory
consumption wise.

For platforms YJIT already supports, we might have no choice but to register
code beforehand. Registering using the GDB interface seems less than ideal,
though. It requires generating ELF objects in-memory, which is bad for memory
consumption, and it's also known not to have the best speed. For cases
where Ruby already links with libunwind (some Linux distros and BSDs), we can
register with its dynamic interface, or use it to teach addr2line.c how
to unwind through YJIT frames without needing to generate extra metadata.

Note that on A64 macOS, because Apple mandates frame pointer unwinding,
LLDB already unwinds through YJIT frames just fine. We generate the same code
on A64 Linux with GNU userspace, but the same guarantee doesn't exist there.

In summary, I do agree that we should try to give fuller backtraces in the bug
reporter and help debuggers, but if the proposal is "let's use a bunch of
memory and take on a few big dependencies to do it", then the answer is no.
That competes with and undermines too many other goals.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago

Thanks Alan for your feedback and clarifying YJIT's goals for me.

First off, let me confirm I'm on the same page as you about a couple of things.

I totally agree the unwind-info-registration APIs in GNU land are awful. Windows does this way better with RtlInstallFunctionTableCallback - it covers both in-process and out-of-process unwinding in a lazy way. Alas, not what we have available on GNU/Linux.

I agree with your premise that YJIT does not muck around with the stack in very creative ways, and very little information is actually needed to unwind through YJIT frames. My approach in my POC was to make Ruby use the most obvious and well-exercised platform APIs for registering unwind info, which to me seemed to be __register_frame and __jit_debug_register_code. That is, I went through the rigmarole of DWARF CFI & ELF generation to try and be a good platform citizen.

I also agree however that this is pretty heavyweight - the ELF file generation especially because it has to be regenerated periodically if it runs out of "free" space to jam more debuginfo in there.

Finally, I also acknowledge that adding Rust dependencies increases compile times & is a huge pain for downstream distributors etc, and you've gone to quite some effort to not do that - I assume these are the main issues with actually adding the gimli/object dependencies per-se?

The general "vibe" I get from your feedback is that we don't want to introduce huge implementation complexity just to make YJIT use the "standard" unwinding mechanisms; rather, we should actually implement the simplest thing that works for YJIT, and then tailor that to platform interfaces.

One final thing to clarify though:

For cases where Ruby already links with libunwind (some Linux distros and BSDs), we can register with its dynamic interface

If you're referring to UNW_INFO_FORMAT_DYNAMIC info, that's actually totally unimplemented in libunwind for anything except Itanium (which... I assume is not a target YJIT wants to support xD ). UNW_INFO_FORMAT_TABLE works AFAICT, but requires generating DWARF CFI info (which is something we'd like to avoid).


OK, so what can we do that satisfies the following constraints?

  1. Lets us unwind stacks containing YJIT frames in both GDB and the crash reporter
  2. Does not require us to construct complex in-memory structures which are really designed for on-disk use (i.e. no ELF files)
  3. Does not require us to use DWARF CFI (which is far too complex for the simple stacks that YJIT lays out)
  4. Has very little runtime CPU cost to construct and register
  5. Has very little runtime memory cost to have hanging around

I think I have a rough idea of something that might fit the bill.

Firstly, let's have YJIT generate a "compact unwind info format" of our own. I definitely need to experiment with implementation before being too specific here, but roughly...

  • There would actually be two separate tables - one for inline, and one for outline.
  • It would be sorted by IP
  • It would be only appended to when code is generated - this is because (normally) the IP of generated code for each code block only increases. This means hopefully a minimum of gratuitous memcpy'ing around of data (except for when it needs to grow).
  • Need to do something about Code GC, which violates the "IP only increases" invariant. Since Code GC frees only whole pages, perhaps the unwind info could be per-page, and the pages would be stored in a hash table. That would make it O(1) both to get the right block of unwind info to append to when generating code, as well as when looking up the unwind info for a given IP.
  • For each block, the unwind info would store:
    • Start/end IP of the block
    • Whether or not this block has a frame_setup prologue
    • Whether or not this block has a frame_teardown epilogue
    • Whether or not this block is split into the next inline/outline page as well
    • ~A pointer to the iseq structure~ (this can come later - it'd be needed for naming the block, but also introduces some fun GC mark/compaction issues).
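
To make the shape of such a table concrete, here is a minimal sketch in Python (chosen only so it can be run in isolation; a real version would live inside YJIT). Every name, the page size, and the choice to record a block that crosses pages as one entry per page are assumptions made for illustration, not part of YJIT or of the PR.

```python
import bisect
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PAGE_SIZE = 16 * 1024  # hypothetical code page size, for illustration only


@dataclass(frozen=True)
class BlockUnwindInfo:
    start_ip: int
    end_ip: int                    # exclusive
    has_prologue: bool             # block begins with frame setup
    has_epilogue: bool             # block ends with frame teardown
    continues_on_next_page: bool   # block is split into the next page


@dataclass
class PageEntries:
    starts: List[int] = field(default_factory=list)              # sorted start IPs
    blocks: List[BlockUnwindInfo] = field(default_factory=list)  # parallel to starts


class UnwindTable:
    """Per-page unwind info, keyed by page index in a hash table."""

    def __init__(self) -> None:
        self.pages: Dict[int, PageEntries] = {}

    def append(self, info: BlockUnwindInfo) -> None:
        page = self.pages.setdefault(info.start_ip // PAGE_SIZE, PageEntries())
        # Within a page, code is assumed to be emitted at increasing addresses,
        # so a plain append keeps the entries sorted without moving any data.
        assert not page.blocks or info.start_ip >= page.blocks[-1].end_ip
        page.starts.append(info.start_ip)
        page.blocks.append(info)

    def free_page(self, page_index: int) -> None:
        # Code GC frees whole pages, so the page's entries are dropped together.
        self.pages.pop(page_index, None)

    def lookup(self, ip: int) -> Optional[BlockUnwindInfo]:
        page = self.pages.get(ip // PAGE_SIZE)
        if page is None:
            return None
        i = bisect.bisect_right(page.starts, ip) - 1
        if i < 0:
            return None
        info = page.blocks[i]
        return info if ip < info.end_ip else None
```

In this sketch, finding the page is a hash lookup, and finding the block inside the page is a binary search over that page's sorted entries.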

If we're allowed to rely on the frame pointer being set up [1], and on the shape of our prologues/epilogues, I think that's all the information needed to do frame unwinding.

[1] This would mean we'd need to add it to x86_64 code generation. The register isn't actually used for any of YJIT's generated code for any other reason, so I doubt it'll have a big performance impact.

Now, how do we connect that to GDB & the crash backtracer? Let's treat those separately...

For GDB, there are actually three JIT code registration mechanisms (that I could count)...

  1. The one using __jit_debug_register_code (which I used in my POC): https://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html
  2. One that lets you load a .so file in GDB to help it understand your JIT stacks: https://sourceware.org/gdb/onlinedocs/gdb/Writing-JIT-Debug-Info-Readers.html
  3. One based on the Python interface: https://sourceware.org/gdb/onlinedocs/gdb/Unwinding-Frames-in-Python.html

We already ship GDB helpers with Ruby (in .gdbinit). It's hopefully possible to write some Python which can unwind YJIT stacks using the custom unwind info, and also distribute that inside the Ruby source tree (perhaps it's even possible to distribute it inline in .gdbinit - I can experiment with the specifics of this).

For the crash backtracer, I think libunwind can be bent into shape for our purposes.

  • We can add a configure flag --with-libunwind or such to compile Ruby against libunwind if present, even when that would not normally be the case on a given platform.
  • If libunwind is present, rather than using backtrace(3) to collect the stack all at once, use unw_init_local to begin unwinding, and unwind frame-by-frame with unw_step.
  • If we encounter an IP we recognise as belonging to YJIT, do NOT call unw_step to unwind that frame.
  • Instead, perform the unwinding logic ourselves using the YJIT unwind info, and then construct a unw_context_t for the previous frame by hand (it looks like the necessary struct definitions are present in the libunwind-${arch}.h header files).
  • Start unwinding again based on this custom context struct by calling unw_init_local; this should start unwinding from the frame below if we've done it right.
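
The control flow described in the bullets above can be sketched as follows. This is Python rather than C so that it runs without libunwind; native_step is a stand-in for unw_step, jit_step is a stand-in for the hand-written YJIT logic, and all names here are hypothetical. The JIT stepper assumes a conventional frame record, with the caller's frame pointer at [fp] and the return address at [fp + 8].

```python
from typing import Callable, List, NamedTuple, Optional


class Regs(NamedTuple):
    """Minimal register state describing one frame."""
    ip: int
    sp: int
    fp: int


# A stepper maps a frame's registers to its caller's, or None at the stack bottom.
Stepper = Callable[[Regs], Optional[Regs]]


def make_fp_stepper(read_u64: Callable[[int], int]) -> Stepper:
    """Hand-rolled step for JIT frames based only on the frame pointer."""
    def step(regs: Regs) -> Optional[Regs]:
        if regs.fp == 0:
            return None
        return Regs(ip=read_u64(regs.fp + 8),   # return address
                    sp=regs.fp + 16,            # stack pointer after return
                    fp=read_u64(regs.fp))       # caller's frame pointer
    return step


def collect_backtrace(start: Regs,
                      in_jit: Callable[[int], bool],
                      native_step: Stepper,
                      jit_step: Stepper,
                      max_frames: int = 256) -> List[int]:
    """Walk the stack, switching steppers depending on who owns each IP."""
    ips: List[int] = []
    regs: Optional[Regs] = start
    while regs is not None and len(ips) < max_frames:
        ips.append(regs.ip)
        step = jit_step if in_jit(regs.ip) else native_step
        regs = step(regs)
    return ips


if __name__ == "__main__":
    # Simulated stack: native crash site -> JIT block B -> JIT block A -> native main.
    memory = {0x9000: 0x9020, 0x9008: 0x7F000100,
              0x9020: 0x9040, 0x9028: 0x2000}
    native_callers = {0x1000: Regs(ip=0x7F000010, sp=0x8FF0, fp=0x9000)}
    trace = collect_backtrace(
        Regs(ip=0x1000, sp=0x8F00, fp=0x8F80),
        in_jit=lambda ip: 0x7F000000 <= ip < 0x7F100000,
        native_step=lambda regs: native_callers.get(regs.ip),
        jit_step=make_fp_stepper(memory.__getitem__),
    )
    print([hex(ip) for ip in trace])
```

In the C version, the step that this sketch glosses over is turning the recovered registers back into a context that unw_init_local accepts; whether that can be done cleanly is exactly the open question in the plan.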

Essentially, the tradeoff here is that we can make unwind info generation much simpler, at the expense of making unwinding itself more complex (because we can't just rely on the platform's DWARF unwinder). That seems like a reasonable tradeoff to me.

Does this sound like a fruitful path to go down? I should have a few weeks more or less full time to work on this coming up (I'm taking a sabbatical from work to do open source stuff!), so I'd really like to know if something along these lines would be useful, more in line with YJIT's goals, and something which would be considered for merging.

Thanks again for your time, I appreciate it.


Footnote:

it's (GDB's JIT interface) also known not to have the best speed.

I think this concern only applies while GDB is actually attached; I don't think the speed of running the program under a debugger should be a primary concern of this unwinding work. This is moot anyway though because the ELF generation is a huge pain as you point out.

Updated by alanwu (Alan Wu) about 1 year ago

I would explore solutions that involve generating no extra metadata
because that's ideal, and may help the Windows port in the future.
For example, if we rely on frame pointer unwinding, it'd be incorrect when the PC is in
sections of the prologue/epilogue, but would cover most crashes.
We could read around the PC to figure out how to unwind from those sections
for full robustness later.

For this setup, we do need to change codegen for x64 to set up RBP, as you mentioned.
I don't expect noticeable perf loss either and will benchmark it. Setting up the
frame pointer opens up different options when profiling with Linux perf,
so it seems to be worth it regardless.

The GDB Python unwinding interface does seem enticing. Thanks for bringing it up.
Maybe though, on x64 GDB already uses the frame pointer as a fallback so will work
without extra help? Seems like A64 will need the script, though.

The plan with manually unwinding with libunwind sounds worth trying. It
does seem like the library is not really designed to directly support this;
sorry I didn't check that before my previous reply. If it doesn't work, maybe a solution
that registers a single global entry of DWARF unwind that assumes frame pointer validity which
covers the entire code region could work? That should make complexity and memory
consumption more palatable.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago

For example, if we rely on frame pointer unwinding, it'd be incorrect when the PC is in sections of the prologue/epilogue, but would cover most crashes

I do agree that this will work pretty much all of the time yeah. I want to make it work in the prologue/epilogue, but I guess that's more for completeness's sake rather than any real utility, so yeah it may not be worth generating metadata for this.

We could read around the PC to figure out how to unwind from those sections for full robustness later.

Oh interesting - I guess if we can rely on YJIT not generating opcodes like push %rbp; mov %rsp, %rbp and stp x29, x30, [sp,#-0x10]!; mov x29, sp anywhere else except the prologue, then yeah the unwinder (both the in-process one for crash reporting, and the out-of-process one in GDB's python interface) can nose around the PC and work out if it's inside the prologue/epilogue or not.
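
As a rough illustration of "nosing around the PC", here is a minimal x86_64-only sketch. It assumes the prologue is exactly push %rbp followed by mov %rsp, %rbp (AT&T operand order) in the encodings 55 and 48 89 e5, and that this byte sequence appears nowhere else in generated code; whether YJIT emits that particular mov encoding is an assumption, and all names are hypothetical. The epilogue would need the same treatment and is left out here.

```python
from enum import Enum
from typing import Callable, Tuple


class PrologueState(Enum):
    BEFORE_PUSH = 1   # pc is at `push %rbp`: nothing has been saved yet
    AFTER_PUSH = 2    # pc is at `mov %rsp, %rbp`: old rbp saved, rbp not yet updated
    FRAME_READY = 3   # anywhere else: rbp points at a valid frame record


PUSH_RBP = bytes([0x55])
MOV_RSP_RBP = bytes([0x48, 0x89, 0xE5])  # assumed encoding; 48 8b ec is also valid


def classify_pc(code: bytes, offset: int) -> PrologueState:
    """Inspect the bytes around `offset` to see where in the prologue the pc is."""
    if (code[offset:offset + 1] == PUSH_RBP
            and code[offset + 1:offset + 4] == MOV_RSP_RBP):
        return PrologueState.BEFORE_PUSH
    if (offset >= 1
            and code[offset - 1:offset] == PUSH_RBP
            and code[offset:offset + 3] == MOV_RSP_RBP):
        return PrologueState.AFTER_PUSH
    return PrologueState.FRAME_READY


def caller_state(state: PrologueState, sp: int, fp: int,
                 read_u64: Callable[[int], int]) -> Tuple[int, int, int]:
    """Return (return_address, caller_sp, caller_fp) for the given pc position."""
    if state is PrologueState.BEFORE_PUSH:
        # Only the return address is on the stack; rbp still belongs to the caller.
        return read_u64(sp), sp + 8, fp
    if state is PrologueState.AFTER_PUSH:
        # Saved rbp is at [sp], return address just above it.
        return read_u64(sp + 8), sp + 16, read_u64(sp)
    # Normal case: plain frame pointer unwinding.
    return read_u64(fp + 8), fp + 16, read_u64(fp)
```

All three cases recover the same caller state for the same frame; only the location of the saved values differs, which is why no per-block metadata is needed under these assumptions.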

It seems I might be able to spike this out by writing a GDB python unwinder entirely outside the Ruby tree (for aarch64; need to add the frame pointers for x86_64 first before it'd work there). Maybe the way to go is for me to write that, share it around, and once it's mostly working, then port its logic into the Ruby crash reporter as well.

This does leave the question open of how to get some kind of sensible name for the yjit frames that isn't just a random address. I suppose if we're going with an approach of "smart unwinders that understand how YJIT lays out code", maybe I can get the unwinder to figure something out based on the CFP pointer. It's in a callee-saved register, and most unwinding schemes generally make it possible to recover these (I think it might be required for C++ exception unwinding to work). Otherwise perhaps we can spill it to the stack as well - I'll play around.

Updated by Eregon (Benoit Daloze) 11 months ago

I think supporting this could also help better profiling with YJIT enabled: https://github.com/tmm1/stackprof/pull/180#issuecomment-1556139533

Updated by k0kubun (Takashi Kokubun) 6 months ago

  • Status changed from Assigned to Feedback

I added --yjit-perf on Ruby master https://github.com/ruby/ruby/pull/8697. It does not unwind Ruby frames in a single YJIT frame like DWARF would be able to do. But I think it's similar to how C functions are profiled with --call-graph fp and it's a fair choice under the trade-off: --call-graph dwarf can unwind inlined functions but is slower than --call-graph fp. Given your PoC needed +1282 lines while our PR was +133 lines, this seems to have better maintainability while still being practical.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 6 months ago

That's awesome and I'm super looking forward to trying to profile our apps with perf once I finally get YJIT enabled! Assuming most of the hot code turns out to be JIT'd, this should definitely give a fairly good picture of the Ruby stack.

One question - the perf map file is going to grow without bound for a long-running process, right? I guess there's no real way around this based on how the file is specified though... I'll have a look at that problem when I actually run into it anyway.

This also doesn't address debugger support, but maybe for that I might go poke at GDB (ick) and see if it can a) be made to try frame pointer unwinding unconditionally, and b) use the perf map file as a source of symbols.
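
For idea (b), the perf map format is simple enough to parse by hand: perf looks for /tmp/perf-<pid>.map, where each line is a start address and a size in hexadecimal followed by a symbol name. Below is a minimal sketch of symbolizing an address from such a file; the function names and the symbol names in the usage example are made up, and letting later lines override earlier ones for the same start address is an assumption about how to cope with addresses being reused.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Entry = Tuple[int, int, str]  # (start, size, name)


def parse_perf_map(text: str) -> List[Entry]:
    """Parse perf map lines of the form `START SIZE NAME` (hex, hex, rest of line)."""
    by_start: Dict[int, Tuple[int, str]] = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, size, name = int(parts[0], 16), int(parts[1], 16), parts[2]
        # Assumption: if an address is reused, the most recent line wins.
        by_start[start] = (size, name)
    return sorted((start, size, name) for start, (size, name) in by_start.items())


def symbolize(entries: List[Entry], addr: int) -> Optional[str]:
    """Return `name+0xoffset` for the entry containing addr, or None."""
    starts = [entry[0] for entry in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, size, name = entries[i]
    if addr >= start + size:
        return None
    return f"{name}+0x{addr - start:x}"


if __name__ == "__main__":
    sample = (
        "aaaad035c000 40 example_block_a\n"
        "aaaad035c040 20 example_block_b\n"
    )
    entries = parse_perf_map(sample)
    print(symbolize(entries, 0xAAAAD035C9B4))  # outside both ranges
    print(symbolize(entries, 0xAAAAD035C044))
```

This only resolves names; it does nothing for unwinding, which still depends on the frame pointer question in (a).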

Anyway thanks so much for working on this, it's going to be really useful I think.

Updated by k0kubun (Takashi Kokubun) 6 months ago

One question - the perf map file is going to grow without bound for a long-running process, right? I guess there's no real way around this based on how the file is specified though... I'll have a look at that problem when I actually run into it anyway.

We plan to add an option to disable Code GC (compilation stops when it reaches the code size limit) in Ruby 3.3, and using that option should fix that problem. Given that perf reads map files after execution, it's inherently incompatible with Code GC. We might want to let --yjit-perf automatically enable that option too.

This also doesn't address debugger support, but maybe for that I might go poke at GDB (ick) and see if it can a) be made to try frame pointer unwinding unconditionally, and b) use the perf map file as a source of symbols.

Does GDB not use frame pointer unwinding at all, even for frames with no debug information in the address? I was hoping --yjit-perf=fp (which enables only frame pointers) can be sometimes used for helping GDB unwind frames.

Updated by k0kubun (Takashi Kokubun) 6 months ago

b) use the perf map file as a source of symbols.

So this is a post I found with quick googling: https://stackoverflow.com/questions/42739893/force-gdb-to-use-frame-pointer-based-unwinding

They say you can at least define a routine in ~/.gdbinit that unwinds frames using frame pointers, which seems to help.

It also says:

with other types of debug info such as .debug_info. Apparently this triggers gdb to stop using frame-pointer (rbp) based unwinding for any functions from that object

JIT code is not functions from an object with debug info. So GDB might choose to use frame pointers for JIT frames?

Also, it's kind of hard for me to test the behavior of GDB since it "sometimes" unwinds a backtrace beyond YJIT frames successfully. Right now, I'm not sure when it succeeds and when it fails.
