Feature #19790: Optionally write Ruby crash reports into a file rather than STDERR - Ruby - Ruby Issue Tracking System

Actions

Copy link

Feature #19790

closed

Optionally write Ruby crash reports into a file rather than STDERR

Added by byroot (Jean Boussier) about 2 years ago. Updated almost 2 years ago.

Status:

Closed

Assignee:

Target version:

[ruby-core:114319]

Description

Use case¶

On our servers we set /proc/sys/kernel/core_pattern to point to a small utility that report all the crashes happening in production with the associated core dump into BugSnag.

This allowed us to find and fix many Ruby and native extensions bugs in the last few years.

However, these are hard to triage, so we'd like to augment these crash reports with the output of rb_vm_bugreport().

Problem¶

rb_vm_bugreport() is hard coded to print to STDERR, this makes it hard to extract and parse the report in a production environment, as very often STDERR is shared with other processes, so the crash report is intertwined with logs from other processes.

Feature Request¶

It would be very useful if Ruby could write the crash report to an arbitrary path rather than STDERR, akin to kernel/core_pattern.

Especially it would be useful if it supported interpolating the crashing process PID with %p like kernel/core_pattern, as it would make it easier to map that report with the core file.

This could be controller by an environment variable such as RUBY_BUGREPORT_PATH. e.g.

RUBY_BUGREPORT_PATH=/var/log/ruby/ruby-crash-pid-%p.log

Optional Features¶

kernel/core_pattern supports other interpolations, however not all of them would make sense for Ruby to support.

       %%  A single % character.
       %c  Core file size soft resource limit of crashing process
           (since Linux 2.6.24).
       %d  Dump mode—same as value returned by prctl(2)
           PR_GET_DUMPABLE (since Linux 3.7).
       %e  The process or thread's comm value, which typically is
           the same as the executable filename (without path prefix,
           and truncated to a maximum of 15 characters), but may
           have been modified to be something different; see the
           discussion of /proc/pid/comm and /proc/pid/task/tid/comm
           in proc(5).
       %E  Pathname of executable, with slashes ('/') replaced by
           exclamation marks ('!') (since Linux 3.0).
       %g  Numeric real GID of dumped process.
       %h  Hostname (same as nodename returned by uname(2)).
       %i  TID of thread that triggered core dump, as seen in the
           PID namespace in which the thread resides (since Linux
           3.18).
       %I  TID of thread that triggered core dump, as seen in the
           initial PID namespace (since Linux 3.18).
       %p  PID of dumped process, as seen in the PID namespace in
           which the process resides.
       %P  PID of dumped process, as seen in the initial PID
           namespace (since Linux 3.12).
       %s  Number of signal causing dump.
       %t  Time of dump, expressed as seconds since the Epoch,
           1970-01-01 00:00:00 +0000 (UTC).
       %u  Numeric real UID of dumped process.

Additionally, if kernel/core_pattern starts with a pipe (|), then it allows to pipe the core dump to another program, this may also make sense as a feature.

Prior Art¶

Aside from kernel/core_pattern, some other virtual machine have a similar feature, for instance the JVM has a configurable crash report:

-XX:ErrorFile=/var/log/java/hs_err_pid%p.log

Actions

Copy link

#1 [ruby-core:114347]

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

https://github.com/nobu/ruby/tree/bugreport_path

Implemented %%, %e, %E, %f, %F, %p, %t and %NNN(octal).
When invoking pipe, the string is split with white spaces first without quoting, %040 would be useful to represent a space character, especially on Windows.

Actions

Copy link

#2 [ruby-core:114350]

Updated by byroot (Jean Boussier) almost 2 years ago

Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)

I take it that you are supportive of the feature?

Actions

Copy link

#3 [ruby-core:114351]

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

byroot (Jean Boussier) wrote in #note-2:

Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)

I take it that you are supportive of the feature?

I'm positive, but also afraid if this can work well in a signal handler.

Actions

Copy link

#4 [ruby-core:114352]

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Note that the branch includes a testing feature, and is not ready to merge as-is.

Actions

Copy link

#5 [ruby-core:114353]

Updated by kjtsanaktsidis (KJ Tsanaktsidis) almost 2 years ago

afraid if this can work well in a signal handler.

The current bugreporter does lots of things which are not really safe to do in a signal handler... even fprintf is not async-signal-safe (at least on Linux, according to signal-safety(7)). It also accesses the Ruby stack which of course may be in an inconsistent state. I guess it all works "well enough" most of the time.

Looking at your patch, it looks about as signal-safe as the current implementation is. Notably the sprintf calls are unsafe (but we're already doing that), and I'm not sure if the RSTRING_PTR etc manipulations are safe (but we do things like that in the Ruby backtrace-printing code anyway). Nothing else jumps out to my eyes as notably more hazardous. The only thing I might suggest is to push back the opening of the bugreport file (and parsing the %'s) until after the if (crashing) check in rb_vm_bugreport, so that these unsafety's can't cause recursive crashes.

(Also - unrelated to the issue at hand, but I think the crashing = true assignment might strictly speaking need a memory fence after it)

If we wanted Ruby crashdumps to be signal-safe, I guess the best option would be to simply ask the operating system to generate coredumps on crashes, and write some external tooling to do nice things like extract Ruby backtraces out of those coredumps. However that code would be quite platform-specific, and also duplicate decent amounts of Ruby's string/backtrace handling code, so I can see why this hasn't been the approach so far. Not to mention that dealing with coredump files is kind of radioactive because they contain whatever potentially sensitive data that the application was processing at the time.

Actions

Copy link

#6 [ruby-core:114354]

Updated by byroot (Jean Boussier) almost 2 years ago

also afraid if this can work well in a signal handler.

Yeah, as @kjtsanaktsidis (KJ Tsanaktsidis) said, in theory the current one already use things that aren't async signal safe, so I don't think it really change much here.

open(2) should be async-signal safe according to https://man7.org/linux/man-pages/man7/signal-safety.7.html, so unless I'm missing something that's the only real addition here.

Actions

Copy link

#7 [ruby-core:114436]

Updated by ko1 (Koichi Sasada) almost 2 years ago

Does it print to both STDERR and a file? or only to a file?

Actions

Copy link

#8 [ruby-core:114437]

Updated by byroot (Jean Boussier) almost 2 years ago

I would say only a file if the environment variable is set, with perhaps just a message on stderr saying the report was written into that path.

But I don't have a strong opinion about this, if others think it should be written in both, that's workable for me, I just personally don't think it has value.

Actions

Copy link

#9 [ruby-core:114482]

Updated by matz (Yukihiro Matsumoto) almost 2 years ago

I accept this proposal, at least we can experiment.
I am a bit worried about calling open(2) in other functions/system calls from the signal handler.

Matz.

Actions

Copy link

#10 [ruby-core:114483]

Updated by byroot (Jean Boussier) almost 2 years ago

I accept this proposal

Thank you!

I am a bit worried about calling open(2) in other functions/system calls from the signal handler.

Unless I'm misreading it, POSIX says open(2) should be async signal safe: https://man7.org/linux/man-pages/man7/signal-safety.7.html so we should be good.

Either way, I think the crash reporter is understood as being "best effort", in rare cases, it can already happens that generating it cause a recursive crash, as long as it works in the vast majority of cases, it's not the end of the world. At this stage all bets are off anyway.

@nobu (Nobuyoshi Nakada) do you wish to finish your branch? Let me know if I can help in any way.

Actions

Copy link

#11 [ruby-core:114593]