Project

General

Profile

Actions

Feature #19790

closed

Optionally write Ruby crash reports into a file rather than STDERR

Added by byroot (Jean Boussier) 12 months ago. Updated 10 months ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:114319]

Description

Use case

On our servers we set /proc/sys/kernel/core_pattern to point to a small utility that report all the crashes happening in production with the associated core dump into BugSnag.

This allowed us to find and fix many Ruby and native extensions bugs in the last few years.

However, these are hard to triage, so we'd like to augment these crash reports with the output of rb_vm_bugreport().

Problem

rb_vm_bugreport() is hard coded to print to STDERR, this makes it hard to extract and parse the report in a production environment, as very often STDERR is shared with other processes, so the crash report is intertwined with logs from other processes.

Feature Request

It would be very useful if Ruby could write the crash report to an arbitrary path rather than STDERR, akin to kernel/core_pattern.

Especially it would be useful if it supported interpolating the crashing process PID with %p like kernel/core_pattern, as it would make it easier to map that report with the core file.

This could be controller by an environment variable such as RUBY_BUGREPORT_PATH. e.g.

RUBY_BUGREPORT_PATH=/var/log/ruby/ruby-crash-pid-%p.log

Optional Features

kernel/core_pattern supports other interpolations, however not all of them would make sense for Ruby to support.

       %%  A single % character.
       %c  Core file size soft resource limit of crashing process
           (since Linux 2.6.24).
       %d  Dump mode—same as value returned by prctl(2)
           PR_GET_DUMPABLE (since Linux 3.7).
       %e  The process or thread's comm value, which typically is
           the same as the executable filename (without path prefix,
           and truncated to a maximum of 15 characters), but may
           have been modified to be something different; see the
           discussion of /proc/pid/comm and /proc/pid/task/tid/comm
           in proc(5).
       %E  Pathname of executable, with slashes ('/') replaced by
           exclamation marks ('!') (since Linux 3.0).
       %g  Numeric real GID of dumped process.
       %h  Hostname (same as nodename returned by uname(2)).
       %i  TID of thread that triggered core dump, as seen in the
           PID namespace in which the thread resides (since Linux
           3.18).
       %I  TID of thread that triggered core dump, as seen in the
           initial PID namespace (since Linux 3.18).
       %p  PID of dumped process, as seen in the PID namespace in
           which the process resides.
       %P  PID of dumped process, as seen in the initial PID
           namespace (since Linux 3.12).
       %s  Number of signal causing dump.
       %t  Time of dump, expressed as seconds since the Epoch,
           1970-01-01 00:00:00 +0000 (UTC).
       %u  Numeric real UID of dumped process.

Additionally, if kernel/core_pattern starts with a pipe (|), then it allows to pipe the core dump to another program, this may also make sense as a feature.

Prior Art

Aside from kernel/core_pattern, some other virtual machine have a similar feature, for instance the JVM has a configurable crash report:

-XX:ErrorFile=/var/log/java/hs_err_pid%p.log

Updated by nobu (Nobuyoshi Nakada) 11 months ago

https://github.com/nobu/ruby/tree/bugreport_path

Implemented %%, %e, %E, %f, %F, %p, %t and %NNN(octal).
When invoking pipe, the string is split with white spaces first without quoting, %040 would be useful to represent a space character, especially on Windows.

Updated by byroot (Jean Boussier) 11 months ago

Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)

I take it that you are supportive of the feature?

Updated by nobu (Nobuyoshi Nakada) 11 months ago

byroot (Jean Boussier) wrote in #note-2:

Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)

I take it that you are supportive of the feature?

I'm positive, but also afraid if this can work well in a signal handler.

Updated by nobu (Nobuyoshi Nakada) 11 months ago

Note that the branch includes a testing feature, and is not ready to merge as-is.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago

afraid if this can work well in a signal handler.

The current bugreporter does lots of things which are not really safe to do in a signal handler... even fprintf is not async-signal-safe (at least on Linux, according to signal-safety(7)). It also accesses the Ruby stack which of course may be in an inconsistent state. I guess it all works "well enough" most of the time.

Looking at your patch, it looks about as signal-safe as the current implementation is. Notably the sprintf calls are unsafe (but we're already doing that), and I'm not sure if the RSTRING_PTR etc manipulations are safe (but we do things like that in the Ruby backtrace-printing code anyway). Nothing else jumps out to my eyes as notably more hazardous. The only thing I might suggest is to push back the opening of the bugreport file (and parsing the %'s) until after the if (crashing) check in rb_vm_bugreport, so that these unsafety's can't cause recursive crashes.

(Also - unrelated to the issue at hand, but I think the crashing = true assignment might strictly speaking need a memory fence after it)


If we wanted Ruby crashdumps to be signal-safe, I guess the best option would be to simply ask the operating system to generate coredumps on crashes, and write some external tooling to do nice things like extract Ruby backtraces out of those coredumps. However that code would be quite platform-specific, and also duplicate decent amounts of Ruby's string/backtrace handling code, so I can see why this hasn't been the approach so far. Not to mention that dealing with coredump files is kind of radioactive because they contain whatever potentially sensitive data that the application was processing at the time.

Updated by byroot (Jean Boussier) 11 months ago

also afraid if this can work well in a signal handler.

Yeah, as @kjtsanaktsidis (KJ Tsanaktsidis) said, in theory the current one already use things that aren't async signal safe, so I don't think it really change much here.

open(2) should be async-signal safe according to https://man7.org/linux/man-pages/man7/signal-safety.7.html, so unless I'm missing something that's the only real addition here.

Updated by ko1 (Koichi Sasada) 11 months ago

Does it print to both STDERR and a file? or only to a file?

Updated by byroot (Jean Boussier) 11 months ago

I would say only a file if the environment variable is set, with perhaps just a message on stderr saying the report was written into that path.

But I don't have a strong opinion about this, if others think it should be written in both, that's workable for me, I just personally don't think it has value.

Updated by matz (Yukihiro Matsumoto) 11 months ago

I accept this proposal, at least we can experiment.
I am a bit worried about calling open(2) in other functions/system calls from the signal handler.

Matz.

Updated by byroot (Jean Boussier) 11 months ago

I accept this proposal

Thank you!

I am a bit worried about calling open(2) in other functions/system calls from the signal handler.

Unless I'm misreading it, POSIX says open(2) should be async signal safe: https://man7.org/linux/man-pages/man7/signal-safety.7.html so we should be good.

Either way, I think the crash reporter is understood as being "best effort", in rare cases, it can already happens that generating it cause a recursive crash, as long as it works in the vast majority of cases, it's not the end of the world. At this stage all bets are off anyway.

@nobu (Nobuyoshi Nakada) do you wish to finish your branch? Let me know if I can help in any way.

Updated by byroot (Jean Boussier) 11 months ago

A note from the dev meeting logs, the env var that was accepted is RUBY_CRASH_REPORT (not RUBY_BUGREPORT_PATH).

Updated by byroot (Jean Boussier) 10 months ago

I've backported @nobu's branch on our 3.2 rubies and it's been working very well. Not too sure what is missing.

Actions #13

Updated by nobu (Nobuyoshi Nakada) 10 months ago

  • Status changed from Open to Closed

Applied in changeset git|bab01d284c86b39952baaddb15901fd4a15e4151.


[Feature #19790] Rename BUGREPORT_PATH as CRASH_REPORT

Actions

Also available in: Atom PDF

Like3
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0