Project

General

Profile

Actions

Bug #21220

closed

Memory corruption in update_line_coverage() [write at index -1]

Added by mbcodeandsound (Mike Bourgeous) 7 days ago. Updated about 5 hours ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-linux]
[ruby-core:121556]

Description

Hello!

I have encountered repeatable memory corruption in Ruby 3.4.2 on Ubuntu 24.04.2 LTS, which I believe is happening in update_line_coverage(). I could not reproduce this on Ruby 3.x or earlier. My findings follow. I also have detailed step-by-step notes at https://github.com/mike-bourgeous/mb-sound/issues/36

Summary

update_line_coverage() calls rb_sourceline(), subtracts one from its return value, and uses this as an index into an Array. Sometimes rb_sourceline() returns 0, and when this happens, update_line_coverage() will write to index -1 of the array. This corrupts the heap before the Array, resulting in a program crash later during GC.

As I am new to the Ruby codebase I do not know if it's normal for rb_sourceline() to return 0 and update_line_coverage() should handle it, or if something is wrong in the code that ultimately feeds rb_sourceline().

Symptom

On Linux, affected processes print one of the following errors and exit:

munmap_chunk(): invalid pointer
Aborted (core dumped)

or, if preloading libc_malloc_debug.so

malloc_check_get_size: memory corruption
Aborted (core dumped)

Reproduction

I have a reduced GitHub project that can reproduce the bug consistently both on my machine and in CI. When I try to reduce the size of this repo further, the bug stops happening.

The issue only reproduces locally if the coverage/ directory has a large .resultset.json.

# Repeatedly running the process increases the likelihood of crashing
# as the SimpleCov result file grows.
for f in `seq 1 100`; do echo $f; ruby -r./spec/simplecov_helper.rb bin/midi_roll.rb -c 40 -r 2 spec/test_data/all_notes.mid > /dev/null || break ; done

Research and reasoning

I initially found the crash during a live stream when I was upgrading a project from Ruby 2.7 to Ruby 3.4. The crash occurred when an RSpec test tried to spawn another Ruby process, while using SimpleCov to measure code coverage in both. I discovered a workaround of disabling SimpleCov in the nested process when running tests on Ruby 3.4. I used a somewhat unusual approach to get coverage metrics for subprocesses.

After the stream I wanted to understand what was really happening and see if I could find a way to re-enable test code coverage for subprocesses. I used a combination of Valgrind, GDB, and trial and error to narrow down the site of the crash and the original corruption. I wrote a GDB script to automate information gathering when the GC crash occurred, and Valgrind+vgdb to identify the original write that appeared to cause the corruption.

I reviewed the Git history of update_line_coverage(), rb_sourceline() (and the functions it calls), and a few other functions, but did not find any obvious changes between Ruby 3.3.x and Ruby 3.4.x, so the root cause is somewhere beyond my familiarity with the codebase.

Full details of my process are in my issue notes: https://github.com/mike-bourgeous/mb-sound/issues/36


Files

corruption_c_stack.txt (2.63 KB) corruption_c_stack.txt mbcodeandsound (Mike Bourgeous), 04/07/2025 05:26 PM
corruption_ruby_stack.txt (948 Bytes) corruption_ruby_stack.txt mbcodeandsound (Mike Bourgeous), 04/07/2025 05:26 PM
crash_ruby_stack.txt (4.46 KB) crash_ruby_stack.txt mbcodeandsound (Mike Bourgeous), 04/07/2025 05:26 PM
crash_c_stack.txt (26.2 KB) crash_c_stack.txt mbcodeandsound (Mike Bourgeous), 04/07/2025 05:26 PM

Related issues 1 (1 open0 closed)

Related to Ruby - Bug #21259: The Prism compiler wrongly creates a line number of zeroAssignedprismActions

Updated by mbcodeandsound (Mike Bourgeous) 7 days ago

Something like this should prevent the memory corruption, but may be hiding a deeper issue:

--- thread.c    2025-02-14 14:25:54.000000000 -0700
+++ thread_fix.c        2025-04-07 11:32:53.571115993 -0600
@@ -5675,7 +5675,7 @@
                 rb_ary_push(lines, LONG2FIX(line + 1));
                 return;
             }
-            if (line >= RARRAY_LEN(lines)) { /* no longer tracked */
+            if (line < 0 || line >= RARRAY_LEN(lines)) { /* no longer tracked */
                 return;
             }
             num = RARRAY_AREF(lines, line);

Updated by byroot (Jean Boussier) 6 days ago

I do not know if it's normal for rb_sourceline() to return 0

Yes:


const char *
rb_sourcefile(void)
{
    const rb_execution_context_t *ec = GET_EC();
    const rb_control_frame_t *cfp = rb_vm_get_ruby_level_next_cfp(ec, ec->cfp);

    if (cfp) {
        return RSTRING_PTR(rb_iseq_path(cfp->iseq));
    }
    else {
        return 0;
    }
}

If there is no Ruby frame yet (only C frames) you get zero.

You patch seems OK, even though I think it would be simpler to return right after the rb_sourcefile() call.

Updated by mame (Yusuke Endoh) 6 days ago

Thanks for the report. The issue reproduces on my machine, but it's very rare. I haven't figured out what happens.

byroot (Jean Boussier) wrote in #note-2:

If there is no Ruby frame yet (only C frames) you get zero.

That's true, but I was assuming that RUBY_EVENT_COVERAGE_LINE would never occur at such a timing.
I am OK to merge the proposed patch because it looks benign, but I am still curious how the issue itself happens.

Updated by byroot (Jean Boussier) 6 days ago

The issue reproduces on my machine, but it's very rare.

You can make it happen 100% of the time with the following patch:

diff --git a/thread.c b/thread.c
index 661ee977a9..247fe7237f 100644
--- a/thread.c
+++ b/thread.c
@@ -5662,6 +5662,9 @@ update_line_coverage(VALUE data, const rb_trace_arg_t *trace_arg)
         VALUE lines = RARRAY_AREF(coverage, COVERAGE_INDEX_LINES);
         if (lines) {
             long line = rb_sourceline() - 1;
+            if (line < 0) {
+                rb_bug("fail");
+            }
             long count;
             VALUE num;
             void rb_iseq_clear_event_flags(const rb_iseq_t *iseq, size_t pos, rb_event_flag_t reset);

It's rare because it only crashes if GC trigger which doesn't always happen.

I'm using that patch to try to come up with a reproduction (but you are welcome to dig further too, there's no guarantee I will find the repro).

Updated by byroot (Jean Boussier) 6 days ago

What I've figured for now it that the corruption is triggered by:

  # This require line makes sure the original script file is processed by simplecov
  require File.expand_path($0, '.')
    frame #4: 0x00000001003c468c ruby`rb_bug(fmt=<unavailable>) at error.c:1117:5
    frame #5: 0x00000001003c57dc ruby`update_line_coverage.cold.1 at thread.c:5666:17
    frame #6: 0x000000010021ae64 ruby`update_line_coverage(data=<unavailable>, trace_arg=<unavailable>) at thread.c:5666:17
    frame #7: 0x0000000100277fe0 ruby`exec_hooks_unprotected [inlined] exec_hooks_body(ec=0x0000000141f05bf0, list=0x0000000141f05970, trace_arg=0x000000016fdfc4c8) at vm_trace.c:352:17
    frame #8: 0x0000000100277fc0 ruby`exec_hooks_unprotected(ec=0x0000000141f05bf0, list=0x0000000141f05970, trace_arg=0x000000016fdfc4c8) at vm_trace.c:381:5
    frame #9: 0x0000000100277f30 ruby`rb_exec_event_hooks(trace_arg=<unavailable>, hooks=0x0000000141f05970, pop_p=0) at vm_trace.c:427:13
    frame #10: 0x000000010026c01c ruby`vm_trace_hook [inlined] rb_exec_event_hook_orig(ec=0x0000000141f05bf0, hooks=0x0000000141f05970, flag=65536, self=4302425520, id=0, called_id=0, klass=0, data=36, pop_p=0) at vm_core.h:2179:5
    frame #11: 0x000000010026bfe8 ruby`vm_trace_hook(ec=0x0000000141f05bf0, reg_cfp=0x0000000148127e78, pc=<unavailable>, pc_events=65537, target_event=65536, global_hooks=0x0000000141f05970, local_hooks_ptr=0x000000011b4d4ca8, val=36) at vm_insnhelper.c:7062:9
    frame #12: 0x000000010026b9e4 ruby`vm_trace(ec=0x0000000141f05bf0, reg_cfp=0x0000000148127e78) at vm_insnhelper.c:7170:13
    frame #13: 0x000000010024b128 ruby`vm_exec_core(ec=<unavailable>) at vm.inc:4972:5
    frame #14: 0x0000000100249860 ruby`rb_vm_exec(ec=0x0000000141f05bf0) at vm.c:2597:22
    frame #15: 0x000000010025e978 ruby`rb_iseq_eval(iseq=<unavailable>) at vm.c:2852:11 [artificial]
    frame #16: 0x0000000100123298 ruby`load_iseq_eval(ec=0x0000000141f05bf0, fname=4753022720) at load.c:789:5
    frame #17: 0x0000000100121350 ruby`require_internal(ec=0x0000000141f05bf0, fname=4757156480, exception=1, warn=<unavailable>) at load.c:1297:21
    frame #18: 0x0000000100120608 ruby`rb_require_string_internal(fname=4757156480, resurrect=false) at load.c:1403:22
    frame #19: 0x00000001001204d8 ruby`rb_f_require [inlined] rb_require_string(fname=4757156480) at load.c:1389:12
    frame #20: 0x00000001001204b8 ruby`rb_f_require(obj=<unavailable>, fname=<unavailable>) at load.c:1029:12
    frame #21: 0x0000000100268894 ruby`vm_call_cfunc_with_frame_(ec=0x0000000141f05bf0, reg_cfp=0x0000000148127ee8, calling=<unavailable>, argc=1, argv=0x00000001480280c0, stack_bottom=0x00000001480280b8) at vm_insnhelper.c:3794:11
    frame #22: 0x0000000100263af8 ruby`vm_call_alias(ec=<unavailable>, cfp=<unavailable>, calling=0x000000016fdfd140) at vm_insnhelper.c:4181:12
    frame #23: 0x000000010024cf4c ruby`vm_exec_core [inlined] vm_sendish(ec=0x0000000141f05bf0, reg_cfp=0x0000000148127ee8, cd=0x00006000022b3110, block_handler=0, method_explorer=mexp_search_method) at vm_insnhelper.c:5964:15
    frame #24: 0x000000010024ce48 ruby`vm_exec_core(ec=<unavailable>) at insns.def:898:11
    frame #25: 0x0000000100249860 ruby`rb_vm_exec(ec=0x0000000141f05bf0) at vm.c:2597:22
    frame #26: 0x000000010025e978 ruby`rb_iseq_eval(iseq=<unavailable>) at vm.c:2852:11 [artificial]
    frame #27: 0x0000000100123298 ruby`load_iseq_eval(ec=0x0000000141f05bf0, fname=4755544040) at load.c:789:5
    frame #28: 0x0000000100121350 ruby`require_internal(ec=0x0000000141f05bf0, fname=4726729080, exception=1, warn=<unavailable>) at load.c:1297:21

But I wasn't able to reduce it yet.

Updated by mame (Yusuke Endoh) 6 days ago · Edited

  • Status changed from Open to Assigned
  • Assignee set to prism

@byroot (Jean Boussier) Thanks! I think this is an issue of the prism compiler handling the line number around a flip-flop syntax.

test.rb

TracePoint.new(:line) do |tp|
  p tp
end.enable

load "./target.rb"

target.rb

if true..true
end
$ ~/work/ruby/local/bin/ruby --parser=prism test.rb
#<TracePoint:line test.rb:5>
#<TracePoint:line /tmp/reproduce-simplecov-ruby34-bug/target.rb:0> # <=== This lineno should be 1

$ ~/work/ruby/local/bin/ruby --parser=parse.y test.rb
#<TracePoint:line test.rb:5>
#<TracePoint:line /tmp/reproduce-simplecov-ruby34-bug/target.rb:1>

Updated by byroot (Jean Boussier) 6 days ago

Nice find!

Can source line ever legitimately be 0? Perhaps we are missing some assertions that would have caught this earlier? And perhaps there are other syntax causing the same issue.

Updated by mame (Yusuke Endoh) 6 days ago

@mbcodeandsound Just FYI, I bet you meant to write !(1..16).cover?(channel) in the following line.

https://github.com/mike-bourgeous/reproduce-simplecov-ruby34-bug/blob/d73c3fe80014cb91d8b6c64847581feb8a19d1b6/bin/midi_roll.rb#L42

Thank goodness for us, because it resulted in the discovery of a bug in Ruby :-)

Updated by mame (Yusuke Endoh) 6 days ago

byroot (Jean Boussier) wrote in #note-7:

Can source line ever legitimately be 0?

With eval, it can. It can be even negative.

$ ruby -e 'eval("p __LINE__", nil, "foo", -10000)'
-10000

And when I did this under Coverage.start(eval: true), it does [BUG] :-P

require 'coverage'

Coverage.start(eval: true, lines: true)

eval(<<END, nil, "foo", -1)
p
p
END
$ ~/work/ruby/local/bin/ruby test.rb
foo:-1: [BUG] fail
ruby 3.5.0dev (2025-04-06T03:24:23Z master e25889951f) +PRISM [x86_64-linux]

We can prevent RUBY_EVENT_COVERAGE_LINE from firing in contexts where lineno is negative or zero, but that doesn't happen very often.
So I guess I should merge the proposed patch as a precaution?

Updated by byroot (Jean Boussier) 6 days ago

So I guess I should merge the proposed patch as a precaution?

I think so yes, with that eval code as a test case.

Updated by mbcodeandsound (Mike Bourgeous) 5 days ago · Edited

mame (Yusuke Endoh) wrote in #note-8:

@mbcodeandsound Just FYI, I bet you meant to write !(1..16).cover?(channel) in the following line.

https://github.com/mike-bourgeous/reproduce-simplecov-ruby34-bug/blob/d73c3fe80014cb91d8b6c64847581feb8a19d1b6/bin/midi_roll.rb#L42

Thank goodness for us, because it resulted in the discovery of a bug in Ruby :-)

Yes I did, thanks (0..15 as well rather than 1..16)! I didn't even know about the flip-flop operator, or its controversial almost removal, until today. I also did not expect to encounter a parser bug on top of a coverage-related bug. I'm glad my program's bug uncovered some bugs in Ruby either way, and it's also fun that the bug was on line 42.

Ruby continues to be my favorite language. Thanks for what you do!

Updated by mame (Yusuke Endoh) 5 days ago

I created a PR.

https://github.com/ruby/ruby/pull/13089

The PR should prevent RUBY_EVENT_COVERAGE_LINE from firing on non-positive line numbers. (The fix looked easier than I thought.)

I will create another ticket for the bug of the prism compiler.

Actions #13

Updated by mame (Yusuke Endoh) 5 days ago

  • Related to Bug #21259: The Prism compiler wrongly creates a line number of zero added
Actions #14

Updated by byroot (Jean Boussier) 5 days ago

  • Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN to 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED

Updated by byroot (Jean Boussier) 5 days ago

  • Assignee deleted (prism)
Actions #16

Updated by mame (Yusuke Endoh) 5 days ago

  • Status changed from Assigned to Closed

Applied in changeset git|0d6263bd416338a339651fb97fe4d62701704c4b.


Fix coverage measurement for negative line numbers

Fixes [Bug #21220]

Co-Authored-By: Mike Bourgeous
Co-Authored-By: Jean Boussier

Updated by k0kubun (Takashi Kokubun) about 5 hours ago

  • Backport changed from 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED to 3.2: WONTFIX, 3.3: REQUIRED, 3.4: DONE
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0