Project

General

Profile

Bug #15362

[PATCH] Avoid GCing dead stack after switching away from a fiber

Added by alanwu (Alan Wu) 10 months ago. Updated 8 months ago.

Status:
Closed
Priority:
Normal
Target version:
ruby -v:
Backport:
[ruby-core:90193]

Description

Hello! I have a patch that fixes Bug #14561. It's not a platform specific issue but
it affects the default build configuration for MacOS and is causing segfaults on 2.5.x.
I've put the test for this in a separate patch because I'm not sure if we want to have
a 5 second test that only matters for non-default build configs and doesn't catch things reliably on Linux.

I tested this on both trunk and ruby_2_5, on MacOS and on Linux, on various build configs.

Please let me know if anything in my understanding is wrong. I've pasted my commit message below.


Fibers save execution contextes, and execution contexts include a native
stack pointer. It may happen that a Fiber outlive the native thread
it executed on. Consider the following code adapted from Bug #14561:

enum = Enumerator.new { |y| y << 1 }
thread = Thread.new { enum.peek }  # fiber constructed inside the
                                   # block and saved inside `enum`
thread.join
sleep 5      # thread finishes and thread cache wait time runs out.
             # Native thread exits, possibly freeing its stack.
GC.start     # segfault because GC tires to mark the dangling stack pointer
             # inside `enum`'s fiber

The problem is masked by FIBER_USE_COROUTINE and FIBER_USE_NATIVE,
as those implementations already do what this commit does.
Generally on Linux systems, FIBER_USE_NATIVE is 1 even when
one uses ./configure --disable-fiber-coroutine, since most
Linux systems have getcontext() and setcontext() which
turns on FIBER_USE_NATIVE. (compile with `make
DEFS="-DFIBER_USE_NATIVE=0" to explicitly disable it)

Furthermore, when both FIBER_USE_COROUTINE and FIBER_USE_NATIVE
are off, and the GC reads from the stack of a dead native
thread, MRI does not segfault on Linux. This is probably due to
libpthread not marking the page where the dead stack lives as
unreadable. Nevertheless, this use-after-free is visible through
Valgrind.

On ruby_2_5, this is an acute problem, since it doesn't have FIBER_USE_COROUTINE.
Thread cache is also unavailable for 2.5.x, triggering this issue
more often. (thread cache gives this bug a grace period since
it makes native threads wait a little before exiting)

This issue is very visible on MacOS on 2.5.x since libpthread marks
the dead stack as unreadable, consistently turning this use-after-free
into a segfault.

Fixes Bug #14561

  • cont.c: Set saved_ec.machine.stack_end to NULL when switching away from a fiber to keep the GC marking it. saved_ec gets rehydrated with a stack pointer if/when the fiber runs again.

Files


Related issues

Related to Ruby master - Bug #14714: Ruby 2.5.1 Segmentation Fault in GCClosedActions
Related to Ruby master - Bug #15375: Crash report for Ruby 2.5.3p105ClosedActions
Has duplicate Ruby master - Bug #14561: Consistent 2.5.0 seg fault in GC, related to accessing an enumerator in a threadClosedActions

History

#1

Updated by alanwu (Alan Wu) 10 months ago

  • Description updated (diff)

Updated by aselder (Andrew Selder) 10 months ago

Alan,

You're amazing... We're completely blocked on upgrading by this bug. You're a savior

Thanks,

Andrew

Updated by ioquatix (Samuel Williams) 10 months ago

  • Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN to 2.5: REQUIRED
  • Target version set to 2.6
  • Assignee set to ioquatix (Samuel Williams)

I will take look at it.

Updated by ioquatix (Samuel Williams) 10 months ago

I looked at the patch it seemed okay.

I committed it to trunk r66111

Once we've confirmed it solves the issue, I will backport 2.5 branch, hopefully it can be included 2.5.4 or next patch release.

Updated by ioquatix (Samuel Williams) 10 months ago

After I merged this patch, https://travis-ci.org/ruby/ruby/jobs/462071967 fails. I am testing locally in more detail.

Updated by ioquatix (Samuel Williams) 10 months ago

Okay, it was random failure. That concerns me too. But, it passed now. I'll merge to ruby_2_5 too.

Updated by normalperson (Eric Wong) 10 months ago

samuel@oriontransfer.net wrote:

After I merged this patch,
https://travis-ci.org/ruby/ruby/jobs/462071967 fails. I am
testing locally in more detail.

Alan + Sam: thanks for investigating this.

Perhaps an alternative fix would be to avoid calling
rb_gc_mark_machine_stack() if the owner thread is dead.

I couldn't reproduce the original problem on FreeBSD 11.2
with both FIBER_USE_NATIVE and FIBER_USE_COROUTINE disabled;
so I can't verify my hypothesis
(Didn't bother testing on Linux)

https://bugs.ruby-lang.org/issues/15362#change-75323

Updated by ioquatix (Samuel Williams) 10 months ago

normalperson (Eric Wong) I am open to ideas. I don't know a lot about how thread and GC interact within Ruby, unfortunately it's not an area I spent much time on. So, if you have better proposal, I suggest you implement it. This change included a spec, so I was happy to merge it even thought I didn't fully understand what is going wrong. Confirmed green build matrix so I am happy to close. If you want to make a change, let me know, otherwise I will close this and it would be backported to 2.5.

Updated by ioquatix (Samuel Williams) 10 months ago

normalperson (Eric Wong) I thought about it a bit more, perhaps both ideas should be implemented for maximum protection. We could add some notes in the source code too about the issue/invariants.

Updated by normalperson (Eric Wong) 10 months ago

samuel@oriontransfer.net wrote:

normalperson (Eric Wong) I thought about it a bit more, perhaps both
ideas should be implemented for maximum protection. We could
add some notes in the source code too about the
issue/invariants.

NAK. I'm against belt-and-suspenders fixes because it favors
lack-of-understanding and makes Ruby slower with more overhead.

Looking at Alan's original fix (which you committed), it seems
to make sense. Anyways, my hypothesis is here:

https://80x24.org/spew/20181201063852.30438-1-e@80x24.org/raw

It may allow Alan's simple one-line fix to be reverted.
However, I prefer Alan's one-liner fix since it's smaller
with instructions with no extra branches.

Updated by ioquatix (Samuel Williams) 10 months ago

  • Status changed from Open to Closed

Okay, I think we are in agreement then.

Updated by alanwu (Alan Wu) 10 months ago

A comment for future maintainers:
Since &fib->cont.saved_ec == ruby_current_execution_context_ptr in the place my patch makes the change, if GC starts below where I set stack_end to NULL, the stack doesn't get marked.
This is not a problem right now since there is no call to any function that could trigger GC (correct me if I'm wrong), but future changes should be careful to not introduce such calls.
I wish there was some way to codify this constraint and have the compiler statically check this for us.

Updated by dazuma (Daniel Azuma) 9 months ago

Did this get backported to ruby_2_5? I don't see a corresponding commit in the github mirror https://github.com/ruby/ruby/commits/ruby_2_5

Updated by aselder (Andrew Selder) 9 months ago

This still looks like it's waiting on a backport to Ruby 2.5. Also, does anyone know when the next release of the Ruby 2.5 branch will be done?

Thanks!

#15

Updated by nagachika (Tomoyuki Chikanaga) 8 months ago

  • Has duplicate Bug #14561: Consistent 2.5.0 seg fault in GC, related to accessing an enumerator in a thread added

Updated by nagachika (Tomoyuki Chikanaga) 8 months ago

  • Backport changed from 2.5: REQUIRED to 2.5: DONE

ruby_2_5 r66777 merged revision(s) 66111.

#17

Updated by wanabe (_ wanabe) 7 months ago

  • Related to Bug #14714: Ruby 2.5.1 Segmentation Fault in GC added
#18

Updated by wanabe (_ wanabe) 7 months ago

  • Related to Bug #15375: Crash report for Ruby 2.5.3p105 added

Also available in: Atom PDF