Bug #20169
closed`GC.compact` can raises `EFAULT` on IO
Added by ko1 (Koichi Sasada) about 1 year ago. Updated 6 months ago.
Description
-
GC.compact
introduces read barriers to detect read accesses to the pages. - I/O operations release GVL to pass the control while their execution, and another thread can call
GC.compact
(or auto compact feature I guess, but not checked yet). - Call
write(ptr)
can returnEFAULT
whenGC.compact
is running becauseptr
can point read-barrier protected pages (embed strings).
Reproducible steps:
Apply the following patch to increase possibility:
diff --git a/io.c b/io.c
index f6cd2c1a56..83d67ba2dc 100644
--- a/io.c
+++ b/io.c
@@ -1212,8 +1212,12 @@ internal_write_func(void *ptr)
}
}
+ int cnt = 0;
retry:
- do_write_retry(write(iis->fd, iis->buf, iis->capa));
+ for (; cnt < 1000; cnt++) {
+ do_write_retry(write(iis->fd, iis->buf, iis->capa));
+ if (result <= 0) break;
+ }
if (result < 0 && !iis->nonblock) {
int e = errno;
Run the following code:
t1 = Thread.new{ 10_000.times.map{"#{_1}"}; GC.compact while true }
t2 = Thread.new{
i=0
$stdout.write "<#{i+=1}>" while true
}
t2.join
and
$ make run
(snip)
4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4>#<Thread:0x00007fa61b4dd758 ../../src/trunk/test.rb:3 run> terminated with exception (report_on_exception is true):
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
from ../../src/trunk/test.rb:5:in `block in <main>'
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
from ../../src/trunk/test.rb:5:in `block in <main>'
make: *** [uncommon.mk:1383: run] Error 1
I think this is why we get EFAULT
on CI. To increase possibilities running many busy processes (ruby -e 'loop{}'
for example) will help (and on CI environment there are such busy processes accidentally).
Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago
I need to think about this a bit more, but I wonder if we can fix this on Linux at least by using userfaultfd
instead of registering SIGBUS/SIGSEGV handlers…
Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago
I did a bit of experimentation with the userfaultfd(2)
system call here: https://gist.github.com/KJTsanaktsidis/40e2a8e23012bf16af823db9ff9a890e
The SIGBUS/SIGSEGV handling we currently do only traps userspace accesses to memory, but it seems with userfaultfd it's possible to trap kernel access to memory too - my example above works both when accessing page
directly, or when write(2)
'ing it into a memfd.
Instead of read-protecting pages with mprotect(2)
and then handling the trap, we can do the following:
- Register pages with userfaultfd when we allocate them
- When we would readprotect a page, instead, remap it somewhere else with
mremap(2)
MREMAP_MAYMOVE
, and leave a faulting region behind withMREMAP_DONTUNMAP
. - When someone tries to read that page, we'll get the fault in the userfaultfd thread
- The userfaultfd thread can then remap the page back into its original position with
mremap(2)
MAP_FIXED
and re-attempt the faulting access.
This pattern is actually documented in the Linux manpage for mremap(2)
: https://man7.org/linux/man-pages/man2/mremap.2.html
Garbage collection:
MREMAP_DONTUNMAP
can be used in conjunction withuserfaultfd(2)
to implement garbage collection algorithms (e.g., in a Java virtual machine). Such an implementation can be cheaper (and simpler) than conventional garbage collection techniques that involve marking pages with protectionPROT_NONE
in conjunction with the use of aSIGSEGV
handler to catch accesses to those pages.
So that's the good part. The bad parts are:
- This will, of course, only work on Linux.
- It will only work on Linux kernels >= 5.7
- It requires that the calling process either have
CAP_SYS_PTRACE
or that/proc/sys/vm/unprivileged_userfaultfd
be set to 1. It seems common distributions default this to 0 :(
So... if we did want to go down the userfaultfd handling path, we would need to either:
- Only support GC compaction when the above conditions are met?
- Or, have both a userfaultfd implementation and the current signal-based implementation, and just accept that GC compaction can cause rare crashes on non-linux? (this sounds bad)
The only other more portable option I can think of is to define symbols for read
, write
, and other system calls that take userspace buffers which overwrite and wrap the libc versions, handle EFAULT return values, and invoke the "cancel GC compaction" logic if the faulting address is in the Ruby heap. I'm open to other bright ideas anybody might have...
(EDIT: obviously we can change Ruby's own usages of write
to wrapped versions (or even just not access Ruby object buffers directly without the GVL), but we'd need the symbol overrides to make extension modules work correctly, I think)
Updated by luke-gru (Luke Gruber) about 1 year ago
Another idea is to unembed the string (undesirable) or copy the buffer (on the stack, preferably). Since embedded strings should be small I think this should work. This won't prevent C extensions from calling IO with embedded ruby strings though.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago
Well I did a bit more thinking about this.
Firstly, I had a very unproductive morning trying to see if Mach exceptions on MacOS could catch invalid accesses in system calls, the way that userfaultfd
can on Linux. Short answer: no.
The second insight I had, though, is that if these objects are on the machine stack, then they actually should be pinned anyway. If you have a C extension and you take a pointer to some internal part of an object, it's already a requirement that you ensure the Ruby value gets spilled to the stack - i.e. you need to do something like this.
VALUE str = rb_sprintf("i am a cool string");
write(fd, RSTRING_PTR(str), RSTRING_LEN(str));
RB_GC_GUARD(str); // spill string to stack; NOT OPTIONAL!
Mabye we could change the GC compaction algorithm to not move any objects on a page (and hence skip protecting the page) if any objects in the page are live on the machine stack? That would substantially lessen the effectiveness of GC compaction I suppose, but we could maybe get that effectiveness back if userfaultfd is available?
Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago
I spent some more time today studying this problem. This is what I've learned so far. I'm definitely far from an expert in the mechanics of how GC's work, so please jump in and correct me if you are!
-
Obviously, as @luke-gru suggested, we can fix this particular manifestation of the problem by passing a stack buffer to the system calls in
io.c
and copying that to/from the Ruby heap in userspace. However, the really concerning part of this bug is what it means for the C extension API. -
From my research, it seems that read barriers (either
SIGSEGV
/userfaultfd(2)
based barriers like Ruby's, or explicit barriers around every pointer read emitted by a JIT compiler) are mostly used to support incremental compaction. The relationship between read barriers and incremental compaction is somewhat analogous to the relationship between write barriers (like Ruby'sRB_OBJ_WRITE
macro) and incremental marking. Ruby does not actually do incremental compaction at the moment (although presumably we would like to add that someday). -
The reason why we need read barriers for incrmental compaction is so that user code can see a consistent view of the world even if the heap is half-compacted. I think there are two specific things read barriers need to do in incremental compaction. Imagine a read barrier macro as
VALUE RB_OBJ_READ(VALUE owning_object, VALUE *object_to_read)
. It needs to, firstly, detect ifobject_to_read
isT_MOVED
, and update its reference inowning_object
if so, to point to its new location. But, secondly, it also needs to set the pin bit on*object_to_read
, so that it can't be compacted elsewhere in a subsequent step of GC compaction - onceRB_OBJ_READ
has returned, the value ofobject_to_read
is on the machine stack, so it needs to be pinned just as if it was live on the machine stack during the marking phase. -
Although Ruby doesn't have incremental GC compaction (yet), it does have a mechanism for user code to run in between GC compaction steps. If a page is swept during compaction because of a call to
gc_sweep_page
here, it might free some objects on that page. If aT_DATA
is freed, and it has theRUBY_TYPED_FREE_IMMEDIATELY
flag, then we will call its free function right there. At this point, we have not yet called thedcompact
function on the object to fix up its references to other objects; so, it might try and read VALUEs that are nowT_MOVED
. We need read barriers in place to detect this, just as much as if we had incremental marking. -
Rather than explicit read barriers based on
RB_OBJ_READ
, which react to already-happened moves and prevent future moves, Ruby is usingmprotect(2)
wizadry to trap access to pages withT_MOVED
objects and invalidate moves after-the-fact. I actually can't find a discussion comparing the two anywhere, but presumably the main reason for wanting to do this automagically with page protection is that it doesn't require any changes to how C extensions work. -
The specific problem in this issue in
io.c
involves access to a pointer to the Ruby heap from inside a thread which doesn't hold the GVL. However, I believe it would be entirely possible to construct a similar problem without any threading at all, because of point 4. Imagine you had aT_DATA
, which held a reference to an integer file descriptor, a string VALUE, and had theRUBY_TYPED_FREE_IMMEDIATELY
flag. In itsdfree
function, imagine it calledwrite(self->fd, RSTRING_PTR(self->str), RSTRING_LEN(self->str))
. As far as I can tell, this is perfectly legal according to our C API rules today; the thread holds the GVL, so use of theRSTRING_
macros is OK, and it does not allocate anything, which makes it a legaldfree
function. However, ifstr
had been moved, in a prior compaction step, then the page containingself->str
will be locked, and so thewrite
call will returnEFAULT
instead of triggering the read barrier. -
I think it's currently illegal to access pointers to objects embedded in the Ruby heap without the GVL, and the accesses in
io.c
need to be changed to either un-embed the string or copy it through the C stack. There is no possible way we can make this safe in the presence of compaction without fundamentally rearchitecting the GC to be concurrent. However, we still need to do something about the perfectly legal non-concurrent code in point 6. -
So I think we also need to do one or more of the following:
a) use an API like userfaultfd, when it's available, which allows kernel-mode accesses to be trapped just like user-mode ones. However, this is obviously only viable for Linux (and not even all the time then - the feature must be enabled in/proc/sys/vm/unprivileged_userfaultfd
).
b) overwrite all of the common libc symbols used to perform syscalls with userspace buffers, and provide wrapped versions that detect EFAULT return values, invoke the page-moved logic, and retry.
c) declare that it is illegal to access Ruby heap memory in kernel mode. This is kind of impossible to detect, however, without also doing something like b), so it'll just lead to incredibly rare failures like the one which spawned this issue throughout the ecosystem.
d) Implement explicit read barriers with aRB_OBJ_READ
macro and aRB_TYPED_RB_PROT
flag analogous to how we implement write barriers, and delete our SIGSEGV based automatic read barriers. This would be an incredibly disruptive ecosystem change (I think we'd probably need to disable compaction for any object marked by an object that didn't use read barriers). But on the plus side, it makes explicit what the requirements are for C extensions, allows the barriers to work at VALUE granularity rather than page granularity, and perhaps brings us closer to incremental compaction.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) almost 1 year ago
Thought: we could probably remove the need for the read barrier if we swept the heap, and then compacted. So we wouldn't free any objects in between moving things and updating refs. It would mean scanning the first half of each heap twice (once to sweep it, and once again to find the holes to fill with compacted objects). But maybe GC.compact
is not that performance sensitive since it’s mostly used before forking multiprocess web servers?
It would hurt perf for GC.auto_compact
, but I don't really know if there's much to be done about it. I can't find any literature on conservative, incremental, moving GC's, which is what you'd want for auto_compact
. Immix is conservative and moving, but it moves objects whilst stopping the world. Incremental or concurrent moving GC's like Shenandoah work by inserting compiler-provided read barriers around all accesses (which we can't do to C extensions) to fix up moved objects as they're loaded from the heap, and using compiler generated stackmaps (which we can't have for C extensions) to directly update existing moved references on the stack.
CC @eightbitraptor (Matt V-H) since I'm pretty sure I've heard you talk about immix before, maybe you have thoughts?
Updated by peterzhu2118 (Peter Zhu) 12 months ago
I implemented a fix here: https://github.com/ruby/ruby/pull/9817
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 12 months ago
So I think we can go with that fix, but I want to try and spell out what it means for "the rules" for extensions if we accept it.
AFAICT, by merging https://github.com/ruby/ruby/pull/9817, we are saying the following:
- It is illegal to dereference or store into any pointer to the Ruby heap in a no-GVL context. This implies:
a) It is illegal to directly access any field in astruct RBasic
,struct RString
, etc,
b) Thus it implies that things likeRSTRING_PTR(str)
are 100% illegal in a no-GVL context (this shouldn't be controversial)
c) This rule does allow you to manipulate a C structure/array/etc you received a pointer to from astruct RThing
(an "external pointer"), provided
i) you know that this structure is stored in the C heap and not the Ruby heap,
ii) you obtained this pointer out of thestruct RThing
whilst holding the GVL.
iii) and you obey the restrictions spelled out in Rule 2 below. - You can dereference and read from an "external pointer" in a no-GVL context, provided that
a) the VALUE from which you obtained the pointer is guaranteed to be GC live AND GC pinned from the beginning of your no-GVL context through to the last access of the pointer.
i) In practice, normally extension developers will adhere to this rule by inserting anRB_GC_GUARD
macro after a call torb_thread_call_without_gvl
to instruct the compiler to keep the VALUE live on the thread/fiber's C stack and thus visible to Ruby's GC.
ii) However, there are other ways to meet this requirement too (e.g. adding the object directly as a GC root withrb_global_variable
or such).
iii) The reason why it's important for the object to be GC pinned is that if the object is moved, the object'sdcompact
function is called, and this function is allowed to mutate the "external pointer".
b) No Ruby thread can be using this object, either in C or in Ruby. If an object's "external pointer"s are being used in a no-GVL context, then the object as a whole cannot be used in a GVL context.
c) the VALUE has not at any point in its lifetime been made visible to Ruby code, IF the value is not frozen.
i) This obviously means it e.g. cannot have been yielded back to Ruby withrb_yield
, or previously returned, etc.
ii) It also means that the object must have been hidden with a call torb_gc_hide
immediately after being allocated by the GC. There cannot be any call to Ruby, nor call to the GC, in between where the program allocates the VALUE and hides it. In practice, APIs like therb_str_tmp_*
family will return pre-hidden objects which meet these requirements. This ensures that a mutable value being accessed from C was hidden from calls likeObjectSpace.each_object
.
iii) However, if an object is frozen, then this requirement does not apply - it is legal to read an "external pointer" in a no-GVL context if the object it was obtained from is frozen, even if that object is visible to Ruby.
iv) A corollary of iii is that an extension which exposes "external pointers" to C-heap allocated data owned by its T_DATA object cannot mutate this data if the underlyingT_DATA
is frozen. It's fairly uncommon for extensions to actually expose C-level APIs like this, but it is possible by e.g. exporting symbols with__attribute__((visibility(default)))
. - You can write to an "external pointer" in a no-GVL context, provided that you adhere to rule 2, AND that the object is not frozen.
- When reading from and writing to "external pointers", extension developers are responsible for appropriately synchronizing their reads and writes to this pointer (in practice, most usage of such "external pointers" will be restricted to a single thread, however).
These rules are pretty complicated, but I think this is what's implied by the patch. In particular, i think 2.c.iii is implied by
if (OBJ_FROZEN_RAW(orig) && !STR_EMBED_P(orig) && !rb_str_reembeddable_p(orig)) return orig;
but was kind of surprising to me. Possibly, although these are the rules that seem to apply within cruby, we want to document a more restrictive set of rules for extension developers which are easier to reason about (perhaps even as far as "don't store results of no-GVL routines into Ruby strings at all in your extension, you must bounce everything through native memory and copy it back into Ruby with the GVL").
Whatever these rules are, I think we need a clearer description of them in extension.rdoc. If they're different from the internal rules, let's document the "real" messy internal rules in the source tree as well. I'm happy to open a PR to do this once we get some agreement on the broad outlines.
As an aside, I'm researching whether there might be ways to automatically enforce some of this in CI (both for CRuby and for extension developers) by leveraging LLVM's sanitizer infrastructure somehow - I think it would be really helpful to have an executable spec of what the rules are like this!
Updated by peterzhu2118 (Peter Zhu) 12 months ago
I want to try and spell out what it means for "the rules" for extensions if we accept it.
I don't think this implies any new rules. IMO it's undefined behaviour to use a Ruby object in a non-GVL scenario, this just fixes the issue inside of Ruby using a hack. Extension developers should not be relying on anything implied by this patch.
It is illegal to directly access any field in a struct RBasic, struct RString, etc
It's already illegal (or undefined behaviour) to do this.
Possibly, although these are the rules that seem to apply within cruby
They're not exactly rules in CRuby. The implementation inside of CRuby is tightly coupled, so changing the implementation of one thing can lead to many other things needing their implementation changed.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 12 months ago
IMO it's undefined behaviour to use a Ruby object in a non-GVL scenario, this just fixes the issue inside of Ruby using a hack.
In this case, shouldn’t we fix the original issuing by copying the write through a C-allocated buffer? But…
They're not exactly rules in CRuby
I guess what you’re saying is “CRuby itself can do what it wants as long as it’s internally consistent”. Which makes sense.
I guess it would be good to find a way to define and enforce what “internally consistent” means, but perhaps there isn’t much value in strictly defining it in the absence of a way to strictly enforce it. Things like VM_CHECK_MODE help, but couldn’t catch something like this. If I actually come up with an actionable idea, I’ll open a new proposal for that.
———
Anyway, fixing this bug with your patch sounds good to me. I definitely am not trying to hold up real fixes just to have academic arguments about memory models.
Updated by peterzhu2118 (Peter Zhu) 11 months ago
- Status changed from Open to Closed
Applied in changeset git|5e0c17145131e073814c7e5b15227d0b4e73cabe.
Make io_fwrite safe for compaction
[Bug #20169]
Embedded strings are not safe for system calls without the GVL because
compaction can cause pages to be locked causing the operation to fail
with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
which guarantees that the return string is not embedded.
Updated by k0kubun (Takashi Kokubun) 8 months ago
- Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: DONE
ruby_3_3 b77b5c191513f5f281e72a51e6b2de29e2d2d7a6 merged revision(s) 5e0c17145131e073814c7e5b15227d0b4e73cabe.
Updated by nagachika (Tomoyuki Chikanaga) 8 months ago
- Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: DONE to 3.1: DONTNEED, 3.2: REQUIRED, 3.3: DONE
Updated by nagachika (Tomoyuki Chikanaga) 6 months ago
- Backport changed from 3.1: DONTNEED, 3.2: REQUIRED, 3.3: DONE to 3.1: DONTNEED, 3.2: DONE, 3.3: DONE
ruby_3_2 6b73406833dd22e489114fa77c1c80c4b7af2ed0 merged revision(s) 5e0c17145131e073814c7e5b15227d0b4e73cabe.
Updated by nagachika (Tomoyuki Chikanaga) 6 months ago
- Backport changed from 3.1: DONTNEED, 3.2: DONE, 3.3: DONE to 3.1: DONTNEED, 3.2: WONTFIX, 3.3: DONE
Reverted 6b73406833dd22e489114fa77c1c80c4b7af2ed0. It introduce failures on build condition with USE_RVARGC=0.