Project

General

Profile

Actions

Bug #20169

closed

`GC.compact` can raises `EFAULT` on IO

Added by ko1 (Koichi Sasada) about 2 months ago. Updated 25 days ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:116114]

Description

  1. GC.compact introduces read barriers to detect read accesses to the pages.
  2. I/O operations release GVL to pass the control while their execution, and another thread can call GC.compact (or auto compact feature I guess, but not checked yet).
  3. Call write(ptr) can return EFAULT when GC.compact is running because ptr can point read-barrier protected pages (embed strings).

Reproducible steps:

Apply the following patch to increase possibility:

diff --git a/io.c b/io.c
index f6cd2c1a56..83d67ba2dc 100644
--- a/io.c
+++ b/io.c
@@ -1212,8 +1212,12 @@ internal_write_func(void *ptr)
         }
     }

+    int cnt = 0;
   retry:
-    do_write_retry(write(iis->fd, iis->buf, iis->capa));
+    for (; cnt < 1000; cnt++) {
+        do_write_retry(write(iis->fd, iis->buf, iis->capa));
+        if (result <= 0) break;
+    }

     if (result < 0 && !iis->nonblock) {
         int e = errno;

Run the following code:

t1 = Thread.new{ 10_000.times.map{"#{_1}"}; GC.compact while true }
t2 = Thread.new{
  i=0
  $stdout.write "<#{i+=1}>" while true
}
t2.join

and

$ make run
(snip)
4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4>#<Thread:0x00007fa61b4dd758 ../../src/trunk/test.rb:3 run> terminated with exception (report_on_exception is true):
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
        from ../../src/trunk/test.rb:5:in `block in <main>'
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
        from ../../src/trunk/test.rb:5:in `block in <main>'
make: *** [uncommon.mk:1383: run] Error 1

I think this is why we get EFAULT on CI. To increase possibilities running many busy processes (ruby -e 'loop{}' for example) will help (and on CI environment there are such busy processes accidentally).

Actions #1

Updated by ko1 (Koichi Sasada) about 2 months ago

  • Description updated (diff)
Actions #2

Updated by ko1 (Koichi Sasada) about 2 months ago

  • Description updated (diff)

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 2 months ago

I need to think about this a bit more, but I wonder if we can fix this on Linux at least by using userfaultfd instead of registering SIGBUS/SIGSEGV handlers…

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 2 months ago

I did a bit of experimentation with the userfaultfd(2) system call here: https://gist.github.com/KJTsanaktsidis/40e2a8e23012bf16af823db9ff9a890e

The SIGBUS/SIGSEGV handling we currently do only traps userspace accesses to memory, but it seems with userfaultfd it's possible to trap kernel access to memory too - my example above works both when accessing page directly, or when write(2)'ing it into a memfd.

Instead of read-protecting pages with mprotect(2) and then handling the trap, we can do the following:

  • Register pages with userfaultfd when we allocate them
  • When we would readprotect a page, instead, remap it somewhere else with mremap(2) MREMAP_MAYMOVE, and leave a faulting region behind with MREMAP_DONTUNMAP.
  • When someone tries to read that page, we'll get the fault in the userfaultfd thread
  • The userfaultfd thread can then remap the page back into its original position with mremap(2) MAP_FIXED and re-attempt the faulting access.

This pattern is actually documented in the Linux manpage for mremap(2): https://man7.org/linux/man-pages/man2/mremap.2.html

Garbage collection: MREMAP_DONTUNMAP can be used in conjunction with userfaultfd(2) to implement garbage collection algorithms (e.g., in a Java virtual machine). Such an implementation can be cheaper (and simpler) than conventional garbage collection techniques that involve marking pages with protection PROT_NONE in conjunction with the use of a SIGSEGV handler to catch accesses to those pages.

So that's the good part. The bad parts are:

  • This will, of course, only work on Linux.
  • It will only work on Linux kernels >= 5.7
  • It requires that the calling process either have CAP_SYS_PTRACE or that /proc/sys/vm/unprivileged_userfaultfd be set to 1. It seems common distributions default this to 0 :(

So... if we did want to go down the userfaultfd handling path, we would need to either:

  • Only support GC compaction when the above conditions are met?
  • Or, have both a userfaultfd implementation and the current signal-based implementation, and just accept that GC compaction can cause rare crashes on non-linux? (this sounds bad)

The only other more portable option I can think of is to define symbols for read, write, and other system calls that take userspace buffers which overwrite and wrap the libc versions, handle EFAULT return values, and invoke the "cancel GC compaction" logic if the faulting address is in the Ruby heap. I'm open to other bright ideas anybody might have...

(EDIT: obviously we can change Ruby's own usages of write to wrapped versions (or even just not access Ruby object buffers directly without the GVL), but we'd need the symbol overrides to make extension modules work correctly, I think)

Updated by luke-gru (Luke Gruber) about 2 months ago

Another idea is to unembed the string (undesirable) or copy the buffer (on the stack, preferably). Since embedded strings should be small I think this should work. This won't prevent C extensions from calling IO with embedded ruby strings though.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 2 months ago

Well I did a bit more thinking about this.

Firstly, I had a very unproductive morning trying to see if Mach exceptions on MacOS could catch invalid accesses in system calls, the way that userfaultfd can on Linux. Short answer: no.

The second insight I had, though, is that if these objects are on the machine stack, then they actually should be pinned anyway. If you have a C extension and you take a pointer to some internal part of an object, it's already a requirement that you ensure the Ruby value gets spilled to the stack - i.e. you need to do something like this.

VALUE str = rb_sprintf("i am a cool string");
write(fd, RSTRING_PTR(str), RSTRING_LEN(str));
RB_GC_GUARD(str); // spill string to stack; NOT OPTIONAL!

Mabye we could change the GC compaction algorithm to not move any objects on a page (and hence skip protecting the page) if any objects in the page are live on the machine stack? That would substantially lessen the effectiveness of GC compaction I suppose, but we could maybe get that effectiveness back if userfaultfd is available?

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 2 months ago

I spent some more time today studying this problem. This is what I've learned so far. I'm definitely far from an expert in the mechanics of how GC's work, so please jump in and correct me if you are!

  1. Obviously, as @luke-gru (Luke Gruber) suggested, we can fix this particular manifestation of the problem by passing a stack buffer to the system calls in io.c and copying that to/from the Ruby heap in userspace. However, the really concerning part of this bug is what it means for the C extension API.

  2. From my research, it seems that read barriers (either SIGSEGV/userfaultfd(2) based barriers like Ruby's, or explicit barriers around every pointer read emitted by a JIT compiler) are mostly used to support incremental compaction. The relationship between read barriers and incremental compaction is somewhat analogous to the relationship between write barriers (like Ruby's RB_OBJ_WRITE macro) and incremental marking. Ruby does not actually do incremental compaction at the moment (although presumably we would like to add that someday).

  3. The reason why we need read barriers for incrmental compaction is so that user code can see a consistent view of the world even if the heap is half-compacted. I think there are two specific things read barriers need to do in incremental compaction. Imagine a read barrier macro as VALUE RB_OBJ_READ(VALUE owning_object, VALUE *object_to_read). It needs to, firstly, detect if object_to_read is T_MOVED, and update its reference in owning_object if so, to point to its new location. But, secondly, it also needs to set the pin bit on *object_to_read, so that it can't be compacted elsewhere in a subsequent step of GC compaction - once RB_OBJ_READ has returned, the value of object_to_read is on the machine stack, so it needs to be pinned just as if it was live on the machine stack during the marking phase.

  4. Although Ruby doesn't have incremental GC compaction (yet), it does have a mechanism for user code to run in between GC compaction steps. If a page is swept during compaction because of a call to gc_sweep_page here, it might free some objects on that page. If a T_DATA is freed, and it has the RUBY_TYPED_FREE_IMMEDIATELY flag, then we will call its free function right there. At this point, we have not yet called the dcompact function on the object to fix up its references to other objects; so, it might try and read VALUEs that are now T_MOVED. We need read barriers in place to detect this, just as much as if we had incremental marking.

  5. Rather than explicit read barriers based on RB_OBJ_READ, which react to already-happened moves and prevent future moves, Ruby is using mprotect(2) wizadry to trap access to pages with T_MOVED objects and invalidate moves after-the-fact. I actually can't find a discussion comparing the two anywhere, but presumably the main reason for wanting to do this automagically with page protection is that it doesn't require any changes to how C extensions work.

  6. The specific problem in this issue in io.c involves access to a pointer to the Ruby heap from inside a thread which doesn't hold the GVL. However, I believe it would be entirely possible to construct a similar problem without any threading at all, because of point 4. Imagine you had a T_DATA, which held a reference to an integer file descriptor, a string VALUE, and had the RUBY_TYPED_FREE_IMMEDIATELY flag. In its dfree function, imagine it called write(self->fd, RSTRING_PTR(self->str), RSTRING_LEN(self->str)). As far as I can tell, this is perfectly legal according to our C API rules today; the thread holds the GVL, so use of the RSTRING_ macros is OK, and it does not allocate anything, which makes it a legal dfree function. However, if str had been moved, in a prior compaction step, then the page containing self->str will be locked, and so the write call will return EFAULT instead of triggering the read barrier.

  7. I think it's currently illegal to access pointers to objects embedded in the Ruby heap without the GVL, and the accesses in io.c need to be changed to either un-embed the string or copy it through the C stack. There is no possible way we can make this safe in the presence of compaction without fundamentally rearchitecting the GC to be concurrent. However, we still need to do something about the perfectly legal non-concurrent code in point 6.

  8. So I think we also need to do one or more of the following:
    a) use an API like userfaultfd, when it's available, which allows kernel-mode accesses to be trapped just like user-mode ones. However, this is obviously only viable for Linux (and not even all the time then - the feature must be enabled in /proc/sys/vm/unprivileged_userfaultfd).
    b) overwrite all of the common libc symbols used to perform syscalls with userspace buffers, and provide wrapped versions that detect EFAULT return values, invoke the page-moved logic, and retry.
    c) declare that it is illegal to access Ruby heap memory in kernel mode. This is kind of impossible to detect, however, without also doing something like b), so it'll just lead to incredibly rare failures like the one which spawned this issue throughout the ecosystem.
    d) Implement explicit read barriers with a RB_OBJ_READ macro and a RB_TYPED_RB_PROT flag analogous to how we implement write barriers, and delete our SIGSEGV based automatic read barriers. This would be an incredibly disruptive ecosystem change (I think we'd probably need to disable compaction for any object marked by an object that didn't use read barriers). But on the plus side, it makes explicit what the requirements are for C extensions, allows the barriers to work at VALUE granularity rather than page granularity, and perhaps brings us closer to incremental compaction.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 month ago

Thought: we could probably remove the need for the read barrier if we swept the heap, and then compacted. So we wouldn't free any objects in between moving things and updating refs. It would mean scanning the first half of each heap twice (once to sweep it, and once again to find the holes to fill with compacted objects). But maybe GC.compact is not that performance sensitive since it’s mostly used before forking multiprocess web servers?

It would hurt perf for GC.auto_compact, but I don't really know if there's much to be done about it. I can't find any literature on conservative, incremental, moving GC's, which is what you'd want for auto_compact. Immix is conservative and moving, but it moves objects whilst stopping the world. Incremental or concurrent moving GC's like Shenandoah work by inserting compiler-provided read barriers around all accesses (which we can't do to C extensions) to fix up moved objects as they're loaded from the heap, and using compiler generated stackmaps (which we can't have for C extensions) to directly update existing moved references on the stack.

CC @eightbitraptor (Matthew Valentine-House) since I'm pretty sure I've heard you talk about immix before, maybe you have thoughts?

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 27 days ago

So I think we can go with that fix, but I want to try and spell out what it means for "the rules" for extensions if we accept it.

AFAICT, by merging https://github.com/ruby/ruby/pull/9817, we are saying the following:

  1. It is illegal to dereference or store into any pointer to the Ruby heap in a no-GVL context. This implies:
    a) It is illegal to directly access any field in a struct RBasic, struct RString, etc,
    b) Thus it implies that things like RSTRING_PTR(str) are 100% illegal in a no-GVL context (this shouldn't be controversial)
    c) This rule does allow you to manipulate a C structure/array/etc you received a pointer to from a struct RThing (an "external pointer"), provided
    i) you know that this structure is stored in the C heap and not the Ruby heap,
    ii) you obtained this pointer out of the struct RThing whilst holding the GVL.
    iii) and you obey the restrictions spelled out in Rule 2 below.
  2. You can dereference and read from an "external pointer" in a no-GVL context, provided that
    a) the VALUE from which you obtained the pointer is guaranteed to be GC live AND GC pinned from the beginning of your no-GVL context through to the last access of the pointer.
    i) In practice, normally extension developers will adhere to this rule by inserting an RB_GC_GUARD macro after a call to rb_thread_call_without_gvl to instruct the compiler to keep the VALUE live on the thread/fiber's C stack and thus visible to Ruby's GC.
    ii) However, there are other ways to meet this requirement too (e.g. adding the object directly as a GC root with rb_global_variable or such).
    iii) The reason why it's important for the object to be GC pinned is that if the object is moved, the object's dcompact function is called, and this function is allowed to mutate the "external pointer".
    b) No Ruby thread can be using this object, either in C or in Ruby. If an object's "external pointer"s are being used in a no-GVL context, then the object as a whole cannot be used in a GVL context.
    c) the VALUE has not at any point in its lifetime been made visible to Ruby code, IF the value is not frozen.
    i) This obviously means it e.g. cannot have been yielded back to Ruby with rb_yield, or previously returned, etc.
    ii) It also means that the object must have been hidden with a call to rb_gc_hide immediately after being allocated by the GC. There cannot be any call to Ruby, nor call to the GC, in between where the program allocates the VALUE and hides it. In practice, APIs like the rb_str_tmp_* family will return pre-hidden objects which meet these requirements. This ensures that a mutable value being accessed from C was hidden from calls like ObjectSpace.each_object.
    iii) However, if an object is frozen, then this requirement does not apply - it is legal to read an "external pointer" in a no-GVL context if the object it was obtained from is frozen, even if that object is visible to Ruby.
    iv) A corollary of iii is that an extension which exposes "external pointers" to C-heap allocated data owned by its T_DATA object cannot mutate this data if the underlying T_DATA is frozen. It's fairly uncommon for extensions to actually expose C-level APIs like this, but it is possible by e.g. exporting symbols with __attribute__((visibility(default))).
  3. You can write to an "external pointer" in a no-GVL context, provided that you adhere to rule 2, AND that the object is not frozen.
  4. When reading from and writing to "external pointers", extension developers are responsible for appropriately synchronizing their reads and writes to this pointer (in practice, most usage of such "external pointers" will be restricted to a single thread, however).

These rules are pretty complicated, but I think this is what's implied by the patch. In particular, i think 2.c.iii is implied by

if (OBJ_FROZEN_RAW(orig) && !STR_EMBED_P(orig) && !rb_str_reembeddable_p(orig)) return orig;

but was kind of surprising to me. Possibly, although these are the rules that seem to apply within cruby, we want to document a more restrictive set of rules for extension developers which are easier to reason about (perhaps even as far as "don't store results of no-GVL routines into Ruby strings at all in your extension, you must bounce everything through native memory and copy it back into Ruby with the GVL").

Whatever these rules are, I think we need a clearer description of them in extension.rdoc. If they're different from the internal rules, let's document the "real" messy internal rules in the source tree as well. I'm happy to open a PR to do this once we get some agreement on the broad outlines.

As an aside, I'm researching whether there might be ways to automatically enforce some of this in CI (both for CRuby and for extension developers) by leveraging LLVM's sanitizer infrastructure somehow - I think it would be really helpful to have an executable spec of what the rules are like this!

Updated by peterzhu2118 (Peter Zhu) 25 days ago

I want to try and spell out what it means for "the rules" for extensions if we accept it.

I don't think this implies any new rules. IMO it's undefined behaviour to use a Ruby object in a non-GVL scenario, this just fixes the issue inside of Ruby using a hack. Extension developers should not be relying on anything implied by this patch.

It is illegal to directly access any field in a struct RBasic, struct RString, etc

It's already illegal (or undefined behaviour) to do this.

Possibly, although these are the rules that seem to apply within cruby

They're not exactly rules in CRuby. The implementation inside of CRuby is tightly coupled, so changing the implementation of one thing can lead to many other things needing their implementation changed.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 25 days ago

IMO it's undefined behaviour to use a Ruby object in a non-GVL scenario, this just fixes the issue inside of Ruby using a hack.

In this case, shouldn’t we fix the original issuing by copying the write through a C-allocated buffer? But…

They're not exactly rules in CRuby

I guess what you’re saying is “CRuby itself can do what it wants as long as it’s internally consistent”. Which makes sense.

I guess it would be good to find a way to define and enforce what “internally consistent” means, but perhaps there isn’t much value in strictly defining it in the absence of a way to strictly enforce it. Things like VM_CHECK_MODE help, but couldn’t catch something like this. If I actually come up with an actionable idea, I’ll open a new proposal for that.

———

Anyway, fixing this bug with your patch sounds good to me. I definitely am not trying to hold up real fixes just to have academic arguments about memory models.

Actions #13

Updated by peterzhu2118 (Peter Zhu) 25 days ago

  • Status changed from Open to Closed

Applied in changeset git|5e0c17145131e073814c7e5b15227d0b4e73cabe.


Make io_fwrite safe for compaction

[Bug #20169]

Embedded strings are not safe for system calls without the GVL because
compaction can cause pages to be locked causing the operation to fail
with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
which guarantees that the return string is not embedded.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0