Project

General

Profile

Actions

Bug #18782

open

Race conditions in autoload when loading the same feature with multiple threads.

Added by ioquatix (Samuel Williams) about 1 month ago. Updated about 1 month ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108552]

Description

I have identified several race conditions in the autoload code.

  1. It's possible to race on adding and then deleting items in autoload_featuremap. When this happens, two threads will try to load the same file with different autoload data and deadlock.
  2. When finishing autoload, it's necessary to clear ele->state before setting constants. If this is not synchronised, a thread can see the cleared ele->state before seeing the constants and assume the constant is not being autoloaded and then fail with NameError.

This test case can reproduce both cases:

# test.rb
autoload_path = File.join(__dir__, "foobar.rb")
File.write(autoload_path, 'module Foo; end; module Bar; end')

100_000.times do
	$stderr.puts "--------------------"
	autoload :Foo, autoload_path
	autoload :Bar, autoload_path

	t1 = Thread.new {Foo}
	t2 = Thread.new {Bar}

	t1.join
	t2.join

	Object.send(:remove_const, :Foo)
	Object.send(:remove_const, :Bar)

	$LOADED_FEATURES.delete(autoload_path)
end

Example failure of case (1):

-------------------- (success)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
check_autoload_required 2
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd678be780
check_autoload_required 4
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd678be780
check_autoload_required 4
ele=0x55f33b806a30 ele->state=0x7fdd678be780 = NULL
check_autoload_required 4
-------------------- (failure)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
autoload_by_someone_else ele=0x55f33b6e8f40 ele->state=(nil)
check_autoload_required 2
check_autoload_required 3
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd6779d780
check_autoload_required 1
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd6779d780
check_autoload_required 1
ele=0x55f33b806a30 ele->state=0x7fdd6779d780 = NULL
ele=0x55f33b6e8f40 ele->state=0x7fdd678be780 = NULL
../test.rb:12:in `join': No live threads left. Deadlock? (fatal)
3 threads, 3 sleeps current:0x000055f33b771250 main thread:0x000055f33b66e090
* #<Thread:0x00007fdd6a2cb0b0 sleep_forever>
   rb_thread_t:0x000055f33b66e090 native:0x00007fdd6a71c3c0 int:0
   
* #<Thread:0x00007fdd676e0090 ../test.rb:9 sleep_forever>
   rb_thread_t:0x000055f33b770ff0 native:0x00007fdd6789e640 int:1 mutex:0x000055f33b7c5100 cond:1
    depended by: tb_thread_id:0x000055f33b66e090
   
* #<Thread:0x00007fdd676e1238 ../test.rb:10 sleep_forever>
   rb_thread_t:0x000055f33b771250 native:0x00007fdd679bf640 int:0
   

	from ../test.rb:12:in `block in <main>'
	from ../test.rb:4:in `times'
	from ../test.rb:4:in `<main>'
make: *** [uncommon.mk:1250: runruby] Error 1

Example failure of case (2):

[0x7f175fe5b0c8] rb_autoload_str mod=Object id=Foo file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str const_set mod=Object id=Foo file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str mod=Object id=Bar file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str const_set mod=Object id=Bar file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe61d88] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] Assigning constants...
[0x7f175fe61d88] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] autoload_const_set name=:Foo value=Foo
[0x7f175fe61e78] autoload_const_set name=:Bar value=Bar
#<Thread:0x00007f175fe61d88 ../test.rb:11 run> terminated with exception (report_on_exception is true):
../test.rb:11:in `block (2 levels) in <main>': uninitialized constant Bar (NameError)
../test.rb:11:in `block (2 levels) in <main>': uninitialized constant Bar (NameError)
make: *** [uncommon.mk:1250: runruby] Error 1

These failures are very uncommon but it does impact Ruby as far back as 2.7, and probably earlier.


Files

Updated by ioquatix (Samuel Williams) about 1 month ago

@normalperson (Eric Wong) I've attached patch files. If you have time, I'd really appreciate your feedback on this. Specifically, patch 5 - freeing this unconditionally causes the Ruby VM to hang in test-bundler-parallel and I'm not sure why. Here is a backtrace of a hung process:

* thread #1, name = 'ruby', stop reason = signal SIGSTOP
  * frame #0: 0x00007f91a51972cb libc.so.6`__lll_lock_wait_private + 43
    frame #1: 0x00007f91a51ab6e0 libc.so.6`__libc_calloc + 816
    frame #2: 0x00007f91a5542823 ld-linux-x86-64.so.2`_dl_new_object + 115
    frame #3: 0x00007f91a553ded3 ld-linux-x86-64.so.2`_dl_map_object_from_fd + 179
    frame #4: 0x00007f91a553f7ac ld-linux-x86-64.so.2`_dl_map_object + 508
    frame #5: 0x00007f91a55435b9 ld-linux-x86-64.so.2`dl_open_worker_begin + 169
    frame #6: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
    frame #7: 0x00007f91a5542d7b ld-linux-x86-64.so.2`dl_open_worker + 59
    frame #8: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
    frame #9: 0x00007f91a554315d ld-linux-x86-64.so.2`_dl_open + 173
    frame #10: 0x00007f91a5268211 libc.so.6`do_dlopen + 65
    frame #11: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
    frame #12: 0x00007f91a5267ee3 libc.so.6`_dl_catch_error + 51
    frame #13: 0x00007f91a5268170 libc.so.6`dlerror_run + 64
    frame #14: 0x00007f91a526832a libc.so.6`__libc_dlopen_mode + 74
    frame #15: 0x00007f91a521f191 libc.so.6`__libc_unwind_link_get + 81
    frame #16: 0x00007f91a522c867 libc.so.6`__backtrace + 39
    frame #17: 0x000056542d70de17 ruby`rb_vm_bugreport at vm_dump.c:762:13
    frame #18: 0x000056542d7a454a ruby`rb_bug_for_fatal_signal at error.c:822:5
    frame #19: 0x000056542d663c49 ruby`sigsegv(sig=<unavailable>, info=<unavailable>, ctx=<unavailable>) at signal.c:964:5
    frame #20: 0x00007f91a514f560 libc.so.6`__restore_rt
    frame #21: 0x00007f91a51a6ef1 libc.so.6`malloc_consolidate + 257
    frame #22: 0x00007f91a51a8450 libc.so.6`_int_free + 2032
    frame #23: 0x00007f91a51aabe3 libc.so.6`free + 115
    frame #24: 0x00007f91a19482bd libyaml-0.so.2`yaml_parser_delete + 77
    frame #25: 0x00007f91a4c49bb9 psych.so`dealloc(ptr=0x000056542efbf0e0) at psych_parser.c:48:5
    frame #26: 0x000056542d525a2c ruby`finalize_deferred_heap_pages at gc.c:4241:2
    frame #27: 0x000056542d5307f6 ruby`rb_objspace_call_finalizer at gc.c:4427:5
    frame #28: 0x000056542d509de0 ruby`rb_ec_cleanup at eval.c:164:5
    frame #29: 0x000056542d50a65c ruby`ruby_run_node at eval.c:321:12
    frame #30: 0x000056542d5041fa ruby`rb_main(argc=7, argv=0x00007ffeffea7e48) at main.c:47:9
    frame #31: 0x000056542d504230 ruby`main(argc=7, argv=0x00007ffeffea7e48) at main.c:56:10
    frame #32: 0x00007f91a513a310 libc.so.6`__libc_start_call_main + 128
    frame #33: 0x00007f91a513a3c1 libc.so.6`__libc_start_main@@GLIBC_2.34 + 129
    frame #34: 0x000056542d504095 ruby`_start + 37

Updated by ioquatix (Samuel Williams) about 1 month ago

After debugging the above issue I found a use-after-free bug in

static void
autoload_c_free(void *ptr)
{
    struct autoload_const *ac = ptr;
    ccan_list_del(&ac->cnode); // <- can mess up other unrelated objects if parent linked list head is already freed.
    xfree(ac);
}

This only becomes an issue because we fixed:

static void
autoload_i_free(void *ptr)
{
    struct autoload_data_i *p = ptr;

    /* we may leak some memory at VM shutdown time, no big deal */
    if (ccan_list_empty(&p->constants)) {
	xfree(p);
    }
}

In the above code, p->constants was (almost) never empty so the underlying data was never freed.

Updated by ioquatix (Samuel Williams) about 1 month ago

Okay, so I suspected some issue with autoload_delete and finally found how to reproduce it.

static void
autoload_delete(VALUE mod, ID id)
{
    RB_VM_LOCK_ENTER();

    st_data_t val, load = 0, n = id;

    if (st_lookup(RCLASS_IV_TBL(mod), (st_data_t)autoload, &val)) {
        struct st_table *tbl = check_autoload_table((VALUE)val);
        struct autoload_data *ele;
        struct autoload_const *ac;

        st_delete(tbl, &n, &load);

        /* Qfalse can indicate already deleted */
        if (load != Qfalse) {
            ele = get_autoload_data((VALUE)load, &ac);
            VM_ASSERT(ele);
            /*
             * we must delete here to avoid "already initialized" warnings
             * with parallel autoload.  Using list_del_init here so list_del
             * works in autoload_c_free
             */
            ccan_list_del_init(&ac->cnode);

            if (tbl->num_entries == 0) {
                n = autoload;
                st_delete(RCLASS_IV_TBL(mod), &n, &val);
            }
        }
    }

    RB_VM_LOCK_LEAVE();

    rb_thread_schedule(); // add this to create odd behaviour (eventually).
}

Test case:

module MyModule; end

file = "autoload_does_not_exist.rb"

190.times do
	puts "."
	start = false

	threads = Array.new(10) do
		Thread.new do
			Thread.pass until start
			begin
				MyModule.autoload :FileDoesNotExist, file
				Thread.pass
				MyModule::FileDoesNotExist
			rescue LoadError => e
				e
			ensure
				Thread.pass
			end
		end
	end

	start = true
	threads.each { |t|
		raise "Failed" unless t.value.is_a?(LoadError)
	}
end

Failure example:

.
#<Thread:0x00007fafd3665c08 ../test.rb:35 run> terminated with exception (report_on_exception is true):
../test.rb:40:in `block (3 levels) in <main>': uninitialized constant MyModule::FileDoesNotExist (NameError)
../test.rb:40:in `block (3 levels) in <main>': uninitialized constant MyModule::FileDoesNotExist (NameError)
make: *** [uncommon.mk:1250: runruby] Error 1

Updated by ioquatix (Samuel Williams) about 1 month ago

@fxn (Xavier Noria) said:

Once the trigger has been executed, it is gone, the way I see it. How is that represented internally, I don't care and should not drive the design of the interface in my view. To me, autoload is not metadata of the module you need to preserve and autoload? should say if there's an autoload not yet triggered in the class/module or ancestors or not.

Maybe, but how it actually works is a lot more nuanced.

Consider multiple threads trying to require a feature which is slow to load. Let's say it's a web sever, and the first thread is cancelled because the request is cancelled. Should this cause all subsequent attempts to use the constant to fail? If you think of it like a trigger, the first thread fires the trigger, fails to load the file, and now it's gone and can never be loaded. But if you look at it as more of a gate or doorway, on transitioning through the doorway, you may eventually get the constant, even if prior threads failed to make it all the way through.

The doorway style is safer because a transient failure won't cause the constant load to fail outright and allows subsequent threads to retry. It's also easier from an implementation point of view to NOT mutate the autoload state while it's being loaded which introduces more complexity in the locking and state management. Essentially I prefer immutable autoload state + read-only operations during load. Deleting the state is a write operation which significantly complicates the internal locking logic.

Updated by byroot (Jean Boussier) about 1 month ago

Consider multiple threads trying to require a feature which is slow to load. Let's say it's a web sever, and the first thread is cancelled because the request is cancelled. Should this cause all subsequent attempts to use the constant to fail?

Unless I misunderstood, that's exactly what would happen with explicit require calls, as the file would end up in $LOADED_FEATURES, so subsequence require calls would be a noop.

I'm not gonna pretend it's graceful, but it's somewhat consistent.

Updated by ioquatix (Samuel Williams) about 1 month ago

@byroot (Jean Boussier) Does this align up with what you are saying?

features = $LOADED_FEATURES.dup

path = "./sleepy.rb"

File.write(path, "sleep 5")

thread = Thread.new do
	require path
end

sleep 1

thread.kill
thread.join

pp $LOADED_FEATURES - features

Because it's empty.

Actions #9

Updated by normalperson (Eric Wong) about 1 month ago

"ioquatix (Samuel Williams)" wrote:

@normalperson (Eric Wong) I've attached patch files. If you have time, I'd really appreciate your feedback on this. Specifically, patch 5 - freeing this unconditionally causes the Ruby VM to hang in test-bundler-parallel and I'm not sure why. Here is a backtrace of a hung process:

Sorry, I don't have time to look at it. Maybe late June or July. Thanks.

Actions

Also available in: Atom PDF