Bug #18782
closedRace conditions in autoload when loading the same feature with multiple threads.
Description
I have identified several race conditions in the autoload code.
- It's possible to race on adding and then deleting items in
autoload_featuremap
. When this happens, two threads will try to load the same file with different autoload data and deadlock. - When finishing autoload, it's necessary to clear
ele->state
before setting constants. If this is not synchronised, a thread can see the clearedele->state
before seeing the constants and assume the constant is not being autoloaded and then fail withNameError
.
This test case can reproduce both cases:
# test.rb
autoload_path = File.join(__dir__, "foobar.rb")
File.write(autoload_path, 'module Foo; end; module Bar; end')
100_000.times do
$stderr.puts "--------------------"
autoload :Foo, autoload_path
autoload :Bar, autoload_path
t1 = Thread.new {Foo}
t2 = Thread.new {Bar}
t1.join
t2.join
Object.send(:remove_const, :Foo)
Object.send(:remove_const, :Bar)
$LOADED_FEATURES.delete(autoload_path)
end
Example failure of case (1):
-------------------- (success)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
check_autoload_required 2
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd678be780
check_autoload_required 4
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd678be780
check_autoload_required 4
ele=0x55f33b806a30 ele->state=0x7fdd678be780 = NULL
check_autoload_required 4
-------------------- (failure)
autoload_by_someone_else ele=0x55f33b806a30 ele->state=(nil)
autoload_by_someone_else ele=0x55f33b6e8f40 ele->state=(nil)
check_autoload_required 2
check_autoload_required 3
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd6779d780
check_autoload_required 1
autoload_by_someone_else ele=0x55f33b806a30 ele->state=0x7fdd6779d780
check_autoload_required 1
ele=0x55f33b806a30 ele->state=0x7fdd6779d780 = NULL
ele=0x55f33b6e8f40 ele->state=0x7fdd678be780 = NULL
../test.rb:12:in `join': No live threads left. Deadlock? (fatal)
3 threads, 3 sleeps current:0x000055f33b771250 main thread:0x000055f33b66e090
* #<Thread:0x00007fdd6a2cb0b0 sleep_forever>
rb_thread_t:0x000055f33b66e090 native:0x00007fdd6a71c3c0 int:0
* #<Thread:0x00007fdd676e0090 ../test.rb:9 sleep_forever>
rb_thread_t:0x000055f33b770ff0 native:0x00007fdd6789e640 int:1 mutex:0x000055f33b7c5100 cond:1
depended by: tb_thread_id:0x000055f33b66e090
* #<Thread:0x00007fdd676e1238 ../test.rb:10 sleep_forever>
rb_thread_t:0x000055f33b771250 native:0x00007fdd679bf640 int:0
from ../test.rb:12:in `block in <main>'
from ../test.rb:4:in `times'
from ../test.rb:4:in `<main>'
make: *** [uncommon.mk:1250: runruby] Error 1
Example failure of case (2):
[0x7f175fe5b0c8] rb_autoload_str mod=Object id=Foo file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str const_set mod=Object id=Foo file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str mod=Object id=Bar file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe5b0c8] rb_autoload_str const_set mod=Object id=Bar file="/home/samuel/Projects/ioquatix/ruby/foobar.rb"
[0x7f175fe61d88] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] Assigning constants...
[0x7f175fe61d88] rb_const_search_from value == Qundef -> autoloading
[0x7f175fe61e78] autoload_const_set name=:Foo value=Foo
[0x7f175fe61e78] autoload_const_set name=:Bar value=Bar
#<Thread:0x00007f175fe61d88 ../test.rb:11 run> terminated with exception (report_on_exception is true):
../test.rb:11:in `block (2 levels) in <main>': uninitialized constant Bar (NameError)
../test.rb:11:in `block (2 levels) in <main>': uninitialized constant Bar (NameError)
make: *** [uncommon.mk:1250: runruby] Error 1
These failures are very uncommon but it does impact Ruby as far back as 2.7, and probably earlier.
Files
Updated by ioquatix (Samuel Williams) over 2 years ago
- File 0001-Add-RUBY_VM_CRITICAL_SECTION-for-detecting-unexpecte.patch 0001-Add-RUBY_VM_CRITICAL_SECTION-for-detecting-unexpecte.patch added
- File 0002-Prevent-race-between-GC-mark-and-autoload-setup.patch 0002-Prevent-race-between-GC-mark-and-autoload-setup.patch added
- File 0003-Protect-race-on-autoload-state.patch 0003-Protect-race-on-autoload-state.patch added
- File 0004-Increase-timeout-for-debug-tests.patch 0004-Increase-timeout-for-debug-tests.patch added
- File 0005-Revert-removal-of-non-conditional-xfree.patch 0005-Revert-removal-of-non-conditional-xfree.patch added
@normalperson (Eric Wong) I've attached patch files. If you have time, I'd really appreciate your feedback on this. Specifically, patch 5 - freeing this unconditionally causes the Ruby VM to hang in test-bundler-parallel
and I'm not sure why. Here is a backtrace of a hung process:
* thread #1, name = 'ruby', stop reason = signal SIGSTOP
* frame #0: 0x00007f91a51972cb libc.so.6`__lll_lock_wait_private + 43
frame #1: 0x00007f91a51ab6e0 libc.so.6`__libc_calloc + 816
frame #2: 0x00007f91a5542823 ld-linux-x86-64.so.2`_dl_new_object + 115
frame #3: 0x00007f91a553ded3 ld-linux-x86-64.so.2`_dl_map_object_from_fd + 179
frame #4: 0x00007f91a553f7ac ld-linux-x86-64.so.2`_dl_map_object + 508
frame #5: 0x00007f91a55435b9 ld-linux-x86-64.so.2`dl_open_worker_begin + 169
frame #6: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
frame #7: 0x00007f91a5542d7b ld-linux-x86-64.so.2`dl_open_worker + 59
frame #8: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
frame #9: 0x00007f91a554315d ld-linux-x86-64.so.2`_dl_open + 173
frame #10: 0x00007f91a5268211 libc.so.6`do_dlopen + 65
frame #11: 0x00007f91a5267e18 libc.so.6`_dl_catch_exception + 136
frame #12: 0x00007f91a5267ee3 libc.so.6`_dl_catch_error + 51
frame #13: 0x00007f91a5268170 libc.so.6`dlerror_run + 64
frame #14: 0x00007f91a526832a libc.so.6`__libc_dlopen_mode + 74
frame #15: 0x00007f91a521f191 libc.so.6`__libc_unwind_link_get + 81
frame #16: 0x00007f91a522c867 libc.so.6`__backtrace + 39
frame #17: 0x000056542d70de17 ruby`rb_vm_bugreport at vm_dump.c:762:13
frame #18: 0x000056542d7a454a ruby`rb_bug_for_fatal_signal at error.c:822:5
frame #19: 0x000056542d663c49 ruby`sigsegv(sig=<unavailable>, info=<unavailable>, ctx=<unavailable>) at signal.c:964:5
frame #20: 0x00007f91a514f560 libc.so.6`__restore_rt
frame #21: 0x00007f91a51a6ef1 libc.so.6`malloc_consolidate + 257
frame #22: 0x00007f91a51a8450 libc.so.6`_int_free + 2032
frame #23: 0x00007f91a51aabe3 libc.so.6`free + 115
frame #24: 0x00007f91a19482bd libyaml-0.so.2`yaml_parser_delete + 77
frame #25: 0x00007f91a4c49bb9 psych.so`dealloc(ptr=0x000056542efbf0e0) at psych_parser.c:48:5
frame #26: 0x000056542d525a2c ruby`finalize_deferred_heap_pages at gc.c:4241:2
frame #27: 0x000056542d5307f6 ruby`rb_objspace_call_finalizer at gc.c:4427:5
frame #28: 0x000056542d509de0 ruby`rb_ec_cleanup at eval.c:164:5
frame #29: 0x000056542d50a65c ruby`ruby_run_node at eval.c:321:12
frame #30: 0x000056542d5041fa ruby`rb_main(argc=7, argv=0x00007ffeffea7e48) at main.c:47:9
frame #31: 0x000056542d504230 ruby`main(argc=7, argv=0x00007ffeffea7e48) at main.c:56:10
frame #32: 0x00007f91a513a310 libc.so.6`__libc_start_call_main + 128
frame #33: 0x00007f91a513a3c1 libc.so.6`__libc_start_main@@GLIBC_2.34 + 129
frame #34: 0x000056542d504095 ruby`_start + 37
Updated by ioquatix (Samuel Williams) over 2 years ago
After debugging the above issue I found a use-after-free bug in
static void
autoload_c_free(void *ptr)
{
struct autoload_const *ac = ptr;
ccan_list_del(&ac->cnode); // <- can mess up other unrelated objects if parent linked list head is already freed.
xfree(ac);
}
This only becomes an issue because we fixed:
static void
autoload_i_free(void *ptr)
{
struct autoload_data_i *p = ptr;
/* we may leak some memory at VM shutdown time, no big deal */
if (ccan_list_empty(&p->constants)) {
xfree(p);
}
}
In the above code, p->constants
was (almost) never empty so the underlying data was never freed.
Updated by ioquatix (Samuel Williams) over 2 years ago
Updated by ioquatix (Samuel Williams) over 2 years ago
Okay, so I suspected some issue with autoload_delete
and finally found how to reproduce it.
static void
autoload_delete(VALUE mod, ID id)
{
RB_VM_LOCK_ENTER();
st_data_t val, load = 0, n = id;
if (st_lookup(RCLASS_IV_TBL(mod), (st_data_t)autoload, &val)) {
struct st_table *tbl = check_autoload_table((VALUE)val);
struct autoload_data *ele;
struct autoload_const *ac;
st_delete(tbl, &n, &load);
/* Qfalse can indicate already deleted */
if (load != Qfalse) {
ele = get_autoload_data((VALUE)load, &ac);
VM_ASSERT(ele);
/*
* we must delete here to avoid "already initialized" warnings
* with parallel autoload. Using list_del_init here so list_del
* works in autoload_c_free
*/
ccan_list_del_init(&ac->cnode);
if (tbl->num_entries == 0) {
n = autoload;
st_delete(RCLASS_IV_TBL(mod), &n, &val);
}
}
}
RB_VM_LOCK_LEAVE();
rb_thread_schedule(); // add this to create odd behaviour (eventually).
}
Test case:
module MyModule; end
file = "autoload_does_not_exist.rb"
190.times do
puts "."
start = false
threads = Array.new(10) do
Thread.new do
Thread.pass until start
begin
MyModule.autoload :FileDoesNotExist, file
Thread.pass
MyModule::FileDoesNotExist
rescue LoadError => e
e
ensure
Thread.pass
end
end
end
start = true
threads.each { |t|
raise "Failed" unless t.value.is_a?(LoadError)
}
end
Failure example:
.
#<Thread:0x00007fafd3665c08 ../test.rb:35 run> terminated with exception (report_on_exception is true):
../test.rb:40:in `block (3 levels) in <main>': uninitialized constant MyModule::FileDoesNotExist (NameError)
../test.rb:40:in `block (3 levels) in <main>': uninitialized constant MyModule::FileDoesNotExist (NameError)
make: *** [uncommon.mk:1250: runruby] Error 1
Updated by ioquatix (Samuel Williams) over 2 years ago
@fxn (Xavier Noria) said:
Once the trigger has been executed, it is gone, the way I see it. How is that represented internally, I don't care and should not drive the design of the interface in my view. To me, autoload is not metadata of the module you need to preserve and autoload? should say if there's an autoload not yet triggered in the class/module or ancestors or not.
Maybe, but how it actually works is a lot more nuanced.
Consider multiple threads trying to require a feature which is slow to load. Let's say it's a web sever, and the first thread is cancelled because the request is cancelled. Should this cause all subsequent attempts to use the constant to fail? If you think of it like a trigger, the first thread fires the trigger, fails to load the file, and now it's gone and can never be loaded. But if you look at it as more of a gate or doorway, on transitioning through the doorway, you may eventually get the constant, even if prior threads failed to make it all the way through.
The doorway style is safer because a transient failure won't cause the constant load to fail outright and allows subsequent threads to retry. It's also easier from an implementation point of view to NOT mutate the autoload state while it's being loaded which introduces more complexity in the locking and state management. Essentially I prefer immutable autoload state + read-only operations during load. Deleting the state is a write operation which significantly complicates the internal locking logic.
Updated by byroot (Jean Boussier) over 2 years ago
Consider multiple threads trying to require a feature which is slow to load. Let's say it's a web sever, and the first thread is cancelled because the request is cancelled. Should this cause all subsequent attempts to use the constant to fail?
Unless I misunderstood, that's exactly what would happen with explicit require
calls, as the file would end up in $LOADED_FEATURES
, so subsequence require
calls would be a noop.
I'm not gonna pretend it's graceful, but it's somewhat consistent.
Updated by ioquatix (Samuel Williams) over 2 years ago
@byroot (Jean Boussier) Does this align up with what you are saying?
features = $LOADED_FEATURES.dup
path = "./sleepy.rb"
File.write(path, "sleep 5")
thread = Thread.new do
require path
end
sleep 1
thread.kill
thread.join
pp $LOADED_FEATURES - features
Because it's empty.
Updated by byroot (Jean Boussier) over 2 years ago
@ioquatix (Samuel Williams) indeed, my bad.
Updated by normalperson (Eric Wong) over 2 years ago
"ioquatix (Samuel Williams)" noreply@ruby-lang.org wrote:
@normalperson (Eric Wong) I've attached patch files. If you have time, I'd really appreciate your feedback on this. Specifically, patch 5 - freeing this unconditionally causes the Ruby VM to hang in
test-bundler-parallel
and I'm not sure why. Here is a backtrace of a hung process:
Sorry, I don't have time to look at it. Maybe late June or July. Thanks.
Updated by ioquatix (Samuel Williams) over 2 years ago
- Status changed from Open to Closed
- Assignee set to ioquatix (Samuel Williams)
I've confirmed after my PRs, this issue cannot be reproduced on HEAD.