Bug #19580
closedEnsure ruby_xfree won't segfault if called after ruby_vm_destruct
Description
Github PR: https://github.com/ruby/ruby/pull/7663
The POSIX Threads API provides the ability to define a destructor to clean up thread-local storage at thread exit by using pthread_key_create
to bind a storage key and an associated destructor function, and pthread_setspecific
to associate a value with that key. If a C extension is using ruby_xmalloc
and ruby_xfree
to manage memory, then the destructor function will likely call ruby_xfree
.
There is a small window of time as a process is exiting -- after ruby_vm_destruct
but before the process exits -- in which a thread may exit and the destructor function called, leading to a segfault.
Real-world use case¶
The real-world scenario motivating this change is libxml2's pthread code which uses pthread_key_create
to set up a destructor that is called at thread exit to clean up thread-local storage associated with the thread by pthread_setspecific
.
Nokogiri has, since 2009, configured libxml2 to use ruby_xmalloc
and ruby_xfree
so that Ruby's memory management is aware of the allocated memory. As a result, ruby_xfree
is being called by the pthread destructor, which may lead to a segfault that looks like:
[BUG] Segmentation fault at 0x0000000000000420
ruby 3.3.0dev (2023-03-19T06:13:25Z master ca9355e173) [x86_64-linux]
-- Machine register context ------------------------------------------------
RIP: 0x00007f9b6311124e RBP: 0x00007f9b58001530 RSP: 0x00007f9b5d8eed60
RAX: 0x0000000000000000 RBX: 0x00007f9b58001888 RCX: 0x00000000000003ff
RDX: 0x00007f9b5dad62b0 RDI: 0x00007f9b58007090 RSI: 0x0000000000000000
R8: 0x00007f9b5d8eed64 R9: 0x00000000000000ca R10: 0x0000000000000000
R11: 0x0000000000000246 R12: 0x00007f9b58007090 R13: 0x00007f9b62e1bae8
R14: 0x0000000000000004 R15: 0x00007f9b5d8efb58 EFL: 0x0000000000010206
-- C level backtrace information -------------------------------------------
.../3.3.0-dev/lib/libruby.so.3.3(rb_print_backtrace+0xd) [0x7f9b632e6f1f] /home/flavorjones/code/oss/ruby/vm_dump.c:785
.../3.3.0-dev/lib/libruby.so.3.3(rb_vm_bugreport) /home/flavorjones/code/oss/ruby/vm_dump.c:1101
.../3.3.0-dev/lib/libruby.so.3.3(rb_bug_for_fatal_signal+0xf4) [0x7f9b630eead4] /home/flavorjones/code/oss/ruby/error.c:813
.../3.3.0-dev/lib/libruby.so.3.3(sigsegv+0x4d) [0x7f9b63233bbd] /home/flavorjones/code/oss/ruby/signal.c:964
/lib/x86_64-linux-gnu/libc.so.6(0x7f9b62c42520) [0x7f9b62c42520]
.../3.3.0-dev/lib/libruby.so.3.3(ruby_sized_xfree+0x3) [0x7f9b6311124e] /home/flavorjones/code/oss/ruby/gc.c:12666
.../3.3.0-dev/lib/libruby.so.3.3(ruby_sized_xfree) /home/flavorjones/code/oss/ruby/gc.c:12663
.../lib/nokogiri/nokogiri.so(xmlResetError+0x0) [0x7f9b5da6ab4a] .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/error.c:880
.../lib/nokogiri/nokogiri.so(xmlResetError) .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/error.c:873
.../lib/nokogiri/nokogiri.so(xmlResetError) (null):0
.../lib/nokogiri/nokogiri.so(xmlFreeGlobalState+0x14) [0x7f9b5dad62c4] .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/threads.c:548
/lib/x86_64-linux-gnu/libc.so.6(__nptl_deallocate_tsd+0x77) [0x7f9b62c91711] ./nptl/nptl_deallocate_tsd.c:73
/lib/x86_64-linux-gnu/libc.so.6(__nptl_deallocate_tsd) ./nptl/nptl_deallocate_tsd.c:22
/lib/x86_64-linux-gnu/libc.so.6(start_thread+0x17a) [0x7f9b62c949ca] ./nptl/pthread_create.c:453
[0x7f9b62d26a00]
I've reproduced this in Ruby 3.0, 3.1, 3.2, and master by using the following ruby code:
# must have an error in it to cause pthread_setspecific to be called
html = "<div foo='asdf>asdf</div>"
Thread.new { Nokogiri::HTML4::Document.parse(html) }
sleep 3 # THREAD_CACHE_TIME
exit 0
The timing is admittedly hard to reproduce, but Gitlab has seen this in their CI pipelines a few dozen times and it's made easier if the process lifetime is extended by an atexit
-registered function.