Bug #14883
Closed
Ruby 2.5 Fails to Build on PowerPC 32-bit (BE)
Description
When building Ruby 2.5.1 on a PowerPC 32-bit (Big Endian) host, the build fails with a segmentation fault while generating the RDoc documentation. The build log is as follows:
https://pastebin.aosc.io/paste/jJWjWPadcmJeLEkvpgnMqQ
Configure parameters...
...
--enable-shared
--disable-rpath
--with-dbm-type=gdbm_compat
On the same host with the same build environment, Ruby 2.4 builds just fine. The current Git master exhibits the same issue.
Updated by mingcongbai (Mingcong Bai) over 6 years ago
This issue was introduced by the following commit, as found with git bisect:
214a7f8d49c7b59d06f5e2e3e1a8a3567ab7c570 is the first bad commit
commit 214a7f8d49c7b59d06f5e2e3e1a8a3567ab7c570
Author: hsbt <hsbt@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Tue Sep 12 03:42:54 2017 +0000
Merge rdoc-6.0.0.beta2 from upstream.
* This version changed lexer used Ripper from lexer based IRB.
see details: https://github.com/ruby/rdoc/pull/512
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59845 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
:040000 040000 202159c0e7aacd1789cd0cc069c8ed42aa4cc879 a07b6d286d9e01946efe000a64b5c5f2be4e9456 M lib
:040000 040000 defd7de28deef546819fbbe7162489e58b90b939 2363964b0b5be02026591f60706467c512bc8647 M test
Updated by naruse (Yui NARUSE) over 6 years ago
- Status changed from Open to Feedback
Your build/SEGV log doesn't have the essential part of the crash log.
Updated by lion (Daming Yang) over 6 years ago
naruse (Yui NARUSE) wrote:
Your build/SEGV log doesn't have the essential part of the crash log.
Bai's friend here. The missing part of the crash log: https://pastebin.aosc.io/paste/fBNM~Lr~C4vfsvHCpmPJOA
Updated by lion (Daming Yang) over 6 years ago
I did some further investigation. The story is a bit longer than I thought.
First, the Ruby-level call stack shows:
...
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:594:in `parse'
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:308:in `get_squashed_tk'
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:308:in `next'
(End of list)
I thought it might be a null pointer error, but it wasn't. Ruby segfaults even when executing a simple Ruby program that uses Enumerator#next.
# cat lion_each.rb
fib = Enumerator.new do |y|
y << "FOO"
y << "BAR"
end
fib.each do |s|
puts s
end
This one works fine.
# cat lion_next.rb
fib = Enumerator.new do |y|
y << "FOO"
y << "BAR"
end
puts fib.next
This one ends in a segmentation fault.
Now let us look at the C backtrace:
-- C level backtrace information -------------------------------------------
/root/ruby/ruby(rb_vm_bugreport+0xa0) [0x207f8908] vm_dump.c:703
/root/ruby/ruby(rb_bug_context+0xd8) [0x207ed7e8] error.c:580
/root/ruby/ruby(sigsegv+0x64) [0x206bd3c4] signal.c:928
linux-vdso32.so.1(0x100420) [0x100420]
[0x201a720c]
/root/ruby/ruby(rb_fiber_resume+0x294) [0x207c102c] vm_core.h:1691
/root/ruby/ruby(enumerator_next+0xe4) [0x207e67d4] enumerator.c:705
(According to the memory mapping, 0x201a720c is in libc.so.)
I dropped -O3 from $optflags and got a more detailed backtrace, which showed that the failing C library call is here:
https://github.com/ruby/ruby/blob/v2_5_1/cont.c#L827
#else /* not WIN32 */
ucontext_t *context = &fib->context;
char *ptr;
STACK_GROW_DIR_DETECTION;
getcontext(context);
ptr = fiber_machine_stack_alloc(size);
context->uc_link = NULL;
context->uc_stack.ss_sp = ptr;
context->uc_stack.ss_size = size;
fib->ss_sp = ptr;
fib->ss_size = size;
makecontext(context, rb_fiber_start, 0); /* !! HERE !! */
sec->machine.stack_start = (VALUE*)(ptr + STACK_DIR_UPPER(0, size));
sec->machine.stack_maxsize = size - RB_PAGE_SIZE;
#endif
The getcontext() call had actually failed. I recompiled Ruby with a small modification to check the return value of getcontext(): it was -1 and errno was EPERM (1).
So the direct cause of the segmentation fault is clear and simple: context was never filled with valid data, yet makecontext() tried to set up, and Ruby then switched to, an unknown "context", probably reading or executing through null pointers.
This bug was introduced by the following commit:
commit c462c50d6336a0c7823ff8cfce51f8bff22eeff6
Author: ko1 <ko1@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Wed May 5 18:37:37 2010 +0000
* cont.c: apply FIBER_USE_NATIVE patch. This patch improve
Fiber context switching cost using system APIs. Detail comments
are written in cont.c.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@27635 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
This commit has been present since v1_9_2_preview3.
I compiled trunk (5235d57b0700c9f67892847d7f8fa055e98ca4d6, ahead of v2_6_0_preview2). The crash still reproduces there, though the failing code has moved to another place:
/root/ruby/miniruby(fiber_initialize_machine_stack_context+0x90) [0x20622a3c] cont.c:896
/root/ruby/miniruby(fiber_store+0x8c) [0x20623e04] cont.c:1668
/root/ruby/miniruby(fiber_switch+0x1d0) [0x2062411c] cont.c:1743
/root/ruby/miniruby(rb_fiber_resume+0xdc) [0x206244d0] cont.c:1824
...
Conclusion
- RDoc uses Enumerator and triggers this bug.
- Due to some unknown reason, getcontext() failed,
- thus Ruby could not create a fibre correctly.
- Ruby switched to the malformed user context anyway, SEGV.
For more friendly error handling, I suggest we check the return value of getcontext():
if (getcontext(context) < 0) {
rb_bug("can't get user context: %s", strerror(errno));
}
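To show the check in a form that can be tried outside Ruby, here is a minimal standalone sketch, assuming a glibc system that provides the ucontext API. It is not the code from cont.c: the names run_on_fresh_context, fiber_entry and SKETCH_STACK_SIZE are hypothetical, the stack comes from malloc() rather than fiber_machine_stack_alloc(), and uc_link is used to return to the caller, whereas cont.c sets it to NULL.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical stack size for this sketch; not the value Ruby uses. */
#define SKETCH_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx;      /* context to resume when the entry returns */
static ucontext_t fiber_ctx;     /* the freshly created context */
static volatile int entry_ran;   /* set to 1 by fiber_entry */

static void fiber_entry(void)
{
    entry_ran = 1;
    /* Returning from here resumes main_ctx through uc_link. */
}

/* Returns 0 if the entry function ran on its own stack, -1 on any failure. */
int run_on_fresh_context(void)
{
    char *stack;

    entry_ran = 0;

    /* The point of the sketch: never call makecontext() on a context
     * that getcontext() failed to initialise. */
    if (getcontext(&fiber_ctx) < 0) {
        fprintf(stderr, "can't get user context: %s\n", strerror(errno));
        return -1;
    }

    stack = malloc(SKETCH_STACK_SIZE);
    if (stack == NULL) {
        return -1;
    }

    fiber_ctx.uc_link = &main_ctx;
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = SKETCH_STACK_SIZE;
    makecontext(&fiber_ctx, fiber_entry, 0);

    /* Save the current context in main_ctx and jump to fiber_entry. */
    if (swapcontext(&main_ctx, &fiber_ctx) < 0) {
        free(stack);
        return -1;
    }

    /* Back on the original stack, so the fiber stack can be released. */
    free(stack);
    return entry_ran ? 0 : -1;
}
```

Calling run_on_fresh_context() from main() and using its result as the exit status gives a quick probe: in an environment where getcontext() fails, the function reports the errno text and returns -1 instead of crashing inside the context switch.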
(Would you like to rephrase the title a little bit, Bai, wouldn't yeh?) (O v O )
Updated by lion (Daming Yang) over 6 years ago
contd.
As for the glibc function getcontext(), 'the root issue' of this bug, we can check it this way:
// cat lion_getcontext.c
#include <ucontext.h>
#include <errno.h>
int main() {
ucontext_t c;
if (getcontext(&c) < 0)
return errno;
return 0;
}
// gcc lion_getcontext.c && ./a.out; echo $?
Note that the 'host' Bai has is a 32-bit (PPC32) userspace container (systemd-nspawn) on a 64-bit (PPC64) kernel.
Although the container is a kind of virtualisation, I think it does matter when it comes to glibc and syscalls.
After running the test program, I got
- 0 on my amd64 laptop,
- 0 on Bai's PPC64 machine,
- 1 (EPERM) on the PPC32-over-PPC64 container. This is why Ruby segfaulted there.
Use strace ./a.out to trace the syscalls. Interestingly, the underlying syscalls are different:
AMD64: rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
PPC64: rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
PPC32-over-PPC64: swapcontext(0xfffa8dc8, 0) = -1 EPERM (Operation not permitted)
But that was weird.
It seems that neither https://github.com/torvalds/linux/blob/v4.8/arch/powerpc/kernel/signal_64.c#L588
nor https://github.com/torvalds/linux/blob/v4.8/arch/powerpc/kernel/signal_32.c#L1121
contains return -EPERM; or anything similar.
Why was the errno EPERM? SELinux is disabled.
Is it a Linux capability issue that only shows up when programs run in a systemd-nspawn container?
Updated by lion (Daming Yang) over 6 years ago
Bingo. It is indeed an issue with the systemd-nspawn container.
systemd-nspawn has used a libseccomp whitelist instead of a blacklist by default since last year.
rt_sigprocmask() and sigprocmask() are on the whitelist, but swapcontext() is not.
I reported the container issue at https://github.com/systemd/systemd/issues/9485
As a workaround, you may add --system-call-filter=swapcontext to your systemd-nspawn command line.
https://github.com/ruby/ruby/pull/1903
An enhancement for reporting getcontext() errors.
Updated by mingcongbai (Mingcong Bai) about 6 years ago
- Status changed from Feedback to Third Party's Issue
Issue with systemd containers, not with Ruby.