Bug #14883

closed

Ruby 2.5 Fails to Build on PowerPC 32-bit (BE)

Added by mingcongbai (Mingcong Bai) almost 6 years ago. Updated over 5 years ago.

Status: Third Party's Issue
Assignee: -
Target version: -
[ruby-core:87711]

Description

When building Ruby 2.5.1 on a PowerPC 32-bit (Big Endian) host, the build fails with a segmentation fault while generating the RDoc documentation. The build log is as follows:

https://pastebin.aosc.io/paste/jJWjWPadcmJeLEkvpgnMqQ

Configure parameters...

...
--enable-shared
--disable-rpath
--with-dbm-type=gdbm_compat

On the same host with the same build environment, Ruby 2.4 builds just fine. The current Git master exhibits the same issue.

Updated by mingcongbai (Mingcong Bai) almost 6 years ago

This issue was introduced by the following commit, as found with git bisect.

214a7f8d49c7b59d06f5e2e3e1a8a3567ab7c570 is the first bad commit
commit 214a7f8d49c7b59d06f5e2e3e1a8a3567ab7c570
Author: hsbt
Date: Tue Sep 12 03:42:54 2017 +0000

Merge rdoc-6.0.0.beta2 from upstream.

  * This version changed lexer used Ripper from lexer based IRB.
    see details: https://github.com/ruby/rdoc/pull/512

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59845 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

:040000 040000 202159c0e7aacd1789cd0cc069c8ed42aa4cc879 a07b6d286d9e01946efe000a64b5c5f2be4e9456 M lib
:040000 040000 defd7de28deef546819fbbe7162489e58b90b939 2363964b0b5be02026591f60706467c512bc8647 M test

Updated by naruse (Yui NARUSE) almost 6 years ago

  • Status changed from Open to Feedback

Your build/SEGV log doesn't have the essential part of the crash log.

Updated by lion (Daming Yang) almost 6 years ago

naruse (Yui NARUSE) wrote:

Your build/SEGV log doesn't have the essential part of the crash log.

Bai's friend here. The missing part of the crash log: https://pastebin.aosc.io/paste/fBNM~Lr~C4vfsvHCpmPJOA

Updated by lion (Daming Yang) almost 6 years ago

I did some further investigation. The story is somewhat longer than I expected.

First, the Ruby-level call stack shows:

...
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:594:in `parse'
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:308:in `get_squashed_tk'
/root/ruby/lib/rdoc/parser/ripper_state_lex.rb:308:in `next'
(End of list)

I thought this might be a null pointer error, but it wasn't. Ruby segfaults even when running a simple program that uses Enumerator#next.

# cat lion_each.rb
fib = Enumerator.new do |y|
  y << "FOO"
  y << "BAR"
end
fib.each do |s|
  puts s
end

runs fine.

# cat lion_next.rb
fib = Enumerator.new do |y|
  y << "FOO"
  y << "BAR"
end
puts fib.next

crashes with a segmentation fault.

Now let us look at the C backtrace:

-- C level backtrace information -------------------------------------------
/root/ruby/ruby(rb_vm_bugreport+0xa0) [0x207f8908] vm_dump.c:703
/root/ruby/ruby(rb_bug_context+0xd8) [0x207ed7e8] error.c:580
/root/ruby/ruby(sigsegv+0x64) [0x206bd3c4] signal.c:928
linux-vdso32.so.1(0x100420) [0x100420]
[0x201a720c]
/root/ruby/ruby(rb_fiber_resume+0x294) [0x207c102c] vm_core.h:1691
/root/ruby/ruby(enumerator_next+0xe4) [0x207e67d4] enumerator.c:705

(According to the memory mapping, 0x201a720c is in libc.so.)

I removed -O3 from $optflags and got a more detailed backtrace, which shows that the failing C library call is here:
https://github.com/ruby/ruby/blob/v2_5_1/cont.c#L827

#else /* not WIN32 */
    ucontext_t *context = &fib->context;
    char *ptr;
    STACK_GROW_DIR_DETECTION;

    getcontext(context);
    ptr = fiber_machine_stack_alloc(size);
    context->uc_link = NULL;
    context->uc_stack.ss_sp = ptr;
    context->uc_stack.ss_size = size;
    fib->ss_sp = ptr;
    fib->ss_size = size;
    makecontext(context, rb_fiber_start, 0);  /* !! HERE !! */
    sec->machine.stack_start = (VALUE*)(ptr + STACK_DIR_UPPER(0, size));
    sec->machine.stack_maxsize = size - RB_PAGE_SIZE;
#endif

Actually, it was getcontext() that failed. I recompiled Ruby with a small modification to check the return value of getcontext(): it was -1 and errno was EPERM (1).
So the direct cause of the segmentation fault is clear and simple: context was never filled with valid data, so makecontext() set up, and Ruby then switched to, an undefined context, probably reading or executing through null or garbage pointers.

This bug was introduced in

commit c462c50d6336a0c7823ff8cfce51f8bff22eeff6
Author: ko1 <ko1@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date:   Wed May 5 18:37:37 2010 +0000

    * cont.c: apply FIBER_USE_NATIVE patch.  This patch improve
      Fiber context switching cost using system APIs.  Detail comments
      are written in cont.c.
    
    git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@27635 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

and has been present since v1_9_2_preview3.

I compiled trunk (5235d57b0700c9f67892847d7f8fa055e98ca4d6, ahead of v2_6_0_preview2). The crash still reproduces there, though the failing call has moved to a different place:

/root/ruby/miniruby(fiber_initialize_machine_stack_context+0x90) [0x20622a3c] cont.c:896
/root/ruby/miniruby(fiber_store+0x8c) [0x20623e04] cont.c:1668
/root/ruby/miniruby(fiber_switch+0x1d0) [0x2062411c] cont.c:1743
/root/ruby/miniruby(rb_fiber_resume+0xdc) [0x206244d0] cont.c:1824
...

Conclusion

  1. RDoc uses Enumerator#next, which triggers this bug.
  2. For reasons not yet known, getcontext() failed.
  3. As a result, Ruby could not create the fiber correctly.
  4. Ruby switched to the malformed user context anyway and segfaulted.

For friendlier error handling, I suggest checking the return value of getcontext():

if (getcontext(context) < 0) {
    rb_bug("can't get user context: %s", strerror(errno));
}
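To make the suggestion concrete outside of cont.c, here is a minimal standalone sketch of the same pattern: check getcontext() before handing the context to makecontext(), and check swapcontext() as well. Everything named here (run_checked_context, sketch_entry, sketch_entry_ran, SKETCH_STACK_SIZE) is hypothetical and made up for illustration. It uses malloc() and fprintf() where Ruby would use fiber_machine_stack_alloc() and rb_bug(), so treat it as an illustration under those assumptions, not as the actual patch.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical stack size, chosen only for this sketch. */
#define SKETCH_STACK_SIZE (64 * 1024)

static ucontext_t sketch_main_ctx; /* context we return to */
static ucontext_t sketch_co_ctx;   /* context of the new "fiber" */
int sketch_entry_ran = 0;          /* set once the entry function has run */

/* Entry point of the new context. Returning from it resumes uc_link. */
static void sketch_entry(void)
{
    sketch_entry_ran = 1;
}

/* Returns 0 on success, or the errno value of the call that failed. */
int run_checked_context(void)
{
    char *stack;
    int saved;

    /* The check that the crashing code path was missing. */
    if (getcontext(&sketch_co_ctx) < 0) {
        saved = errno;
        fprintf(stderr, "can't get user context: %s\n", strerror(saved));
        return saved;
    }

    stack = malloc(SKETCH_STACK_SIZE);
    if (stack == NULL)
        return ENOMEM;

    /* Only a context filled in by a successful getcontext() is valid here. */
    sketch_co_ctx.uc_link = &sketch_main_ctx;
    sketch_co_ctx.uc_stack.ss_sp = stack;
    sketch_co_ctx.uc_stack.ss_size = SKETCH_STACK_SIZE;
    makecontext(&sketch_co_ctx, sketch_entry, 0);

    /* Switch to the new context; control comes back when sketch_entry returns. */
    if (swapcontext(&sketch_main_ctx, &sketch_co_ctx) < 0) {
        saved = errno;
        free(stack);
        return saved;
    }

    free(stack);
    return 0;
}
```

Calling run_checked_context() from a small main() should return 0 where getcontext() works. In an environment where the call is refused, as in the container described in this report, it should return the errno value and print a message instead of switching to an uninitialised context.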

(Would you like to rephrase the title a little bit, Bai, wouldn't yeh?) (O v O )

Updated by lion (Daming Yang) almost 6 years ago

contd.

As for the glibc function getcontext(), the 'root issue' of this bug, we can test it in isolation like this:

// cat lion_getcontext.c
#include <ucontext.h>
#include <errno.h>
int main() {
    ucontext_t c;
    if (getcontext(&c) < 0)
        return errno;
    return 0;
}
// gcc lion_getcontext.c && ./a.out; echo $?

Note that Bai's 'host' is a 32-bit (PPC32) userspace container (systemd-nspawn) running on a 64-bit (PPC64) kernel.

Although the container is just a kind of virtualisation, I think it matters when it comes to glibc and syscalls.

After running the test program, I got

  • 0 on my amd64 laptop,
  • 0 on Bai's PPC64 machine,
  • 1 (EPERM) on the PPC32-over-PPC64 container. This is why Ruby segfaulted there.

Running strace ./a.out to trace the syscalls shows, interestingly, that the underlying syscalls differ:

AMD64: rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
PPC64: rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
PPC32-over-PPC64: swapcontext(0xfffa8dc8, 0) = -1 EPERM (Operation not permitted)

But this is odd.
It seems that neither https://github.com/torvalds/linux/blob/v4.8/arch/powerpc/kernel/signal_64.c#L588
nor https://github.com/torvalds/linux/blob/v4.8/arch/powerpc/kernel/signal_32.c#L1121
contains a return -EPERM; or anything equivalent.

Why was the errno EPERM? SELinux is disabled.
Is it a Linux capability issue that only shows up when programs run in a systemd-nspawn container?

Updated by lion (Daming Yang) almost 6 years ago

Bingo. It is indeed a systemd-nspawn container issue.

systemd-nspawn has used a libseccomp whitelist instead of a blacklist by default since last year.
rt_sigprocmask() & sigprocmask() are on the whitelist, but swapcontext() is not.

I reported the container issue at https://github.com/systemd/systemd/issues/9485

You may add --system-call-filter=swapcontext on your systemd-nspawn command line as a workaround.

I also opened https://github.com/ruby/ruby/pull/1903, an enhancement that reports the error when getcontext() fails.


Updated by mingcongbai (Mingcong Bai) over 5 years ago

  • Status changed from Feedback to Third Party's Issue

Issue with systemd containers, not with Ruby.
