Bug #5244

Continuation causes Bus Error on Debian sparc

Added by Lucas Nussbaum over 2 years ago. Updated over 2 years ago.

[ruby-core:39162]
Status:Closed
Priority:Normal
Assignee:Naohisa Goto
Category:-
Target version:-
ruby -v:- Backport:

Description

Hi,

$ ./miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -rcontinuation -e 'callcc { |c| c.call }'
-e:1: [BUG] Bus Error
ruby 1.9.3dev (2011-08-26) [sparc-linux]

-- Control frame information -----------------------------------------------
c:0004 p:---- s:0009 b:0009 l:000008 d:000008 CFUNC :callcc
c:0003 p:0009 s:0006 b:0006 l:000fcc d:001d74 EVAL -e:1
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:000fcc d:000fcc TOP

-- Ruby level backtrace information ----------------------------------------
-e:1:in `<main>'
-e:1:in `callcc'

-- C level backtrace information -------------------------------------------
Bus error

gdb says:
(gdb) run -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -rcontinuation -e 'callcc { |c| c.call }'
Starting program: /home/lucas/ruby1.9.1-1.9.3~preview1+svn33077/miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -rcontinuation -e 'callcc { |c| c.call }'
[Thread debugging using libthread_db enabled]
[New Thread 0xf7fc7b70 (LWP 31418)]
[Thread 0xf7fc7b70 (LWP 31418) exited]
process 31417 is executing new program: /home/lucas/ruby1.9.1-1.9.3~preview1+svn33077/ruby1.9.1
[Thread debugging using libthread_db enabled]
[New Thread 0xf79e5b70 (LWP 31419)]

Program received signal SIGBUS, Bus error.
0xf7f4d304 in cont_capture (stat=Cannot access memory at address 0x49
) at cont.c:439
439 if (ruby_setjmp(cont->jmpbuf)) {

(gdb) print cont
Cannot access memory at address 0xfffffff9

ruby.patch (480 Bytes) Jurij Smakov, 10/19/2011 08:05 AM

flush_windows.patch (2.19 KB) Jurij Smakov, 11/03/2011 07:58 AM

Associated revisions

Revision 33492
Added by Nobuyoshi Nakada over 2 years ago

  • include/ruby/defines.h (flush_register_windows): use software trap on Debian Sparc 32-bit userspace. [Bug #5244]

Revision 33757
Added by Naohisa Goto over 2 years ago

  • include/ruby/defines.h (FLUSH_REGISTER_WINDOWS): move sparc asm code to a separate file sparc.c for preventing inlining optimization. Patched by Jurij Smakov. [Bug #5244]
  • sparc.c (rb_sparc_flush_register_windows): ditto.
  • configure.in: ditto.

History

#1 Updated by Jurij Smakov over 2 years ago

I poked at it a bit. While I don't fully understand what's going on, it looks like it's using setjmp/longjmp, and one thing in the setjmp man page caught my eye: "setjmp() saves the stack context/environment in env for later use by longjmp(3). The stack context will be invalidated if the function which called setjmp() returns." I think that such invalidation takes place here. Function rb_callcc (in cont.c) does the following:

rb_callcc(VALUE self)
{
    volatile int called;
    volatile VALUE val = cont_capture(&called);

    if (called) {
        return val;
    }
    else {
        return rb_yield(val);
    }
}

In cont_capture() the _setjmp is invoked via

if (ruby_setjmp(cont->jmpbuf)) {
    volatile VALUE value;

    value = cont->value;
    if (cont->argc == -1) rb_exc_raise(value);
    cont->value = Qnil;
    *stat = 1;
    return value;
}
else {
    *stat = 0;
    return cont->self;
}

So, after invoking setjmp, cont_capture returns which, according to the man page, should invalidate the saved stack context. Later, when something in the rb_yield(val) call chain tries to do the longjmp, we arrive at the location where _setjmp was originally called (in cont_capture) with a smashed stack, resulting in a SIGBUS.

The fact that it works for other arches (and even on sparc, if I rebuild everything with -O0 instead of -O2) is somewhat surprising, but it might be that it is just working "by accident" in most cases (i.e. the saved stack is preserved, even though it's not guaranteed).
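
For reference, the man-page rule quoted above can be illustrated with a minimal C sketch of the well-defined case, where the function that called setjmp has not yet returned when longjmp runs. The names escape_point, dive and escape_from_depth are hypothetical and are not taken from cont.c; this is only an illustration of the rule, not of what Ruby does.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf escape_point;

/* Recurse `depth` frames down, then jump back to the setjmp site. */
static void dive(int depth)
{
    if (depth == 0)
        longjmp(escape_point, 1);   /* does not return */
    dive(depth - 1);
}

/*
 * Returns 1 when control came back through longjmp.
 * The frame that called setjmp is still live while dive() runs,
 * which is the case the man page allows.
 */
int escape_from_depth(int depth)
{
    assert(depth >= 0);             /* negative depth would recurse forever */
    if (setjmp(escape_point) != 0)
        return 1;                   /* reached via longjmp */
    dive(depth);
    return 0;                       /* not reached: dive() always jumps */
}
```

Calling longjmp on escape_point after escape_from_depth() has returned would be the undefined case the man page warns about.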

#2 Updated by Yui NARUSE over 2 years ago

  • Status changed from Open to Assigned
  • Assignee set to Naohisa Goto

#3 Updated by Naohisa Goto over 2 years ago

  • Status changed from Assigned to Feedback

This cannot be reproduced on Solaris10 on sparc, with both Sun cc and gcc 4.4.3.
(using svn ruby_1_9_3 branch r33165)

The bug might depend on OS (kernel), gcc and/or libc.
Which version of OS (kernel), gcc, and libc do you use?
Could you try rebuilding with -O0 option?

#4 Updated by Lucas Nussbaum over 2 years ago

You can find a full build log at https://buildd.debian.org/status/fetch.php?pkg=ruby1.9.1&arch=sparc&ver=1.9.3~preview1%2Bsvn33077-3&stamp=1314689360

Kernel: Linux lebrun 2.6.32-5-sparc64-smp #1 SMP Tue Jun 14 12:44:14 UTC 2011 sparc GNU/Linux
GNU Libc 2.13
GCC 4.6.1

Jurij Smakov said above that it works fine when built with -O0

#5 Updated by Jurij Smakov over 2 years ago

My kernel and toolchain versions are slightly different because I'm running Debian sid, but I don't think that's the issue here. The main problem is that the approach used in the implementation of continuations is not guaranteed to work (even though it may work in the vast majority of cases). Quoting http://en.wikipedia.org/wiki/Setjmp.h#Caveats_and_limitations :

"If the function in which setjmp was called returns, it is no longer possible to safely use longjmp with the corresponding jmp_buf object. This is because the stack frame is invalidated when the function returns. Calling longjmp restores the stack pointer, which—because the function returned—would point to a non-existent and potentially overwritten/corrupted stack frame.[4][5]"

I've verified that ruby_setjmp() translates to a simple _setjmp() in preprocessed code, and after calling it the function returns immediately.

#6 Updated by Shugo Maeda over 2 years ago

  • ruby -v changed from 1.9.3 to -

Hi,

2011/9/7 Jurij Smakov jurij@wooyd.org:

"If the function in which setjmp was called returns, it is no longer possible to safely use longjmp with the corresponding jmp_buf object. This is because the stack frame is invalidated when the function returns. Calling longjmp restores the stack pointer, which—because the function returned—would point to a non-existent and potentially overwritten/corrupted stack frame.[4][5]"

I've verified that ruby_setjmp() translates to a simple _setjmp() in preprocessed code, and after calling it the function returns immediately.

Ruby's callcc copies stack to heap, then calls setjmp(). Stack frames
are restored from the copy before longjmp().

--
Shugo Maeda
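
The copy-then-restore idea described in this comment can be sketched in C on an ordinary byte buffer standing in for the machine stack. This is a hedged illustration only: struct saved_region and the region_* functions are hypothetical names, not Ruby's cont.c code, and a real implementation works on the live stack rather than on a plain array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the saved part of a continuation. */
struct saved_region {
    unsigned char *copy;   /* heap copy of the region */
    unsigned char *src;    /* where the region lives */
    size_t len;            /* number of bytes saved */
};

/* Copy len bytes starting at src to the heap. Returns 0 on success. */
int region_save(struct saved_region *r, void *src, size_t len)
{
    assert(src != NULL && len > 0);
    r->copy = malloc(len);
    if (r->copy == NULL)
        return -1;
    memcpy(r->copy, src, len);
    r->src = src;
    r->len = len;
    return 0;
}

/* Write the heap copy back over the original location. */
void region_restore(const struct saved_region *r)
{
    memcpy(r->src, r->copy, r->len);
}

/* Release the heap copy. */
void region_free(struct saved_region *r)
{
    free(r->copy);
    r->copy = NULL;
}
```

The restored bytes can only be as correct as the memory contents at the moment region_save() ran, so anything still held only in registers at that point is not captured by the copy.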

#7 Updated by Naohisa Goto over 2 years ago

The bug does not occur with older Debian sparc running on qemu.

  1. download image http://people.debian.org/~aurel32/qemu/sparc/debian_etch_sparc_small.qcow2
  2. qemu-system-sparc -hda debian_etch_sparc_small.qcow2 -M SS-10 -m 1G
  3. change /etc/apt/sources.list
  4. install subversion and many packages by using aptitude
  5. manually install yaml-0.1.4.tar.gz and libffi-3.0.10.tar.gz
  6. svn co http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_9_3 (fetched revision r33190)
  7. autoconf; ./configure --prefix=/home/user/ruby/193 optflags="-O2"
  8. make
  9. ./miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -rcontinuation -e 'callcc { |c| c.call }'
  10. finished with exit code 0

$ uname -a
Linux debian-sparc 2.6.18-6-sparc32 #1 Fri Dec 12 16:29:52 UTC 2008 sparc GNU/Linux
$ dpkg -l gcc libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name Version Description
+++-==============-==================-============================================
ii gcc 4.1.1-15 The GNU C compiler
ii libc6 2.3.6.ds1-13etch10 GNU C Library: Shared libraries

The bug might be specific to the svn revision 33077 and might have already been fixed in the latest svn revision.

#8 Updated by Lucas Nussbaum over 2 years ago

The problem is that it's quite hard to investigate this using qemu, because Debian dropped support for sparcv8 after etch, and qemu doesn't have working support for anything > sparcv8.

#9 Updated by Lucas Nussbaum over 2 years ago

I took a look on a Debian porterbox.

When building everything with -O2, except cont.c with -O0, it works.
With cont.c built with -O1, it fails.

However, what I don't understand is that it also works when building with -O0 plus all the optimizations that are normally enabled by -O1 (determined using gcc -Q -O1 --help=optimizers).

#10 Updated by Lucas Nussbaum over 2 years ago

It fails with gcc 4.4 and 4.5 too (in addition to 4.6).

#11 Updated by Jurij Smakov over 2 years ago

I looked at it some more (ruby1.9.1-1.9.3~preview1+svn33236 now), and tried to figure out what goes wrong by comparing the binaries compiled with -O0 and -O2. The call to ruby_longjmp does not look suspicious; I've verified that in both cases comp->setjmpbuf gets correctly passed to the function. Eventually we get to the point where the actual long jump is performed, which is __longjmp in eglibc-2.13/sysdeps/sparc/sparc32/__longjmp.S. Actual jumping is done with the following code:

LOC(thread):
    /*
     * Do a "flush register windows trap". The trap handler in the
     * kernel writes all the register windows to their stack slots, and
     * marks them all as invalid (needing to be sucked up from the
     * stack when used). This ensures that all information needed to
     * unwind to these callers is in memory, not in the register
     * windows.
     */
    ta      ST_FLUSH_WINDOWS
#ifdef PTR_DEMANGLE
    ld      ENV(g1,JB_PC), %g5      /* Set return PC. */
    ld      ENV(g1,JB_SP), %g1      /* Set saved SP on restore below. */
    PTR_DEMANGLE2 (%o7, %g5, %g4)
    PTR_DEMANGLE2 (%fp, %g1, %g4)
#else
    ld      ENV(g1,JB_PC), %o7      /* Set return PC. */
    ld      ENV(g1,JB_SP), %fp      /* Set saved SP on restore below. */
#endif
    sub     %fp, 64, %sp            /* Allocate a register frame. */
    st      %g3, RW_FP              /* Set saved FP on restore below. */
    retl
     restore %g2, 0, %o0            /* Restore values from above register frame. */

I've verified that in both cases the value of %o7 which is used by retl (it's essentially %o7 + 8) is correct, pointing to the address from where setjmp has been previously called. However, in the optimized case (built with -O2) something goes wrong with the register frame restore (which is executed in retl delay slot), and we jump to the correct address, but with an obviously broken value of 0x5 in %fp, which eventually leads to a SIGBUS once we start dereferencing memory with it. I'll need to do quite a bit of reading here to understand why the broken values end up on the register frame, so it may take a while.

#12 Updated by Jurij Smakov over 2 years ago

Discussion of this issue is ongoing in this thread on sparclinux mailing list: http://marc.info/?t=131806608400002&r=1&w=2.

#13 Updated by Jurij Smakov over 2 years ago

I think we figured it out. The problem arises in the cont_save_machine_stack() function, where the register windows are flushed using the 'flushw' assembler instruction, and the machine stack is then saved by memcpy'ing it from cont->machine_stack_src to cont->machine_stack. However, 'flushw' does not flush the current register window, so we end up copying incorrect memory contents, because the source address lies within the last stack frame. For a detailed analysis see my message:

http://article.gmane.org/gmane.linux.ports.sparc/15410

and David Miller's suggestions for fixing it:

http://article.gmane.org/gmane.linux.ports.sparc/15411

If you decide to follow the first suggestion (replacing 'flushw' with 'ta 0x03'), the attached patch implements it. I've just run a successful build with it applied on my SunBlade 1000 machine.

#14 Updated by Nobuyoshi Nakada over 2 years ago

  • Status changed from Feedback to Open

At r3285, defined(__FreeBSD__) was lost.
I have no idea if 'flushw' is preferable to 'ta 0x03' on FreeBSD.

#15 Updated by Nobuyoshi Nakada over 2 years ago

knu says that 'flushw' is correct for SparcV9, but 'ta 0x03' is not.
And your platform seems to be 32-bit, right?
Then why is defined(__sparc_v9__) || defined(__sparcv9) || defined(__arch64__) true?

Could you show the result from:
$ gcc -E -dM -xc /dev/null | grep -i -e sparc -e arch64

#16 Updated by Jurij Smakov over 2 years ago

My machine is UltraSparc III based, so it's a v9 and 64-bit. For historical reasons though Debian is using 64-bit kernel and 32-bit userspace:

jurij@debian:~$ gcc -E -dM -xc /dev/null | grep -i -e sparc -e arch64
#define sparc 1
#define __sparc__ 1
#define __sparc 1
#define __sparc_v9__ 1
jurij@debian:~$

For the purposes of the continuation code it's not appropriate to say that 'flushw' is correct for sparcv9, as 'flushw' has slightly different effect compared to 'ta 0x03' (at least, on sparc/linux). It appears that Ruby wants to save the entire stack, including the current stack frame, and relies on the contents of current register window being flushed before that. Well, 'flushw' is going to flush all windows except the current one, this behavior is described in Sparc Architecture Manual (version 9). On the other hand, 'ta 0x03' is a software trap, which flushes all register windows of the process invoking the trap (including the current one), and that's the behavior wanted here.

#17 Updated by Naohisa Goto over 2 years ago

I think a possible workaround to distinguish 32-bit on Debian Sparc (or on Sparc Linux) is to check SIZEOF_VOIDP (value of sizeof(void *) set by configure) in addition to sparc specific macros.

FYI, on Solaris 10, Sun compiler (Sun Studio, Oracle Solaris Studio) defines __sparcv9 only when generating 64-bit code for SPARC V9.
http://developers.sun.com/sunstudio/documentation/ss12u1/mr/READMEs/c++_faq.html#Vers6

GCC on Solaris 10 also does so.

$ gcc-4.4 -E -dM -xc /dev/null | grep -i -e sparc -e arch64
#define sparc 1
#define __sparc__ 1
#define __sparcv8 1
#define __sparc 1
$ gcc-4.4 -m64 -E -dM -xc /dev/null | grep -i -e sparc -e arch64
#define sparc 1
#define __sparc__ 1
#define __sparcv9 1
#define __sparc 1
#define __arch64__ 1
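
The selection idea proposed in this comment (SPARC macros plus pointer size) might look roughly like the following C sketch. It is not the code committed to include/ruby/defines.h: enum flush_kind, choose_flush and choose_flush_for_this_build are hypothetical names, and sizeof(void *) is used here as a stand-in for the configure-provided SIZEOF_VOIDP. The sketch only shows the selection logic and makes no claim that this selection is sufficient on every platform.

```c
#include <assert.h>

enum flush_kind { FLUSH_NONE, FLUSH_FLUSHW, FLUSH_TRAP };

/* Pure selection logic: which flush method the proposal would pick. */
enum flush_kind choose_flush(int is_sparc, int has_v9_macro, int pointer_bytes)
{
    assert(pointer_bytes > 0);
    if (!is_sparc)
        return FLUSH_NONE;          /* nothing to flush on other CPUs */
    if (has_v9_macro && pointer_bytes == 8)
        return FLUSH_FLUSHW;        /* 64-bit V9 code: "flushw" */
    return FLUSH_TRAP;              /* 32-bit userspace: "ta 0x03" */
}

/* Same decision for the current build, driven by predefined macros. */
enum flush_kind choose_flush_for_this_build(void)
{
#if defined(__sparc) || defined(__sparc__)
    int is_sparc = 1;
#else
    int is_sparc = 0;
#endif
#if defined(__sparc_v9__) || defined(__sparcv9) || defined(__arch64__)
    int has_v9_macro = 1;
#else
    int has_v9_macro = 0;
#endif
    /* Assumption: sizeof(void *) stands in for configure's SIZEOF_VOIDP. */
    return choose_flush(is_sparc, has_v9_macro, (int)sizeof(void *));
}
```

With the Debian setup described above (a V9 macro defined, 4-byte pointers), choose_flush(1, 1, 4) selects the software trap.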

#18 Updated by Nobuyoshi Nakada over 2 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r33492.
Lucas, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • include/ruby/defines.h (flush_register_windows): use software trap on Debian Sparc 32-bit userspace. [Bug #5244]

#19 Updated by Jurij Smakov over 2 years ago

Sorry, but this is not a proper fix. While it will fix the immediate problem for Debian, other systems will still be affected. Out of curiosity I tried building the latest svn snapshot (including this fix) on a freebsd/sparc64 system, and, sure enough, it still crashes there:

$ uname -a
FreeBSD free.wooyd.org 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 06:57:44 UTC 2011 root@araz.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC sparc64
$ export RUBYLIB=/usr/home/jurij/snapshot:/usr/home/jurij/snapshot/.ext/common:/usr/home/jurij/snapshot/.ext/sparc64-freebsd8.2:/usr/home/jurij/snapshot/lib
$ ./ruby -rcontinuation -e 'callcc { |c| c.call }'
-e:1: [BUG] Segmentation fault
ruby 2.0.0dev (2011-10-22 trunk 33503) [sparc64-freebsd8.2]

-- Control frame information -----------------------------------------------
c:0004 p:---- s:0009 b:0009 l:000008 d:000008 CFUNC :callcc
c:0003 p:0009 s:0006 b:0006 l:0007f8 d:000fb8 EVAL -e:1
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:0007f8 d:0007f8 TOP

-- Ruby level backtrace information ----------------------------------------
-e:1:in `<main>'
-e:1:in `callcc'

-- Other runtime information -----------------------------------------------

  • Loaded script: -e

  • Loaded features:

    0 enumerator.so
    1 /usr/home/jurij/snapshot/.ext/sparc64-freebsd8.2/enc/encdb.so
    2 /usr/home/jurij/snapshot/.ext/sparc64-freebsd8.2/enc/trans/transdb.so
    3 /usr/home/jurij/snapshot/lib/rubygems/defaults.rb
    4 /usr/home/jurij/snapshot/rbconfig.rb
    5 /usr/home/jurij/snapshot/lib/rubygems/deprecate.rb
    6 /usr/home/jurij/snapshot/lib/rubygems/exceptions.rb
    7 /usr/home/jurij/snapshot/lib/rubygems/custom_require.rb
    8 /usr/home/jurij/snapshot/lib/rubygems.rb
    9 /usr/home/jurij/snapshot/.ext/sparc64-freebsd8.2/continuation.so

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html

Abort trap (core dumped)
$ gdb
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "sparc64-marcel-freebsd".
(gdb) file ./ruby
Reading symbols from ./ruby...done.
(gdb) run -rcontinuation -e 'callcc { |c| c.call }'
Starting program: /usr/home/jurij/snapshot/ruby -rcontinuation -e 'callcc { |c| c.call }'

Program received signal SIGSEGV, Segmentation fault.
0x0000000040a41368 in __sparc_utrap_install () from /lib/libc.so.7
(gdb) bt
#0 0x0000000040a41368 in __sparc_utrap_install () from /lib/libc.so.7
#1 0x0000000040a4148c in __sparc_utrap_install () from /lib/libc.so.7
#2 0x0000000040a41730 in __sparc_utrap_install () from /lib/libc.so.7
#3 0x0000000040a40f6c in __sparc_utrap_install () from /lib/libc.so.7
#4 0x00000000002489b4 in cont_capture (stat=Error accessing memory address 0x882: Bad address.
) at cont.c:440
Previous frame inner to this frame (corrupt stack?)
(gdb) The program is running. Exit anyway? (y or n) y
$

Unfortunately, using 'ta 0x03' instead of 'flushw' is not an option on this system, as it causes an illegal instruction trap.

#20 Updated by Jurij Smakov over 2 years ago

Attached is a patch for this problem, fixing the issue by moving the window-flushing instruction into a separate function on sparc. It will still use flushw on any sparcv9-capable machine irrespective of the OS. I've verified that it fixes the crashes on both Debian/sparc/unstable and FreeBSD/sparc64/8.2. Thanks a lot to David Miller for explaining how to do it correctly.
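
Why a separate function helps can be shown with a toy C model of register windows. This is a deliberately simplified sketch and not SPARC hardware or Ruby code: struct toy_cpu and every toy_* and capture_* name are hypothetical, and the only behaviour modelled is the one described earlier in this thread, namely that 'flushw' writes out every window except the current one.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_FRAMES 8
#define TOY_STALE (-1)

/* Toy model (hypothetical): one register window per active call frame. */
struct toy_cpu {
    int window[TOY_MAX_FRAMES];   /* live register contents of each frame */
    int memory[TOY_MAX_FRAMES];   /* each frame's save area on the stack */
    int current;                  /* index of the current window */
};

void toy_init(struct toy_cpu *cpu)
{
    for (int i = 0; i < TOY_MAX_FRAMES; i++) {
        cpu->window[i] = 0;
        cpu->memory[i] = TOY_STALE;
    }
    cpu->current = 0;
}

/* Entering a function allocates a new current window ("save"). */
void toy_call(struct toy_cpu *cpu, int live_value)
{
    assert(cpu->current + 1 < TOY_MAX_FRAMES);
    assert(live_value != TOY_STALE);
    cpu->current++;
    cpu->window[cpu->current] = live_value;
}

/* Returning goes back to the caller's window ("restore"). */
void toy_return(struct toy_cpu *cpu)
{
    assert(cpu->current > 0);
    cpu->current--;
}

/* Modelled flushw: every window except the current one is written out. */
void toy_flushw(struct toy_cpu *cpu)
{
    for (int i = 0; i < cpu->current; i++)
        cpu->memory[i] = cpu->window[i];
}

/* True if a memcpy of the stack would see this frame's live contents. */
bool toy_frame_in_memory(const struct toy_cpu *cpu, int frame)
{
    return cpu->memory[frame] == cpu->window[frame];
}

/* flushw issued inline by the capturing frame itself. */
bool capture_with_inline_flush(void)
{
    struct toy_cpu cpu;
    toy_init(&cpu);
    toy_call(&cpu, 111);              /* capturing function */
    int capture_frame = cpu.current;
    toy_flushw(&cpu);                 /* capture_frame is current: skipped */
    return toy_frame_in_memory(&cpu, capture_frame);
}

/* flushw issued from a separate, non-inlined helper function. */
bool capture_with_helper_flush(void)
{
    struct toy_cpu cpu;
    toy_init(&cpu);
    toy_call(&cpu, 111);              /* capturing function */
    int capture_frame = cpu.current;
    toy_call(&cpu, 222);              /* helper gets its own window */
    toy_flushw(&cpu);                 /* capture_frame is no longer current */
    toy_return(&cpu);
    return toy_frame_in_memory(&cpu, capture_frame);
}
```

In this model the inline variant leaves the capturing frame's save area stale, while the helper variant does not, because the helper's own window is the one that 'flushw' skips.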

#21 Updated by Lucas Nussbaum over 2 years ago

Dear Ruby developers,

Could you follow up on this issue?
The fix that was committed is not correct, as explained in comment #19. A correct fix is in comment #20.

Also, could you backport this to the ruby_1_9_3 branch?

#22 Updated by Motohiro KOSAKI over 2 years ago

  • Status changed from Closed to Assigned

Goto-san, ping?

#23 Updated by Naohisa Goto over 2 years ago

Sorry for the delay. I'll try it on Solaris 10 with Sun cc, Fujitsu fcc and gcc, with 32-bit and 64-bit compiler options.

#24 Updated by Naohisa Goto over 2 years ago

On Solaris 10, with Sun Studio 11 cc and the 64-bit compile option -xarch=v9, a compile error occurs with the patch.

compiling sparc.c
"sparc.c", line 21: syntax error before or at: :
cc: acomp failed for sparc.c
make: *** [sparc.o] Error 2

It seems that the description ("flushw" : : : "%o7") is not portable.

#25 Updated by Jurij Smakov over 2 years ago

Does it work if you replace ("flushw" : : : "%o7") with just ("flushw")? If it does, then it just has to be protected by #ifdef __GNUC__, i.e. the body of the function should look something like this (untested):

void flush_sparc_register_windows()
{
    asm
#if defined(__sparcv9) || defined(__sparc_v9__)
# ifdef __GNUC__
    __volatile__ ("flushw" : : : "%o7")
# else
    ("flushw")
# endif /* __GNUC__ */
#else
    ("ta 0x03")
#endif
    ;
}

As long as the asm instructions remain in a separate function, it should still work.

#26 Updated by Naohisa Goto over 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r33757.
Lucas, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • include/ruby/defines.h (FLUSH_REGISTER_WINDOWS): move sparc asm code to a separate file sparc.c for preventing inlining optimization. Patched by Jurij Smakov. [Bug #5244]
  • sparc.c (rb_sparc_flush_register_windows): ditto.
  • configure.in: ditto.

#27 Updated by Naohisa Goto over 2 years ago

The patch is applied to trunk as r33757, r33758 (and r33760) with minor modifications.
Backport request to ruby 1.9.3 is submitted as #5636.
http://redmine.ruby-lang.org/issues/show/5636
