Project

General

Profile

Actions

Bug #14480

closed

miniruby crashing when compiled with -O2 or -O1 on aarch64

Added by vo.x (Vit Ondruch) almost 7 years ago. Updated 7 months ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.5.0p0 (2017-12-25 revision 61468) [aarch64-linux]
[ruby-core:85597]
Tags:

Description

Recently, it is not possible to build Ruby 2.5.0 on aarch64 on Fedora Rawhide, because miniruby fails during build:

... snip ...

./miniruby -I./lib -I. -I.ext/common  -n \
-e 'BEGIN{version=ARGV.shift;mis=ARGV.dup}' \
-e 'END{abort "UNICODE version mismatch: #{mis}" unless mis.empty?}' \
-e '(mis.delete(ARGF.path); ARGF.close) if /ONIG_UNICODE_VERSION_STRING +"#{Regexp.quote(version)}"/o' \
10.0.0 ./enc/unicode/10.0.0/casefold.h ./enc/unicode/10.0.0/name2ctype.h 
generating encdb.h
./miniruby -I./lib -I. -I.ext/common  ./tool/generic_erb.rb -c -o encdb.h ./template/encdb.h.tmpl ./enc enc
generating prelude.c
./miniruby -I./lib -I. -I.ext/common  ./tool/generic_erb.rb -I. -c -o prelude.c \
	./template/prelude.c.tmpl ./prelude.rb ./gem_prelude.rb ./abrt_prelude.rb
*** stack smashing detected ***: <unknown> terminated
encdb.h updated

... snip ...

This might by Ruby or gcc issue. Not sure yet. However, there is already lengthy analysis available in Fedora's Bugzilla 1. Would be anybody able to help to resolve this issue?


Files

Dockerfile (573 Bytes) Dockerfile wanabe (_ wanabe), 02/21/2018 12:49 AM

Related issues 3 (0 open3 closed)

Related to Ruby master - Bug #5407: Cannot build ruby-1.9.3-rc1 with TDM-GCC 4.6.1 on Windows XP SP3Closednobu (Nobuyoshi Nakada)10/05/2011Actions
Related to Ruby master - Bug #9710: __builtin_setjmp/longjmp causes SEGV with mingwClosed04/07/2014Actions
Related to Ruby master - Bug #17511: Segmentation fault when compiled with -O2 or higher on ARM AndroidClosedActions
Actions #1

Updated by wanabe (_ wanabe) almost 7 years ago

  • Related to Bug #5407: Cannot build ruby-1.9.3-rc1 with TDM-GCC 4.6.1 on Windows XP SP3 added

Updated by wanabe (_ wanabe) almost 7 years ago

I confirmed it can be reproduced with docker + qemu-aarch64-static + binfmt.
Dockerfile is attached.
(Note that it requires a copy of your /usr/bin/qemu-aarch64-static in working directory)

I also confirmed the issue is avoidable with -fno-omit-frame-pointer workaround as commented on https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c19.
But I expect gcc experts will resolve this. See https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c26.

Updated by vo.x (Vit Ondruch) almost 7 years ago

It seems they are getting further:

With -fomit-frame-pointer on *everything*, and hacking out the call to rb_thread_create_timer_thread in Init_Thread (to keep this single-threaded for simplicity), the bug appears to be a problem with setjmp/longjmp.

x29 (the frame pointer) is corrupted deep within the 15th call to vm_exec_core within the 10th call to vm_call_opt_call.

The write of the bogus value to x29 occurs here:

0x000000000057d9bc in rb_ec_tag_jump (ec=0x46b050 <all_iter_i>, st=RUBY_TAG_NONE) at vm.i:10459
10459     __builtin_longjmp(((ec->tag->buf)), (1));
2: /x $x29 = 0x7fffff8080

(gdb) disassemble 
Dump of assembler code for function rb_ec_tag_jump:
   0x000000000057d988 <+0>:     stp     x29, x30, [sp,#-32]!
   0x000000000057d98c <+4>:     mov     x29, sp
   0x000000000057d990 <+8>:     str     x0, [x29,#24]
   0x000000000057d994 <+12>:    str     w1, [x29,#20]
   0x000000000057d998 <+16>:    ldr     x0, [x29,#24]
   0x000000000057d99c <+20>:    ldr     x0, [x0,#24]
   0x000000000057d9a0 <+24>:    ldr     w1, [x29,#20]
   0x000000000057d9a4 <+28>:    str     w1, [x0,#336]
   0x000000000057d9a8 <+32>:    ldr     x0, [x29,#24]
   0x000000000057d9ac <+36>:    ldr     x0, [x0,#24]
   0x000000000057d9b0 <+40>:    add     x0, x0, #0x10
   0x000000000057d9b4 <+44>:    ldr     x1, [x0,#8]
   0x000000000057d9b8 <+48>:    ldr     x29, [x0]
=> 0x000000000057d9bc <+52>:    ldr     x0, [x0,#16]
   0x000000000057d9c0 <+56>:    mov     sp, x0
   0x000000000057d9c4 <+60>:    br      x1

when called from rb_iterate0, where the bogus x29 value has been fetched from the jmp_buf at +48.
 
A watchpoint on that memory shows it being set to the bogus value here in rb_iterate0:

   0x0000000000595e50 <+96>:    str     x0, [sp,#264]
   0x0000000000595e54 <+100>:   ldr     x0, [sp,#232]
   0x0000000000595e58 <+104>:   ldr     x0, [x0,#24]
   0x0000000000595e5c <+108>:   str     x0, [sp,#592]
   0x0000000000595e60 <+112>:   add     x0, sp, #0x108
   0x0000000000595e64 <+116>:   add     x0, x0, #0x10
   0x0000000000595e68 <+120>:   add     x1, sp, #0x260
   0x0000000000595e6c <+124>:   str     x1, [x0]
=> 0x0000000000595e70 <+128>:   adrp    x1, 0x595000 <raise_method_missing+544>

28300       struct rb_vm_tag _tag;
28301       _tag.state = RUBY_TAG_NONE;
28302       _tag.tag = ((VALUE)RUBY_Qundef);
28303       _tag.prev = _ec->tag;
28304       ;
28305       state = (__builtin_setjmp((_tag.buf)) ? rb_ec_tag_state((_ec))
28306                                             : ((void)(_ec->tag = &_tag), 0));

If I'm reading this right, the __builtin_longjmp rb_ec_tag_jump (in rb_iterate0) is attempting to restore x29 from the jmp_buf, but the __builtin_setjmp in rb_iterate0 isn't actually saving x29 there, and hence x29 gets corrupted at the longjmp, deep in the callstack, leading to an eventual crash when vm_call_opt_call eventually tries to use x29.

Updated by vo.x (Vit Ondruch) almost 7 years ago

This appears to be longstanding GCC issue on aarch64, which was recently exposed by change in GCC defaults 1. However, the GCC upstream is questioning usage of __builtin_setjmp/__builtin_longjmp. Can somebody provide answer to this question 2:

I notice that in our builds, RUBY_SETJMP is using "__builtin_setjmp", rather
than "setjmp".

This seems to come from the "configure" check; the rpm build log has:
"checking for setjmp type... __builtin_setjmp"

Is this the default for upstream Ruby?

A GCC upstream developer notes:

To me any use of __builtin_setjmp/__builtin_longjmp is almost always incorrect.
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521#c3)

Actions #5

Updated by wanabe (_ wanabe) almost 7 years ago

  • Related to Bug #9710: __builtin_setjmp/longjmp causes SEGV with mingw added

Updated by wanabe (_ wanabe) almost 7 years ago

tl;dr:

  • rb_ec_tag_jump() causes SEGV on aarch64 when it is built with -fomit-frame-pointer optimization that is the default of -O1 of gcc-8.0.1-0.13.
    • It is confirmed that -fno-omit-frame-pointer optflags can avoid the issue.
    • Also --with-setjmp-type=setjmp will do, but no one has yet confirmed.
  • Andrew Pinski, one of gcc maintainer, says "To me any use of __builtin_setjmp/__builtin_longjmp is almost always incorrect."

vo.x (Vit Ondruch) wrote:

This appears to be longstanding GCC issue on aarch64, which was recently exposed by change in GCC defaults [1]. However, the GCC upstream is questioning usage of __builtin_setjmp/__builtin_longjmp. Can somebody provide answer to this question [2]:

I notice that in our builds, RUBY_SETJMP is using "__builtin_setjmp", rather
than "setjmp".

This seems to come from the "configure" check; the rpm build log has:
"checking for setjmp type... __builtin_setjmp"

Is this the default for upstream Ruby?

Yes, it is, as far as I know since r15871 ruby-core:16086.

Nakada-san, how do you think about this issue?

Updated by nobu (Nobuyoshi Nakada) almost 7 years ago

It sounds good to fix by --with-setjmp-type, if it works.

Updated by vo.x (Vit Ondruch) almost 7 years ago

I tried --with-setjmp-type=setjmp and the build passed. But I have no idea what is the performance impact.

Nevertheless, there are two more comments from GCC people. The first is 1:

Jeff Law 2018-02-26 17:01:13 CET

In general I would not expect applications to make direct use of the builtin setjmp/longjmp routines within GCC. They should instead be using the OS provided setjmp/longjmp.

The builtin setjmp/longjmp are primarily for the use of the exception handling system on a small number of targets that do not have sufficient unwinding mechansisms.

Based on this, I'd say you should reconsider change of the defaults.

On the other hand, GCC people already reverted the change responsible for the issue 2:

Dave Malcolm 2018-02-26 17:30:10 CET

The bug was worked around upstream today by commit r257984 (by changing the default on aarch64 back to -fno-omit-frame-pointer i.e. keep the frame pointer):
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=257984

But they will continue to discuss what is the right solution here ...

Updated by nobu (Nobuyoshi Nakada) almost 7 years ago

vo.x (Vit Ondruch) wrote:

I tried --with-setjmp-type=setjmp and the build passed. But I have no idea what is the performance impact.

Thank you for checking.

Nevertheless, there are two more comments from GCC people. The first is [1]:

Jeff Law 2018-02-26 17:01:13 CET

In general I would not expect applications to make direct use of the builtin setjmp/longjmp routines within GCC. They should instead be using the OS provided setjmp/longjmp.

The builtin setjmp/longjmp are primarily for the use of the exception handling system on a small number of targets that do not have sufficient unwinding mechansisms.

Yes, they are used for the exception handling in Ruby :)

Updated by wanabe (_ wanabe) over 6 years ago

  • Status changed from Open to Third Party's Issue

GCC upstream changed aarch64 default behaviour on its revision 257984.
https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/common/config/aarch64/aarch64-common.c?r1=257984&r2=257983&pathrev=257984

Fundamental frame pointer corruption may be fixed some day.
https://gcc.gnu.org/ml/gcc-patches/2018-03/msg00668.html

So ruby don't have to implement a workaround about the issue.
Even if someone encounters this, updating gcc latest or -fno-omit-frame-pointer or --with-setjmp-type=setjmp may help.

Thanks.

Updated by vo.x (Vit Ondruch) over 6 years ago

Just forwarding one remark from RH Bugzilla 0:

--- Comment #44 from Dave Malcolm <dmalcolm@redhat.com> ---
> should we keep the "setjmp" for AArch64? I am asking,
> since it seems that Ruby upstream is not going to change anything [1], so we
> should somehow officially resolve/close the issue.
>
>
>
> [1] https://bugs.ruby-lang.org/issues/14480#note-10

Note that the workaround in the gcc rpm is papering over the issue, albeit a
long-standing one: that __builtin_setjmp on aarch64 doesn't properly save the
frame pointer, leading to clobbering of the frame pointer when
__builtin_longjmp is used, hence leading to issues when used in conjunction
with -fomit-frame-pointer.

If upstream Ruby want to use __builtin_setjmp as a performance optimization,
that's up to them, I guess, but it's relying on none of the code ever using or
interacting with -fomit-frame-pointer (until PR target/84521 is properly
fixed).

As for Ruby on AArch64 Fedora, I'll keep using "setjmp" unless somebody has some convincing arguments :)

Updated by wanabe (_ wanabe) over 6 years ago

  • Status changed from Third Party's Issue to Open

Sorry for the confusion, I am not one of "upstream".
So I revert the issue status and there are no upstream opinion.
I am sorry again.

Actions #13

Updated by mame (Yusuke Endoh) almost 4 years ago

  • Related to Bug #17511: Segmentation fault when compiled with -O2 or higher on ARM Android added

Updated by mame (Yusuke Endoh) almost 4 years ago

Looks like the status of __builtin_setjmp is very complicated:

The documentation of GCC says:

GCC provides the built-in functions __builtin_setjmp and __builtin_longjmp which are similar to, but not interchangeable with, the C library functions setjmp and longjmp. The built-in versions are used internally by GCC’s libraries to implement exception handling on some targets. You should use the standard C library functions declared in <setjmp.h> in user code instead of the builtins.

An important caveat is that GCC arranges to save and restore only those registers known to the specific architecture variant being compiled for. This can make __builtin_setjmp and __builtin_longjmp more efficient than their library counterparts in some cases, but it can also cause incorrect and mysterious behavior when mixing with code that uses the full register set.

So, in general, it looks good to avoid using __builtin_setjmp/__builtin_longjmp in Ruby.

It was introduced for a performance reason on Mac OS X in 2008.

http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/16023?15980-16545+split-mode-vertical

I'm unsure if this performance improvement on macOS is still significant. Anyone is interested in measuring the peformance?

Actually, using __builtin_setjmp has been very troublesome. It is explicitly disabled on some architectures/conditions:

  • AIX: d612c44dad8a5d4c373b0c4bc08a99e3dc06498e
  • ppc64* LInux: 67fcbf9328614379fee28bed83ae6749ce7a3dda
  • when -fcf-protection is used: #15307
  • GCC 4 or before: #10272

And looks like we need to disable it on Android too. #17511

But interestingly, it is explicitly enabled on mingw64 because it was apparently needed to fix an issue: #15348.

Updated by xtkoba (Tee KOBAYASHI) almost 4 years ago

The main issue might be resolved by the following commit to GCC:
https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=25403c416e5f12d681d1fc45a8789d19ab40297f
(see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521#c29)

IMHO, the use of __builtin_{set,long}jmp should be opt-in. They are probably safe to use in recent GCC versions, but may not in Clang/LLVM (or other compilers than GCC).

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

I submitted a pull request to disable __builtin_setjmp for Linux aarch64: https://github.com/ruby/ruby/pull/8272

Updated by vo.x (Vit Ondruch) about 1 year ago

Please note that on Fedora this was not issue for past ~5 years. This has beend workarounded on GCC side:

https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c42

And it seems that the GCC folks position is that this should behave consistently everywhere. So I am not sure why the PR above extends the special casing of some platforms instead of removing it.

Actions #18

Updated by jeremyevans (Jeremy Evans) 7 months ago

  • Status changed from Open to Closed

Applied in changeset git|029d92b8988d26955d0622f0cbb8ef3213200749.


Disable __builtin_setjmp usage on linux-aarch64

It is questionable whether __builtin_setjmp should default to yes
at all, but since it appears to still have problems on this platform,
it seems safest to disable it.

Fixes [Bug #14480]

Updated by vo.x (Vit Ondruch) 7 months ago

I don't understand why this change was applied. This should be either enabled everywhere or disabled everywhere, not enabled on some random platforms. Please note that this should not be issue on aarch64 for a while.

Or was the issue exhibited somewhere recently? The commit does not elaborate about it.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0