Project

General

Profile

Bug #14480

miniruby crashing when compiled with -O2 or -O1 on aarch64

Added by vo.x (Vit Ondruch) about 1 year ago. Updated about 1 year ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.5.0p0 (2017-12-25 revision 61468) [aarch64-linux]
[ruby-core:85597]

Description

Recently, it is not possible to build Ruby 2.5.0 on aarch64 on Fedora Rawhide, because miniruby fails during build:

... snip ...

./miniruby -I./lib -I. -I.ext/common  -n \
-e 'BEGIN{version=ARGV.shift;mis=ARGV.dup}' \
-e 'END{abort "UNICODE version mismatch: #{mis}" unless mis.empty?}' \
-e '(mis.delete(ARGF.path); ARGF.close) if /ONIG_UNICODE_VERSION_STRING +"#{Regexp.quote(version)}"/o' \
10.0.0 ./enc/unicode/10.0.0/casefold.h ./enc/unicode/10.0.0/name2ctype.h 
generating encdb.h
./miniruby -I./lib -I. -I.ext/common  ./tool/generic_erb.rb -c -o encdb.h ./template/encdb.h.tmpl ./enc enc
generating prelude.c
./miniruby -I./lib -I. -I.ext/common  ./tool/generic_erb.rb -I. -c -o prelude.c \
    ./template/prelude.c.tmpl ./prelude.rb ./gem_prelude.rb ./abrt_prelude.rb
*** stack smashing detected ***: <unknown> terminated
encdb.h updated

... snip ...

This might by Ruby or gcc issue. Not sure yet. However, there is already lengthy analysis available in Fedora's Bugzilla 1. Would be anybody able to help to resolve this issue?


Files

Dockerfile (573 Bytes) Dockerfile wanabe (_ wanabe), 02/21/2018 12:49 AM

Related issues

Related to Ruby trunk - Bug #5407: Cannot build ruby-1.9.3-rc1 with TDM-GCC 4.6.1 on Windows XP SP3Closed10/05/2011Actions
Related to Ruby trunk - Bug #9710: __builtin_setjmp/longjmp causes SEGV with mingwClosed04/07/2014Actions

History

#1

Updated by wanabe (_ wanabe) about 1 year ago

  • Related to Bug #5407: Cannot build ruby-1.9.3-rc1 with TDM-GCC 4.6.1 on Windows XP SP3 added

Updated by wanabe (_ wanabe) about 1 year ago

I confirmed it can be reproduced with docker + qemu-aarch64-static + binfmt.
Dockerfile is attached.
(Note that it requires a copy of your /usr/bin/qemu-aarch64-static in working directory)

I also confirmed the issue is avoidable with -fno-omit-frame-pointer workaround as commented on https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c19.
But I expect gcc experts will resolve this. See https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c26.

Updated by vo.x (Vit Ondruch) about 1 year ago

It seems they are getting further:

With -fomit-frame-pointer on *everything*, and hacking out the call to rb_thread_create_timer_thread in Init_Thread (to keep this single-threaded for simplicity), the bug appears to be a problem with setjmp/longjmp.

x29 (the frame pointer) is corrupted deep within the 15th call to vm_exec_core within the 10th call to vm_call_opt_call.

The write of the bogus value to x29 occurs here:

0x000000000057d9bc in rb_ec_tag_jump (ec=0x46b050 <all_iter_i>, st=RUBY_TAG_NONE) at vm.i:10459
10459     __builtin_longjmp(((ec->tag->buf)), (1));
2: /x $x29 = 0x7fffff8080

(gdb) disassemble 
Dump of assembler code for function rb_ec_tag_jump:
   0x000000000057d988 <+0>:     stp     x29, x30, [sp,#-32]!
   0x000000000057d98c <+4>:     mov     x29, sp
   0x000000000057d990 <+8>:     str     x0, [x29,#24]
   0x000000000057d994 <+12>:    str     w1, [x29,#20]
   0x000000000057d998 <+16>:    ldr     x0, [x29,#24]
   0x000000000057d99c <+20>:    ldr     x0, [x0,#24]
   0x000000000057d9a0 <+24>:    ldr     w1, [x29,#20]
   0x000000000057d9a4 <+28>:    str     w1, [x0,#336]
   0x000000000057d9a8 <+32>:    ldr     x0, [x29,#24]
   0x000000000057d9ac <+36>:    ldr     x0, [x0,#24]
   0x000000000057d9b0 <+40>:    add     x0, x0, #0x10
   0x000000000057d9b4 <+44>:    ldr     x1, [x0,#8]
   0x000000000057d9b8 <+48>:    ldr     x29, [x0]
=> 0x000000000057d9bc <+52>:    ldr     x0, [x0,#16]
   0x000000000057d9c0 <+56>:    mov     sp, x0
   0x000000000057d9c4 <+60>:    br      x1

when called from rb_iterate0, where the bogus x29 value has been fetched from the jmp_buf at +48.

A watchpoint on that memory shows it being set to the bogus value here in rb_iterate0:

   0x0000000000595e50 <+96>:    str     x0, [sp,#264]
   0x0000000000595e54 <+100>:   ldr     x0, [sp,#232]
   0x0000000000595e58 <+104>:   ldr     x0, [x0,#24]
   0x0000000000595e5c <+108>:   str     x0, [sp,#592]
   0x0000000000595e60 <+112>:   add     x0, sp, #0x108
   0x0000000000595e64 <+116>:   add     x0, x0, #0x10
   0x0000000000595e68 <+120>:   add     x1, sp, #0x260
   0x0000000000595e6c <+124>:   str     x1, [x0]
=> 0x0000000000595e70 <+128>:   adrp    x1, 0x595000 <raise_method_missing+544>

28300       struct rb_vm_tag _tag;
28301       _tag.state = RUBY_TAG_NONE;
28302       _tag.tag = ((VALUE)RUBY_Qundef);
28303       _tag.prev = _ec->tag;
28304       ;
28305       state = (__builtin_setjmp((_tag.buf)) ? rb_ec_tag_state((_ec))
28306                                             : ((void)(_ec->tag = &_tag), 0));

If I'm reading this right, the __builtin_longjmp rb_ec_tag_jump (in rb_iterate0) is attempting to restore x29 from the jmp_buf, but the __builtin_setjmp in rb_iterate0 isn't actually saving x29 there, and hence x29 gets corrupted at the longjmp, deep in the callstack, leading to an eventual crash when vm_call_opt_call eventually tries to use x29.

Updated by vo.x (Vit Ondruch) about 1 year ago

This appears to be longstanding GCC issue on aarch64, which was recently exposed by change in GCC defaults 1. However, the GCC upstream is questioning usage of __builtin_setjmp/__builtin_longjmp. Can somebody provide answer to this question 2:

I notice that in our builds, RUBY_SETJMP is using "__builtin_setjmp", rather
than "setjmp".

This seems to come from the "configure" check; the rpm build log has:
"checking for setjmp type... __builtin_setjmp"

Is this the default for upstream Ruby?

A GCC upstream developer notes:

To me any use of __builtin_setjmp/__builtin_longjmp is almost always incorrect.
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521#c3)

#5

Updated by wanabe (_ wanabe) about 1 year ago

  • Related to Bug #9710: __builtin_setjmp/longjmp causes SEGV with mingw added

Updated by wanabe (_ wanabe) about 1 year ago

tl;dr:

  • rb_ec_tag_jump() causes SEGV on aarch64 when it is built with -fomit-frame-pointer optimization that is the default of -O1 of gcc-8.0.1-0.13.
    • It is confirmed that -fno-omit-frame-pointer optflags can avoid the issue.
    • Also --with-setjmp-type=setjmp will do, but no one has yet confirmed.
  • Andrew Pinski, one of gcc maintainer, says "To me any use of __builtin_setjmp/__builtin_longjmp is almost always incorrect."

vo.x (Vit Ondruch) wrote:

This appears to be longstanding GCC issue on aarch64, which was recently exposed by change in GCC defaults [1]. However, the GCC upstream is questioning usage of __builtin_setjmp/__builtin_longjmp. Can somebody provide answer to this question [2]:

I notice that in our builds, RUBY_SETJMP is using "__builtin_setjmp", rather
than "setjmp".

This seems to come from the "configure" check; the rpm build log has:
"checking for setjmp type... __builtin_setjmp"

Is this the default for upstream Ruby?

Yes, it is, as far as I know since r15871 ruby-core:16086.

Nakada-san, how do you think about this issue?

Updated by nobu (Nobuyoshi Nakada) about 1 year ago

It sounds good to fix by --with-setjmp-type, if it works.

Updated by vo.x (Vit Ondruch) about 1 year ago

I tried --with-setjmp-type=setjmp and the build passed. But I have no idea what is the performance impact.

Nevertheless, there are two more comments from GCC people. The first is 1:

Jeff Law 2018-02-26 17:01:13 CET

In general I would not expect applications to make direct use of the builtin setjmp/longjmp routines within GCC. They should instead be using the OS provided setjmp/longjmp.

The builtin setjmp/longjmp are primarily for the use of the exception handling system on a small number of targets that do not have sufficient unwinding mechansisms.

Based on this, I'd say you should reconsider change of the defaults.

On the other hand, GCC people already reverted the change responsible for the issue 2:

Dave Malcolm 2018-02-26 17:30:10 CET

The bug was worked around upstream today by commit r257984 (by changing the default on aarch64 back to -fno-omit-frame-pointer i.e. keep the frame pointer):
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=257984

But they will continue to discuss what is the right solution here ...

Updated by nobu (Nobuyoshi Nakada) about 1 year ago

vo.x (Vit Ondruch) wrote:

I tried --with-setjmp-type=setjmp and the build passed. But I have no idea what is the performance impact.

Thank you for checking.

Nevertheless, there are two more comments from GCC people. The first is [1]:

Jeff Law 2018-02-26 17:01:13 CET

In general I would not expect applications to make direct use of the builtin setjmp/longjmp routines within GCC. They should instead be using the OS provided setjmp/longjmp.

The builtin setjmp/longjmp are primarily for the use of the exception handling system on a small number of targets that do not have sufficient unwinding mechansisms.

Yes, they are used for the exception handling in Ruby :)

Updated by wanabe (_ wanabe) about 1 year ago

  • Status changed from Open to Third Party's Issue

GCC upstream changed aarch64 default behaviour on its revision 257984.
https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/common/config/aarch64/aarch64-common.c?r1=257984&r2=257983&pathrev=257984

Fundamental frame pointer corruption may be fixed some day.
https://gcc.gnu.org/ml/gcc-patches/2018-03/msg00668.html

So ruby don't have to implement a workaround about the issue.
Even if someone encounters this, updating gcc latest or -fno-omit-frame-pointer or --with-setjmp-type=setjmp may help.

Thanks.

Updated by vo.x (Vit Ondruch) about 1 year ago

Just forwarding one remark from RH Bugzilla 0:

--- Comment #44 from Dave Malcolm <dmalcolm@redhat.com> ---
> should we keep the "setjmp" for AArch64? I am asking,
> since it seems that Ruby upstream is not going to change anything [1], so we
> should somehow officially resolve/close the issue.
>
>
>
> [1] https://bugs.ruby-lang.org/issues/14480#note-10

Note that the workaround in the gcc rpm is papering over the issue, albeit a
long-standing one: that __builtin_setjmp on aarch64 doesn't properly save the
frame pointer, leading to clobbering of the frame pointer when
__builtin_longjmp is used, hence leading to issues when used in conjunction
with -fomit-frame-pointer.

If upstream Ruby want to use __builtin_setjmp as a performance optimization,
that's up to them, I guess, but it's relying on none of the code ever using or
interacting with -fomit-frame-pointer (until PR target/84521 is properly
fixed).

As for Ruby on AArch64 Fedora, I'll keep using "setjmp" unless somebody has some convincing arguments :)

Updated by wanabe (_ wanabe) about 1 year ago

  • Status changed from Third Party's Issue to Open

Sorry for the confusion, I am not one of "upstream".
So I revert the issue status and there are no upstream opinion.
I am sorry again.

Also available in: Atom PDF