Feature #16837: Can we make Ruby 3.0 as fast as Ruby 2.7 with the new assertions?

Added by k0kubun (Takashi Kokubun) over 5 years ago. Updated over 5 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:98174]

Description

Problem

How can we make Ruby 3.0 as fast as (or faster than) Ruby 2.7?

Background

Possible approaches

I have no strong preference yet. Here are some random ideas:

  • Optimize the assertion code somehow
  • Enable the new assertions only on CIs, at least ones in hot spots
    • Not sure which places have large impact on Optcarrot yet
  • Make some other not-so-important assertions CI-only to offset the impact from new ones
  • Provide .so for an assertion-enabled mode? (ko1's idea)

I hope people will add more ideas in comments on this ticket.


Related issues 1 (0 open, 1 closed)

Related to Ruby - Bug #16840: Decrease in Hash#[]= performance with object keys (Closed)

Updated by k0kubun (Takashi Kokubun) over 5 years ago Actions #1

  • Tracker changed from Bug to Feature
  • Backport deleted (2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN)

Updated by k0kubun (Takashi Kokubun) over 5 years ago Actions #2

  • Description updated (diff)

Updated by k0kubun (Takashi Kokubun) over 5 years ago Actions #3

  • Description updated (diff)

Updated by shyouhei (Shyouhei Urabe) over 5 years ago Actions #4 [ruby-core:98182]

I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C practice since long before Ruby's birth.

Updated by shyouhei (Shyouhei Urabe) over 5 years ago Actions #5 [ruby-core:98183]

Some analysis of the slowdown.

Looking at the generated binary and perf output, the slowdown is because some functions are no longer inlined. This might depend on the compiler, but for me rb_array_len() is one such victim:

zsh % gdb -batch -ex 'file miniruby' -ex 'disassemble rb_array_len'
Dump of assembler code for function rb_array_len:
   0x0000000000295540 <+0>:     push   %rbx
   0x0000000000295541 <+1>:     mov    %rdi,%rbx
   0x0000000000295544 <+4>:     test   $0x7,%bl
   0x0000000000295547 <+7>:     jne    0x2955be <rb_array_len+126>
   0x0000000000295549 <+9>:     mov    %rbx,%rax
   0x000000000029554c <+12>:    and    $0xfffffffffffffff7,%rax
   0x0000000000295550 <+16>:    je     0x2955be <rb_array_len+126>
   0x0000000000295552 <+18>:    mov    (%rbx),%rax
   0x0000000000295555 <+21>:    mov    %eax,%edx
   0x0000000000295557 <+23>:    and    $0x1f,%edx
   0x000000000029555a <+26>:    mov    $0x7,%ecx
   0x000000000029555f <+31>:    cmp    $0x7,%edx
   0x0000000000295562 <+34>:    jne    0x295585 <rb_array_len+69>
   0x0000000000295564 <+36>:    test   $0x2000,%eax                 ;; <- This is `RB_FL_ANY_RAW(a, RARRAY_EMBED_FLAG)`
   0x0000000000295569 <+41>:    jne    0x295571 <rb_array_len+49>
   0x000000000029556b <+43>:    mov    0x10(%rbx),%rax              ;; <-
   0x000000000029556f <+47>:    pop    %rbx                         ;; <- This is `return RARRAY(a)->as.heap.len;`
   0x0000000000295570 <+48>:    retq                                ;; <-
   0x0000000000295571 <+49>:    cmp    $0x7,%ecx
   0x0000000000295574 <+52>:    jne    0x2955a2 <rb_array_len+98>
   0x0000000000295576 <+54>:    test   $0x2000,%eax
   0x000000000029557b <+59>:    je     0x2955ea <rb_array_len+170>
   0x000000000029557d <+61>:    shr    $0xf,%eax                    ;; <-
   0x0000000000295580 <+64>:    and    $0x3,%eax                    ;; <- This is `return RARRAY_EMBED_LEN(a);`
   0x0000000000295583 <+67>:    pop    %rbx                         ;; <-
   0x0000000000295584 <+68>:    retq                                ;; <-
   0x0000000000295585 <+69>:    mov    %rbx,%rdi
   0x0000000000295588 <+72>:    mov    $0x7,%esi
   0x000000000029558d <+77>:    callq  0xcaea2 <rb_check_type>
   0x0000000000295592 <+82>:    mov    (%rbx),%rax
   0x0000000000295595 <+85>:    mov    %eax,%ecx
   0x0000000000295597 <+87>:    and    $0x1f,%ecx
   0x000000000029559a <+90>:    cmp    $0x1b,%rcx
   0x000000000029559e <+94>:    jne    0x295564 <rb_array_len+36>
   0x00000000002955a0 <+96>:    jmp    0x2955cb <rb_array_len+139>
   0x00000000002955a2 <+98>:    mov    %rbx,%rdi
   0x00000000002955a5 <+101>:   mov    $0x7,%esi
   0x00000000002955aa <+106>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955af <+111>:   mov    (%rbx),%rax
   0x00000000002955b2 <+114>:   mov    %eax,%ecx
   0x00000000002955b4 <+116>:   and    $0x1f,%ecx
   0x00000000002955b7 <+119>:   cmp    $0x1b,%ecx
   0x00000000002955ba <+122>:   jne    0x295576 <rb_array_len+54>
   0x00000000002955bc <+124>:   jmp    0x2955cb <rb_array_len+139>
   0x00000000002955be <+126>:   mov    %rbx,%rdi
   0x00000000002955c1 <+129>:   mov    $0x7,%esi
   0x00000000002955c6 <+134>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955cb <+139>:   lea    0x142fe(%rip),%rdi        # 0x2a98d0
   0x00000000002955d2 <+146>:   lea    0x1432f(%rip),%rdx        # 0x2a9908
   0x00000000002955d9 <+153>:   lea    0x14337(%rip),%rcx        # 0x2a9917
   0x00000000002955e0 <+160>:   mov    $0xea,%esi
   0x00000000002955e5 <+165>:   callq  0xcad86 <rb_assert_failure>
   0x00000000002955ea <+170>:   lea    0x14338(%rip),%rdi        # 0x2a9929
   0x00000000002955f1 <+177>:   lea    0x1436d(%rip),%rdx        # 0x2a9965
   0x00000000002955f8 <+184>:   lea    0x14377(%rip),%rcx        # 0x2a9976
   0x00000000002955ff <+191>:   mov    $0x79,%esi
   0x0000000000295604 <+196>:   callq  0xcad86 <rb_assert_failure>
End of assembler dump.

Here, assertions practically never fail. This means the jumps are predicted 100% of the time (almost a no-op), so they don't slow things down. The problem is the unreachable branches. If you can read the assembly, you can see that almost 2/3 of the above function is never reached. Those branches blow up the generated binary significantly. rb_array_len is thus now considered too big to be inlined, by my compiler at least.

An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything. But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything.
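For readers unfamiliar with the attribute, here is a minimal sketch of what forcing inlining looks like with GCC or clang. The names FORCE_INLINE, toy_array and toy_array_len are hypothetical and are not taken from Ruby's headers; the fallback branch shows why the result depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macro: request forced inlining where the
 * attribute is available, fall back to a plain hint elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
# define FORCE_INLINE static inline __attribute__((__always_inline__))
#else
# define FORCE_INLINE static inline
#endif

/* Toy stand-in for an array object; not Ruby's RArray layout. */
struct toy_array {
    size_t len;
    const int *ptr;
};

/* The attribute asks the compiler to inline this at every call site,
 * even when the assertion failure path makes the body larger. */
FORCE_INLINE size_t
toy_array_len(const struct toy_array *ary)
{
    assert(ary != NULL);
    return ary->len;
}
```

On compilers that take the fallback branch, the function is only a candidate for inlining, which is the portability concern raised above.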

Updated by shyouhei (Shyouhei Urabe) over 5 years ago Actions #6 [ruby-core:98184]

If you recompile everything using ./configure cppflags=-DNDEBUG, those assertions are eliminated, which lets compilers inline rb_array_len again.
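As an illustration of the standard C mechanism, here is a small sketch of an inline accessor guarded by assert(). The type toy_vec and the function toy_vec_len are hypothetical and only loosely mimic the embedded/heap split seen in the disassembly; they are not Ruby's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object with an "embedded" and a "heap" length. */
struct toy_vec {
    size_t embed_len;
    size_t heap_len;
    int embedded;      /* 1 if the embedded length is in use, else 0 */
};

static inline size_t
toy_vec_len(const struct toy_vec *v)
{
    /* Each assert() adds a compare, a branch, and a call to the failure
     * handler. When NDEBUG is defined, assert() expands to ((void)0)
     * and only the return statement below remains. */
    assert(v != NULL);
    assert(v->embedded == 0 || v->embedded == 1);
    return v->embedded ? v->embed_len : v->heap_len;
}
```

Compiling the same file once with and once without -DNDEBUG and comparing the disassembly of a caller is one way to see the difference in size.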

Updated by shevegen (Robert A. Heiler) over 5 years ago Actions #7 [ruby-core:98185]

I have a question concerning one point mentioned above.

k0kubun wrote:

Provide .so for an assertion-enabled mode? (ko1's idea)

Could someone briefly explain the general idea behind this? I assume for a .so
file the ruby user would have to require/load that file, but what may be the
perceived benefits/disadvantages for doing so?

Updated by k0kubun (Takashi Kokubun) over 5 years ago Actions #8 [ruby-core:98194]

I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C practice since long before Ruby's birth.

Got it. I'll consider using -DNDEBUG in benchmark servers at least. Also maybe it's worth noting it in NEWS for those who package Ruby for performance-sensitive usages?

An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything. But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything.

Agreed. While it's not a good idea to always inline everything, some functions may be worth considering.

I assume for a .so file the ruby user would have to require/load that file

His idea was to install the .so file to Ruby prefix by default and add a --debug-xxx option to load it.

Updated by k0kubun (Takashi Kokubun) over 5 years ago Actions #9

  • Related to Bug #16840: Decrease in Hash#[]= performance with object keys added

Updated by nobu (Nobuyoshi Nakada) over 5 years ago Actions #10 [ruby-core:98212]

It's not only assertions; some optimizations can no longer be applied.

For instance, rb_str_new_cstr was defined as follows in 2.7:

#define rb_str_new_cstr(str) RB_GNUC_EXTENSION_BLOCK(	\
    (__builtin_constant_p(str)) ?		\
	rb_str_new_static((str), (long)strlen(str)) : \
	rb_str_new_cstr(str)			\
)

and rb_str_new_cstr("...") has been expected to be compiled as rb_str_new_static("...", 3).

Below is the master version.

static inline VALUE
ruby3_str_new_cstr(const char *str)
{
    if /* constexpr */ (! RUBY3_CONSTANT_P(str)) {
        return rb_str_new_cstr(str);
    }
    else {
        long len = ruby3_strlen(str);
        return rb_str_new_static(str, len);
    }
}

As str is an argument variable and RUBY3_CONSTANT_P(str) is always false here, the _static function is never used (in Apple clang 11.0.3 and gcc 10.1.0-RC-20200430_0).

I'm not sure how much this particular case affects overall performance, but there might be more de-optimizations like it.

Updated by shyouhei (Shyouhei Urabe) over 5 years ago Actions #11 [ruby-core:98214]

nobu (Nobuyoshi Nakada) wrote in #note-10:

As str is an argument variable and RUBY3_CONSTANT_P(str) is always false here,

Well, thank you for pointing this out. As I wrote in include/ruby/3/constant_p.h, you can apply __builtin_constant_p to an inline function argument, so I thought that RUBY3_CONSTANT_P(str) was not always false. However, https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html says:

You may use this built-in function in either a macro or an inline function. However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in, GCC never returns 1 when you call the inline function with a string constant or ...

In the particular case of ruby3_str_new_cstr(), the argument is a string, so there is no chance. The current code is in fact wrong. We have to fix it.
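The difference between the macro form and the inline-function form can be sketched as follows. This assumes GCC or clang, since __builtin_constant_p is a compiler extension; the names IS_CONSTANT_IN_MACRO, is_constant_in_inline, literal_via_macro and literal_via_inline are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Macro form: the built-in sees the caller's expression directly. */
#define IS_CONSTANT_IN_MACRO(s) (__builtin_constant_p(s) ? 1 : 0)

/* Inline-function form: the built-in sees the parameter `s`, not the
 * expression the caller wrote. */
static inline int
is_constant_in_inline(const char *s)
{
    return __builtin_constant_p(s) ? 1 : 0;
}

/* A string literal passed straight to the macro is expected to be
 * reported as constant by GCC and clang. */
static int
literal_via_macro(void)
{
    return IS_CONSTANT_IN_MACRO("abc");
}

/* Per the GCC documentation quoted above, this is not expected to
 * report a constant for a string literal. */
static int
literal_via_inline(void)
{
    return is_constant_in_inline("abc");
}
```

Printing both results under the compilers and optimization levels you care about is a quick way to check whether a given toolchain behaves as nobu observed.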

Updated by naruse (Yui NARUSE) over 5 years ago Actions #12 [ruby-core:98264]

I want Ruby 2.8/3.0 to be faster than 2.7 by default.
NDEBUG is not acceptable.
I think Microsoft's _DEBUG approach is more reasonable.

Updated by shyouhei (Shyouhei Urabe) over 5 years ago Actions #13 [ruby-core:98277]

naruse (Yui NARUSE) wrote in #note-12:

NDEBUG is not acceptable.

NDEBUG is not my invention. Please file a bug report to upstream (ISO/IEC JTC1/SC22/WG14).

I'm not against defining it by default, though.

Updated by ko1 (Koichi Sasada) over 5 years ago Actions #14

  • Status changed from Open to Closed

Applied in changeset git|21991e6ca59274e41a472b5256bd3245f6596c90.


Use RUBY_DEBUG instead of NDEBUG

Assertions in header files slow down the interpreter, so they should be
turned off by default (simple make). To enable them, define a macro
RUBY_DEBUG=1 (e.g. make cppflags=-DRUBY_DEBUG or use #define at
the very beginning of the file). Note that even if NDEBUG=1 is defined,
RUBY_DEBUG=1 enables all assertions.
[Feature #16837]
related: https://github.com/ruby/ruby/pull/3120

assert() lines in MRI *.c files are not disabled even if RUBY_DEBUG=0;
they can be disabled with NDEBUG=1. So please consider using
RUBY_ASSERT() if you want them disabled when RUBY_DEBUG=0.
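
The opt-in behaviour described in this commit message can be sketched roughly as below. This is not Ruby's actual implementation: TOY_DEBUG, TOY_ASSERT, toy_assert_failure, toy_checks_run and toy_half are hypothetical names, with TOY_DEBUG playing the role that RUBY_DEBUG plays above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical switch: off unless the build defines it,
 * e.g. with -DTOY_DEBUG=1. */
#ifndef TOY_DEBUG
# define TOY_DEBUG 0
#endif

#if TOY_DEBUG
/* Out-of-line failure handler, only compiled in debug builds. */
static void
toy_assert_failure(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    abort();
}
/* NDEBUG is not consulted here, so -DNDEBUG cannot switch this off. */
# define TOY_ASSERT(expr) \
    ((expr) ? (void)0 : toy_assert_failure(__FILE__, __LINE__, #expr))
#else
/* Default build: the expression is not evaluated at all. */
# define TOY_ASSERT(expr) ((void)0)
#endif

/* Counts how many assertion expressions were actually evaluated. */
static int toy_checks_run = 0;

static int
toy_half(int n)
{
    TOY_ASSERT((++toy_checks_run, n % 2 == 0));
    return n / 2;
}
```

In a default build, calling toy_half leaves toy_checks_run untouched; building with -DTOY_DEBUG=1 makes each call evaluate the check, whether or not NDEBUG is also defined.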
