Feature #16837
Can we make Ruby 3.0 as fast as Ruby 2.7 with the new assertions? (closed)
Added by k0kubun (Takashi Kokubun) over 5 years ago. Updated over 5 years ago.
Description
Problem
How can we make Ruby 3.0 as fast as (or faster than) Ruby 2.7?
Background
- "Split ruby.h" (https://github.com/ruby/ruby/pull/2991) added some new assertions
- While this has been helpful for revealing various bugs, it also made some Ruby programs notably slower, especially Optcarrot: https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html
Possible approaches
I have no strong preference yet. Here are some random ideas:
- Optimize the assertion code somehow
- Enable the new assertions only on CIs, at least ones in hot spots
  - Not sure yet which places have a large impact on Optcarrot
- Make some other not-so-important assertions CI-only to offset the impact from new ones
- Provide .so for an assertion-enabled mode? (ko1's idea)
I hope people will add more ideas in comments on this ticket.
        
Updated by k0kubun (Takashi Kokubun) over 5 years ago
#1
- Tracker changed from Bug to Feature
- Backport deleted (2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN)
        
Updated by k0kubun (Takashi Kokubun) over 5 years ago
#2
- Description updated
        
Updated by k0kubun (Takashi Kokubun) over 5 years ago
#3
- Description updated
        
Updated by shyouhei (Shyouhei Urabe) over 5 years ago
#4 [ruby-core:98182]
I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C practice since long before Ruby's birth.
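
As a minimal sketch of that convention (the names `checked_div` and `assertions_enabled` are hypothetical and not part of Ruby): the standard `assert()` macro from `<assert.h>` expands to a no-op when `NDEBUG` is defined before the header is included, so a build with `-DNDEBUG` drops both the check and its failure path.

```c
#include <assert.h>

/* Hypothetical helper for illustration only; not taken from the Ruby sources.
 * The assert() enforces the precondition in normal builds. When compiled
 * with -DNDEBUG, <assert.h> defines assert(x) as ((void)0), so neither the
 * comparison nor the abort path is emitted. */
static long
checked_div(long numerator, long denominator)
{
    assert(denominator != 0);
    return numerator / denominator;
}

/* Reports whether assert() is active in this translation unit. */
static int
assertions_enabled(void)
{
#ifdef NDEBUG
    return 0;
#else
    return 1;
#endif
}
```

Compiling the same file with and without `-DNDEBUG` and comparing the disassembly of a caller of `checked_div` is one way to see the difference.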
        
Updated by shyouhei (Shyouhei Urabe) over 5 years ago
#5 [ruby-core:98183]
Some analysis of the slowdown.
Looking at the generated binary and perf output, the slowdown occurs because some functions are no longer inlined. It might depend on the compiler, but for me rb_array_len() is one such victim:
zsh % gdb -batch -ex 'file miniruby' -ex 'disassemble rb_array_len'
Dump of assembler code for function rb_array_len:
   0x0000000000295540 <+0>:     push   %rbx
   0x0000000000295541 <+1>:     mov    %rdi,%rbx
   0x0000000000295544 <+4>:     test   $0x7,%bl
   0x0000000000295547 <+7>:     jne    0x2955be <rb_array_len+126>
   0x0000000000295549 <+9>:     mov    %rbx,%rax
   0x000000000029554c <+12>:    and    $0xfffffffffffffff7,%rax
   0x0000000000295550 <+16>:    je     0x2955be <rb_array_len+126>
   0x0000000000295552 <+18>:    mov    (%rbx),%rax
   0x0000000000295555 <+21>:    mov    %eax,%edx
   0x0000000000295557 <+23>:    and    $0x1f,%edx
   0x000000000029555a <+26>:    mov    $0x7,%ecx
   0x000000000029555f <+31>:    cmp    $0x7,%edx
   0x0000000000295562 <+34>:    jne    0x295585 <rb_array_len+69>
   0x0000000000295564 <+36>:    test   $0x2000,%eax                 ;; <- This is `RB_FL_ANY_RAW(a, RARRAY_EMBED_FLAG)`
   0x0000000000295569 <+41>:    jne    0x295571 <rb_array_len+49>
   0x000000000029556b <+43>:    mov    0x10(%rbx),%rax              ;; <-
   0x000000000029556f <+47>:    pop    %rbx                         ;; <- This is `return RARRAY(a)->as.heap.len;`
   0x0000000000295570 <+48>:    retq                                ;; <-
   0x0000000000295571 <+49>:    cmp    $0x7,%ecx
   0x0000000000295574 <+52>:    jne    0x2955a2 <rb_array_len+98>
   0x0000000000295576 <+54>:    test   $0x2000,%eax
   0x000000000029557b <+59>:    je     0x2955ea <rb_array_len+170>
   0x000000000029557d <+61>:    shr    $0xf,%eax                    ;; <-
   0x0000000000295580 <+64>:    and    $0x3,%eax                    ;; <- This is `return RARRAY_EMBED_LEN(a);`
   0x0000000000295583 <+67>:    pop    %rbx                         ;; <-
   0x0000000000295584 <+68>:    retq                                ;; <-
   0x0000000000295585 <+69>:    mov    %rbx,%rdi
   0x0000000000295588 <+72>:    mov    $0x7,%esi
   0x000000000029558d <+77>:    callq  0xcaea2 <rb_check_type>
   0x0000000000295592 <+82>:    mov    (%rbx),%rax
   0x0000000000295595 <+85>:    mov    %eax,%ecx
   0x0000000000295597 <+87>:    and    $0x1f,%ecx
   0x000000000029559a <+90>:    cmp    $0x1b,%rcx
   0x000000000029559e <+94>:    jne    0x295564 <rb_array_len+36>
   0x00000000002955a0 <+96>:    jmp    0x2955cb <rb_array_len+139>
   0x00000000002955a2 <+98>:    mov    %rbx,%rdi
   0x00000000002955a5 <+101>:   mov    $0x7,%esi
   0x00000000002955aa <+106>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955af <+111>:   mov    (%rbx),%rax
   0x00000000002955b2 <+114>:   mov    %eax,%ecx
   0x00000000002955b4 <+116>:   and    $0x1f,%ecx
   0x00000000002955b7 <+119>:   cmp    $0x1b,%ecx
   0x00000000002955ba <+122>:   jne    0x295576 <rb_array_len+54>
   0x00000000002955bc <+124>:   jmp    0x2955cb <rb_array_len+139>
   0x00000000002955be <+126>:   mov    %rbx,%rdi
   0x00000000002955c1 <+129>:   mov    $0x7,%esi
   0x00000000002955c6 <+134>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955cb <+139>:   lea    0x142fe(%rip),%rdi        # 0x2a98d0
   0x00000000002955d2 <+146>:   lea    0x1432f(%rip),%rdx        # 0x2a9908
   0x00000000002955d9 <+153>:   lea    0x14337(%rip),%rcx        # 0x2a9917
   0x00000000002955e0 <+160>:   mov    $0xea,%esi
   0x00000000002955e5 <+165>:   callq  0xcad86 <rb_assert_failure>
   0x00000000002955ea <+170>:   lea    0x14338(%rip),%rdi        # 0x2a9929
   0x00000000002955f1 <+177>:   lea    0x1436d(%rip),%rdx        # 0x2a9965
   0x00000000002955f8 <+184>:   lea    0x14377(%rip),%rcx        # 0x2a9976
   0x00000000002955ff <+191>:   mov    $0x79,%esi
   0x0000000000295604 <+196>:   callq  0xcad86 <rb_assert_failure>
End of assembler dump.
Here, the assertions practically never fail. This means the jumps are predicted correctly almost 100% of the time, so they are nearly no-ops and do not slow things down by themselves. The problem is the unreachable branches. If you can read the assembly, you can see that almost 2/3 of the above function is never reached. These branches blow up the generated binary significantly. rb_array_len is thus now considered too big to be inlined, by my compiler at least.
An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything.  But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything.
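
To make the ad-hoc remedy concrete, here is a minimal sketch of forcing one small accessor inline. Everything prefixed `toy_` or `TOY_` is hypothetical and only loosely modelled on the disassembly above (the `0x2000` flag test and the `>> 15`, `& 3` length extraction); it is not Ruby's actual rb_array_len. The attribute is guarded because it is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC or Clang. Other compilers fall back to a plain inline hint. */
#if defined(__GNUC__) || defined(__clang__)
# define TOY_FORCE_INLINE static inline __attribute__((__always_inline__))
#else
# define TOY_FORCE_INLINE static inline
#endif

/* Hypothetical layout: one flag bit marks an embedded array, and two bits
 * above it hold the embedded length. */
#define TOY_EMBED_FLAG      0x2000UL
#define TOY_EMBED_LEN_SHIFT 15
#define TOY_EMBED_LEN_MASK  0x3UL

struct toy_array {
    unsigned long flags;
    long heap_len;   /* only meaningful when TOY_EMBED_FLAG is clear */
};

/* The assertions add failure branches to the body. The attribute asks the
 * compiler to inline the function regardless of its size heuristics. */
TOY_FORCE_INLINE long
toy_array_len(const struct toy_array *a)
{
    assert(a != NULL);
    if (a->flags & TOY_EMBED_FLAG) {
        return (long)((a->flags >> TOY_EMBED_LEN_SHIFT) & TOY_EMBED_LEN_MASK);
    }
    assert(a->heap_len >= 0);
    return a->heap_len;
}
```

Whether this actually helps still has to be measured per compiler and target, which is exactly the concern raised above.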
        
Updated by shyouhei (Shyouhei Urabe) over 5 years ago
#6 [ruby-core:98184]
If you recompile everything using ./configure cppflags=-DNDEBUG, those assertions are eliminated, which lets compilers inline rb_array_len again.
        
Updated by shevegen (Robert A. Heiler) over 5 years ago
#7 [ruby-core:98185]
I have a question concerning one point mentioned above.
k0kubun wrote:
> Provide .so for an assertion-enabled mode? (ko1's idea)
Could someone briefly explain the general idea behind this? I assume for a .so
file the ruby user would have to require/load that file, but what would be the
perceived benefits/disadvantages of doing so?
        
Updated by k0kubun (Takashi Kokubun) over 5 years ago
#8 [ruby-core:98194]
> I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C practice since long before Ruby's birth.
Got it. I'll consider using -DNDEBUG on benchmark servers at least. Also, maybe it's worth noting it in NEWS for those who package Ruby for performance-sensitive use?
> An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything. But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything.
Agreed. While it's not a good idea to always inline everything, some functions may be worth considering, though.
> I assume for a .so file the ruby user would have to require/load that file
His idea was to install the .so file into the Ruby prefix by default and add a --debug-xxx option to load it.
        
Updated by k0kubun (Takashi Kokubun) over 5 years ago
#9
- Related to Bug #16840: Decrease in Hash#[]= performance with object keys added
        
Updated by nobu (Nobuyoshi Nakada) over 5 years ago
#10 [ruby-core:98212]
It's not only the assertions: some optimizations can no longer be applied.
For instance, rb_str_new_cstr was defined as follows in 2.7:
#define rb_str_new_cstr(str) RB_GNUC_EXTENSION_BLOCK(	\
    (__builtin_constant_p(str)) ?		\
	rb_str_new_static((str), (long)strlen(str)) : \
	rb_str_new_cstr(str)			\
)
and rb_str_new_cstr("...") was expected to be compiled as rb_str_new_static("...", 3).
Below is the master version:
static inline VALUE
ruby3_str_new_cstr(const char *str)
{
    if /* constexpr */ (! RUBY3_CONSTANT_P(str)) {
        return rb_str_new_cstr(str);
    }
    else {
        long len = ruby3_strlen(str);
        return rb_str_new_static(str, len);
    }
}
As str is a function argument, RUBY3_CONSTANT_P(str) is always false here, so the _static function is never used (with Apple clang 11.0.3 and gcc 10.1.0-RC-20200430_0).
I'm not sure how much this particular case affects overall performance, but there may be more de-optimizations like it.
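
The difference between the two forms can be sketched in isolation. The names below (`toy_str`, `toy_new_static`, `toy_new_dynamic`, `TOY_NEW_CSTR`, `toy_new_cstr_inline`) are hypothetical stand-ins rather than Ruby's API, and `__builtin_constant_p` is assumed to be available only on GCC-compatible compilers. In the macro form the builtin sees the call-site expression (possibly a string literal); in the inline-function form it sees the parameter `str`.

```c
#include <string.h>

/* Hypothetical string handle; is_static records which constructor ran. */
struct toy_str {
    const char *ptr;
    long len;
    int is_static;   /* 1 if built via the "static" path, else 0 */
};

/* Stand-in for a constructor that receives a precomputed length. */
static struct toy_str
toy_new_static(const char *ptr, long len)
{
    struct toy_str s;
    s.ptr = ptr;
    s.len = len;
    s.is_static = 1;
    return s;
}

/* Stand-in for a constructor that must measure the string at run time. */
static struct toy_str
toy_new_dynamic(const char *ptr)
{
    struct toy_str s;
    s.ptr = ptr;
    s.len = (long)strlen(ptr);
    s.is_static = 0;
    return s;
}

/* Assumption: GCC or Clang; elsewhere the check is simply always false. */
#if defined(__GNUC__) || defined(__clang__)
# define TOY_CONSTANT_P(x) __builtin_constant_p(x)
#else
# define TOY_CONSTANT_P(x) 0
#endif

/* Macro form: the builtin is applied to the caller's own expression.
 * Note that str is expanded more than once. */
#define TOY_NEW_CSTR(str) \
    (TOY_CONSTANT_P(str) ? toy_new_static((str), (long)strlen(str)) \
                         : toy_new_dynamic(str))

/* Inline-function form: the builtin is applied to a function parameter. */
static inline struct toy_str
toy_new_cstr_inline(const char *str)
{
    if (!TOY_CONSTANT_P(str)) {
        return toy_new_dynamic(str);
    }
    return toy_new_static(str, (long)strlen(str));
}
```

Both forms yield the same length. Inspecting `is_static` for `TOY_NEW_CSTR("abc")` and `toy_new_cstr_inline("abc")` with your own compiler and optimization level shows which path each one takes; the result is compiler-dependent, so none is claimed here.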
        
Updated by shyouhei (Shyouhei Urabe) over 5 years ago
#11 [ruby-core:98214]
nobu (Nobuyoshi Nakada) wrote in #note-10:
> As str is an argument variable and RUBY3_CONSTANT_P(str) is always false here,
Well, thank you for pointing this out. As I wrote in include/ruby/3/constant_p.h, you can apply __builtin_constant_p to an inline function argument, so I thought that RUBY3_CONSTANT_P(str) was not always false. However, https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html says:
> You may use this built-in function in either a macro or an inline function. However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in, GCC never returns 1 when you call the inline function with a string constant or ...
In the particular case of ruby3_str_new_cstr(), the argument is a string, so there is no chance of the built-in returning 1. The current code is in fact wrong, and we have to fix it.
        
Updated by naruse (Yui NARUSE) over 5 years ago
#12 [ruby-core:98264]
I want Ruby 2.8/3.0 to be faster than 2.7 by default.
NDEBUG is not acceptable.
I think Microsoft's _DEBUG approach is more reasonable.
        
Updated by shyouhei (Shyouhei Urabe) over 5 years ago
#13 [ruby-core:98277]
naruse (Yui NARUSE) wrote in #note-12:
> NDEBUG is not acceptable.
NDEBUG is not my invention. Please file a bug report to upstream (ISO/IEC JTC1/SC22/WG14).
I'm not against defining it by default, though.
        
Updated by ko1 (Koichi Sasada) over 5 years ago
#14
- Status changed from Open to Closed
Applied in changeset git|21991e6ca59274e41a472b5256bd3245f6596c90.
Use RUBY_DEBUG instead of NDEBUG
Assertions in header files slow down the interpreter, so they should be
turned off by default (simple make). To enable them, define the macro
RUBY_DEBUG=1 (e.g. make cppflags=-DRUBY_DEBUG or use #define at
the very beginning of the file). Note that even if NDEBUG=1 is defined,
RUBY_DEBUG=1 enables all assertions.
[Feature #16837]
related: https://github.com/ruby/ruby/pull/3120
assert() calls in MRI *.c files are not disabled even if RUBY_DEBUG=0;
they can be disabled with NDEBUG=1. So please consider using
RUBY_ASSERT() if you want them to be disabled when RUBY_DEBUG=0.
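
The scheme described in this commit message can be sketched as follows. `TOY_DEBUG`, `TOY_ASSERT`, `toy_assert_fail`, and `toy_at` are hypothetical names chosen for illustration; this is a minimal model of an opt-in assertion macro, not the contents of Ruby's headers. Because the macro never consults `NDEBUG`, defining `TOY_DEBUG=1` enables the checks even in a `-DNDEBUG` build, while plain `assert()` calls keep following `NDEBUG`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumption: TOY_DEBUG is a hypothetical project-specific switch.
 * It is off unless the build defines it, e.g. with -DTOY_DEBUG. */
#ifndef TOY_DEBUG
# define TOY_DEBUG 0
#endif

#if TOY_DEBUG
/* Failure path, only compiled into debug builds. */
static void
toy_assert_fail(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    abort();
}
# define TOY_ASSERT(expr) \
    ((expr) ? (void)0 : toy_assert_fail(__FILE__, __LINE__, #expr))
#else
/* Default build: the condition is not evaluated and no branch is emitted. */
# define TOY_ASSERT(expr) ((void)0)
#endif

/* Small header-style accessor whose check disappears by default. */
static inline long
toy_at(const long *buf, long len, long i)
{
    (void)len;   /* avoids an unused-parameter warning when checks are off */
    TOY_ASSERT(buf != NULL && 0 <= i && i < len);
    return buf[i];
}
```

With the default setting, `toy_at` reduces to a single load, which keeps it small enough for typical inlining heuristics; building with `-DTOY_DEBUG` restores the bounds check.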