Misc #15007


Let all Init_xxx and extension APIs frequently called from init code paths be considered cold

Added by methodmissing (Lourens Naudé) about 3 years ago. Updated almost 3 years ago.



References GitHub PR


An incremental extraction from PR, specifically addressing the feedback from Yui Naruse in

The Linux kernel, PHP 7 and other projects use the hot and cold function attributes to help with better code layout.

I noticed Ruby is very much CPU frontend bound (instructions are not fed into the CPU pipelines as fast as they could be), and therefore even most micro benchmarks have a high CPI (cycles per instruction) rate. This PR is part of a larger chunk of work I'd like to do around improving CPU frontend throughput, and I can take a stab at formally writing up those ideas if there's any interest from the community.


This PR focuses exclusively on flagging the Init_xxx functions for the core classes, and for the extensions bundled in ext, to be optimized for size, as they're called only once at runtime.

The GCC specific cold function attribute works in the following way (from GCC docs):

The cold attribute is used to inform the compiler that a function is unlikely executed. The function is optimized for size rather than speed and on many targets it is placed into special subsection of the text section so all cold functions appears close together improving code locality of non-cold parts of program. The paths leading to call of cold functions within code are marked as unlikely by the branch prediction mechanism. It is thus useful to mark functions used to handle unlikely conditions, such as perror, as cold to improve optimization of hot functions that do call marked functions in rare occasions.
When profile feedback is available, via -fprofile-use, hot functions are automatically detected and this attribute is ignored.
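
As a rough illustration of the behaviour quoted above, here is a minimal C sketch, assuming GCC or Clang. `EXAMPLE_COLD`, `report_failure` and `checked_div` are made-up names for this example and do not come from the PR.

```c
#include <stdio.h>

/* Hypothetical macro name (an assumption for this sketch): expands to the
 * cold attribute on GCC/Clang and to nothing on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
# define EXAMPLE_COLD __attribute__((cold))
#else
# define EXAMPLE_COLD
#endif

/* Rarely executed error path: optimized for size and, on many targets,
 * emitted into a separate subsection of the text section. */
EXAMPLE_COLD static int report_failure(const char *what)
{
    fprintf(stderr, "failed: %s\n", what);
    return -1;
}

/* Frequently executed path: the branch leading to the cold call is
 * treated as unlikely by the compiler. */
int checked_div(int num, int den, int *out)
{
    if (den == 0) {
        return report_failure("division by zero");
    }
    *out = num / den;
    return 0;
}
```

The attribute only changes code layout and optimization choices; the observable behaviour of `checked_div` is the same with or without it.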

By declaring a function as cold when defined we get the following benefits:

  • No-op on platforms that do not support the attribute
  • Size optimization of cold functions with a smaller footprint in the instruction cache
  • CPU frontend throughput therefore increases due to a lower ratio of instruction cache misses and lower ITLB overhead (see the original chunky PR vs. then trunk)
  • This effect can further be amplified in future work with the hot attribute

Extension APIs flagged as cold

These are, and typically should be, called only on extension init, and are thus safe to optimize for size as well.

  • void rb_define_method_id(VALUE, ID, VALUE (*)(ANYARGS), int);
  • void rb_undef(VALUE, ID);
  • void rb_define_protected_method(VALUE, const char*, VALUE (*)(ANYARGS), int);
  • void rb_define_private_method(VALUE, const char*, VALUE (*)(ANYARGS), int);
  • void rb_define_singleton_method(VALUE, const char*, VALUE(*)(ANYARGS), int);
  • void rb_define_alloc_func(VALUE, rb_alloc_func_t);
  • void rb_undef_alloc_func(VALUE);
  • VALUE rb_define_class(const char*,VALUE);
  • VALUE rb_define_module(const char*);
  • VALUE rb_define_class_under(VALUE, const char*, VALUE);
  • VALUE rb_define_module_under(VALUE, const char*);
  • void rb_define_variable(const char*,VALUE*);
  • void rb_define_virtual_variable(const char*,VALUE(*)(ANYARGS),void(*)(ANYARGS));
  • void rb_define_hooked_variable(const char*,VALUE*,VALUE(*)(ANYARGS),void(*)(ANYARGS));
  • void rb_define_readonly_variable(const char*,const VALUE*);
  • void rb_define_const(VALUE,const char*,VALUE);
  • void rb_define_global_const(const char*,VALUE);
  • void rb_define_method(VALUE,const char*,VALUE(*)(ANYARGS),int);
  • void rb_define_module_function(VALUE,const char*,VALUE(*)(ANYARGS),int);
  • void rb_define_global_function(const char*,VALUE(*)(ANYARGS),int);
  • void rb_undef_method(VALUE,const char*);
  • void rb_define_alias(VALUE,const char*,const char*);
  • void rb_define_attr(VALUE,const char*,int,int);
  • void rb_global_variable(VALUE*);
  • void rb_gc_register_mark_object(VALUE);
  • void rb_gc_register_address(VALUE*);
  • void rb_gc_unregister_address(VALUE*);
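
The pattern behind this list can be sketched without the Ruby headers. The code below is a stand-in only: `EXAMPLE_COLD`, `example_define_method`, `Init_example` and `example_call` are hypothetical names that mimic the shape of a definition API, an Init_xxx function and a hot call path, and the tiny lookup table is an assumption made so the example compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical macro: cold where the compiler reports support, no-op elsewhere. */
#if defined(__has_attribute)
# if __has_attribute(cold)
#  define EXAMPLE_COLD __attribute__((cold))
# endif
#endif
#ifndef EXAMPLE_COLD
# define EXAMPLE_COLD /* attribute unsupported: expands to nothing */
#endif

typedef int (*example_func_t)(int);

struct example_entry {
    const char *name;
    example_func_t func;
};

static struct example_entry example_table[8];
static size_t example_count;

/* Stand-in for a definition API that is expected to run only during init. */
EXAMPLE_COLD int example_define_method(const char *name, example_func_t func)
{
    if (example_count == sizeof(example_table) / sizeof(example_table[0])) {
        return -1; /* table full */
    }
    example_table[example_count].name = name;
    example_table[example_count].func = func;
    example_count++;
    return 0;
}

static int example_double(int x) { return 2 * x; }
static int example_negate(int x) { return -x; }

/* Stand-in for an Init_xxx function: called once, so flagged cold as well. */
EXAMPLE_COLD void Init_example(void)
{
    example_define_method("double", example_double);
    example_define_method("negate", example_negate);
}

/* Called repeatedly after init, so left with the default optimization. */
int example_call(const char *name, int arg, int *result)
{
    for (size_t i = 0; i < example_count; i++) {
        if (strcmp(example_table[i].name, name) == 0) {
            *result = example_table[i].func(arg);
            return 0;
        }
    }
    return -1; /* no such method */
}
```

Only the registration and init functions carry the attribute; anything on the per-call path stays untouched, which is the split this list is aiming for.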

Text segment reductions

Small changes (a 3144-byte reduction of the text segment) because this is incremental groundwork and an initial low-risk PR.

This branch:

lourens@CarbonX1:~/src/ruby/ruby$ size ruby
   text    data     bss     dec     hex filename
3462153   21056   71344 3554553  363cf9 ruby


trunk:

lourens@CarbonX1:~/src/ruby/trunk$ size ruby
   text    data     bss     dec     hex filename
3465297   21056   71344 3557697  364941 ruby

Diffs for individual object files:

Default .text.unlikely section, where the init functions are moved to:

lourens@CarbonX1:~/src/ruby/ruby$ readelf -S vm.o
There are 34 section headers, starting at offset 0x2a04f8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       000000000001c37f  0000000000000000  AX       0     0     16
  [ 2] .rela.text        RELA             0000000000000000  00114100
       000000000000a7d0  0000000000000018   I      31     1     8
  [ 3] .data             PROGBITS         0000000000000000  0001c3c0
       0000000000000030  0000000000000000  WA       0     0     16
  [ 4] .bss              NOBITS           0000000000000000  0001c400
       00000000000002b0  0000000000000000  WA       0     0     32
  [ 5] .rodata.str1.8    PROGBITS         0000000000000000  0001c400
       0000000000000d6f  0000000000000001 AMS       0     0     8
  [ 6] .text.unlikely    PROGBITS         0000000000000000  0001d16f <<<<<<<<<<<<<<<
       0000000000001aa9  0000000000000000  AX       0     0     1

The link map for vm.o:

lourens@CarbonX1:~/src/ruby/ruby$ ld -M vm.o
--- truncated ---
.text           0x0000000000400120    0x1de2f
 *(.text.unlikely .text.*_unlikely .text.unlikely.*)
                0x0000000000400120     0x1aa9 vm.o
                0x000000000040038f                rb_define_alloc_func
                0x00000000004003bf                rb_undef_alloc_func
                0x00000000004003c5                Init_Method
                0x0000000000400512                Init_vm_eval
                0x00000000004007a1                Init_eval_method
                0x0000000000400a54                rb_undef
                0x0000000000400c1d                Init_VM
                0x000000000040185f                Init_BareVM
                0x0000000000401b16                Init_vm_objects
                0x0000000000401b61                Init_top_self
 *(.text.exit .text.exit.*)
 *(.text.startup .text.startup.*)
 *(.text .stub .text.* .gnu.linkonce.t.*)
 *fill*         0x0000000000401bc9        0x7 
 .text          0x0000000000401bd0    0x1c37f vm.o
                0x00000000004022f0                rb_f_notimplement
                0x0000000000404780                rb_vm_ep_local_ep
                0x00000000004047b0                rb_vm_frame_block_handler
                0x00000000004047e0                rb_vm_cref_new_toplevel
                0x0000000000404870                rb_vm_block_ep_update
                0x0000000000404890                ruby_vm_special_exception_copy
                0x0000000000406960                rb_ec_stack_overflow
                0x00000000004069c0                rb_vm_push_frame
                0x0000000000406b20                rb_vm_pop_frame
                0x0000000000406b30                rb_error_arity
                0x0000000000407180                rb_vm_frame_method_entry
                0x00000000004075e0                rb_vm_rewrite_cref
                0x00000000004076f0                rb_simple_iseq_p
                0x0000000000407700                rb_vm_opt_struct_aref
                0x0000000000407730                rb_vm_opt_struct_aset
                0x0000000000407750                rb_clear_constant_cache
--- truncated ---

I also dabbled with the idea of an INITFUNC macro that also places the Init_xxx functions into a text.init section, as the kernel does, for a possible future optimization of stripping out ELF sections for setup / init specific functions. I don't think that makes sense for now, and it is possibly only interesting for mruby or embedded use.
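
For reference, such a macro could look roughly like the sketch below. This is an assumption about its shape, not code from the PR: `EXAMPLE_INITFUNC` is a hypothetical name, the `.text.init` section name simply mirrors the wording above, and the section attribute is only applied on ELF targets with GCC or Clang.

```c
/* Hypothetical INITFUNC-style macro: marks the function cold and places it
 * in a dedicated section on ELF targets; expands to nothing elsewhere. */
#if defined(__ELF__) && (defined(__GNUC__) || defined(__clang__))
# define EXAMPLE_INITFUNC __attribute__((cold, section(".text.init")))
#else
# define EXAMPLE_INITFUNC
#endif

static int example_initialized;

/* Stand-in for an Init_xxx function that runs once at startup. */
EXAMPLE_INITFUNC void Init_example_section(void)
{
    example_initialized = 1;
}

/* Regular function, left in the default text section. */
int example_is_initialized(void)
{
    return example_initialized;
}
```

On an ELF system, running `readelf -S` on the resulting object file should list the extra section alongside .text.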

Possible next units of work

Cold code specific

TLB (translation lookaside buffer) specific

  • Further ITLB overhead investigation
  • Ruby binaries built with -O3 and debug symbols come in at just short of 18MB, or roughly 9 hugepages on Linux. PHP core developers were able to squeeze out a few percent by remapping code to hugepages on supported systems.
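
The hugepage estimate above is a ceiling division. A small sketch, assuming 2 MiB hugepages (the common default on x86-64 Linux); the function and macro names are made up for this example:

```c
#include <stddef.h>

/* Assumed hugepage size: 2 MiB. Other sizes exist on other configurations. */
#define EXAMPLE_HUGEPAGE_SIZE ((size_t)2 * 1024 * 1024)

/* Number of hugepages needed to cover `bytes`, rounding up. */
size_t example_hugepages_needed(size_t bytes)
{
    return (bytes + EXAMPLE_HUGEPAGE_SIZE - 1) / EXAMPLE_HUGEPAGE_SIZE;
}
```

With this, an 18 MiB image needs 9 hugepages, while the 3462153-byte text segment reported by `size` above would fit in 2.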

Bytecode specific

  • The Intel Tracing Task API is very well suited for the instruction sequences YARV generates and to infer better per instruction CPU utilization and identify any stalls (frontend, backend, branches etc.) to drive further work.
