Feature #15085

Decrease memory cache usage of MJIT

Added by wanabe (_ wanabe) 4 months ago. Updated 2 months ago.

Target version:


MJIT ordinarily makes Ruby methods faster, but I have observed some exceptional cases where it does not.
I guess one of them is caused by the invokesuper instruction.
And I guess it is related to memory caching, especially iTLB.

Attached "export-big-func.patch" makes MJIT binary code for invokesuper smaller.
"super.rb" is a benchmark script with benchmark_driver.
"benchmark.log" is a result of super.rb.
"benchmark-with-perf.log" is another result with PERF_STAT environment variable.
The results come only from my environment and depend largely on machine specs.
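
The attached super.rb is not reproduced here. As a rough illustration of what a micro benchmark for invokesuper can look like, here is a minimal sketch that uses the stdlib Benchmark module instead of benchmark_driver. The class names Base and Derived and the helper run_super_benchmark are hypothetical and not taken from the attachment.

```ruby
require "benchmark"

# Hypothetical classes for illustration; not taken from the attached super.rb.
class Base
  def value(x)
    x + 1
  end
end

class Derived < Base
  def value(x)
    # An explicit super call, which CRuby compiles to the invokesuper instruction.
    super(x) + 1
  end
end

# Calls Derived#value `iterations` times.
# Returns [last_result, elapsed_seconds].
def run_super_benchmark(iterations)
  obj = Derived.new
  result = nil
  elapsed = Benchmark.realtime do
    iterations.times { |i| result = obj.value(i) }
  end
  [result, elapsed]
end

iterations = 1_000_000
_result, elapsed = run_super_benchmark(iterations)
puts format("%.1f i/s", iterations / elapsed)
```

Running the same file with and without --jit on the same build gives a rough comparison similar in spirit to the "trunk" and "trunk,--jit" columns, although the absolute numbers will differ from the attached logs.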

invokesuper can get faster with exported vm_search_super_method(), but I think it is not enough.
The reason is that perf stat still shows many iTLB-load-misses.
I believe MJIT can become much faster with good care for the CPU memory caches, not only the iTLB but also L1, L2, and so on.


export-big-func.patch (934 Bytes) wanabe (_ wanabe), 09/06/2018 10:57 PM
super.rb (897 Bytes) wanabe (_ wanabe), 09/06/2018 10:59 PM
benchmark.log (624 Bytes) wanabe (_ wanabe), 09/06/2018 11:03 PM
benchmark-with-perf.log (7.05 KB) wanabe (_ wanabe), 09/06/2018 11:03 PM
export-vm_call_super_method.patch (2.66 KB) wanabe (_ wanabe), 09/12/2018 11:45 PM
benchmark-with-perf-on-preview3.log (9.99 KB) wanabe (_ wanabe), 11/09/2018 12:11 AM


Updated by k0kubun (Takashi Kokubun) 4 months ago

As far as I can see from the benchmark result for the improved case, it looks good. But I would at least like to see micro benchmarks for opt_send_without_block and send, although they may not be affected much because of _mjit_compile_send. Also, what were the results for larger benchmarks (optcarrot, discourse, ...)?
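
Micro benchmarks for these instructions need Ruby snippets that actually compile to each of them. Below is a minimal sketch for checking that. It assumes CRuby (RubyVM::InstructionSequence is CRuby-specific), and it assumes that a plain call compiles to opt_send_without_block while a call with a literal block compiles to send. The SNIPPETS table and the instructions_for helper are hypothetical, and instruction names can differ between Ruby versions.

```ruby
# Hypothetical snippets, one per call instruction discussed in this ticket.
# The keys are the instruction each snippet is assumed to compile to.
SNIPPETS = {
  "opt_send_without_block" => "obj.foo(1)",
  "send"                   => "obj.foo(1) { |x| x }",
  "invokesuper"            => "def bar(x); super(x); end",
}

# Compiles `source` without running it and returns the unique
# instruction names found in the disassembly, including nested iseqs.
def instructions_for(source)
  disasm = RubyVM::InstructionSequence.compile(source).disasm
  disasm.scan(/^\d{4} (\w+)/).flatten.uniq
end

SNIPPETS.each do |insn, source|
  found = instructions_for(source).include?(insn)
  puts "#{insn}: #{found ? 'found' : 'not found'} in #{source.inspect}"
end
```

If a snippet reports "not found", the running Ruby compiles that call differently, and the benchmark body should be adjusted before comparing results.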

> And I guess it is related to memory caching, especially iTLB.
> invokesuper can get faster with exported vm_search_super_method(), but I think it is not enough.

My assumption in exporting only rb_vm_search_method_slowpath was that we should inline as much as possible to exploit compiler optimizations, but compiling (the rb_vm_search_method_slowpath part of) vm_search_method was too slow to compile many methods within the default Optcarrot measurement period. I was not thinking about the CPU cache when I decided not to compile it, and I assume we should inline everything if compilation finished in 0 seconds.

Why do you think not inlining vm_search_method is friendlier to the iTLB? Is the generated code for vm_search_method too big, or is loading instructions from vm_search_method more efficient when its code is shared with the VM?

Updated by wanabe (_ wanabe) 4 months ago

I apologize in advance: I've decided to withdraw this ticket and its patch.
I tried to find out what is going on and explain it, but ended up getting nowhere.

I also tried to explain how I had arrived at vm_search_super_method and iTLB,
but I had forgotten to write a memo and I can't remember now.

To make matters worse, my assumption "Big function makes iTLB-load-count bad" is totally wrong.
For example, the attached "export-vm_call_super_method.patch" shows almost the same result in my environment,
although vm_call_super_method is a very small function.
(Note that the JIT compile time is the same as on trunk.)

Warming up --------------------------------------
        344.327k i/s
Calculating -------------------------------------
                          trunk  trunk,--jit  export-big-func  export-big-func,--jit  export-vm_call_super_method  export-vm_call_super_method,--jit 
        353.285k     224.659k         340.091k               368.875k                     343.167k                           386.849k i/s -      1.033M times in 2.923926s 4.597989s 3.037365s 2.800350s 3.010142s 2.670239s

export-vm_call_super_method,--jit:    386849.3 i/s 
export-big-func,--jit:    368875.3 i/s - 1.05x  slower
               trunk:    353285.2 i/s - 1.10x  slower
export-vm_call_super_method:    343166.5 i/s - 1.13x  slower
     export-big-func:    340090.8 i/s - 1.14x  slower
         trunk,--jit:    224659.1 i/s - 1.72x  slower
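
To make the comparison easier to read, the ratios can be recomputed from the i/s figures listed above. This is a minimal sketch; the helpers slowdowns and jit_speedup are hypothetical names, and the numbers are copied from the comparison lines of this comment.

```ruby
# Iterations per second, copied from the comparison lines above.
RESULTS = {
  "export-vm_call_super_method,--jit" => 386849.3,
  "export-big-func,--jit"             => 368875.3,
  "trunk"                             => 353285.2,
  "export-vm_call_super_method"       => 343166.5,
  "export-big-func"                   => 340090.8,
  "trunk,--jit"                       => 224659.1,
}

# Ranks entries from fastest to slowest and reports how many times
# slower each one is than the fastest entry (the "x slower" column).
def slowdowns(results)
  best = results.values.max
  results.sort_by { |_name, ips| -ips }
         .map { |name, ips| [name, (best / ips).round(2)] }
end

# Ratio of the --jit run to the non-JIT run of the same build.
# A value below 1.0 means that --jit made this benchmark slower.
def jit_speedup(results, build)
  (results.fetch("#{build},--jit") / results.fetch(build)).round(2)
end

slowdowns(RESULTS).each { |name, ratio| puts format("%-36s %.2fx", name, ratio) }
%w[trunk export-big-func export-vm_call_super_method].each do |build|
  puts format("%-36s --jit / no JIT = %.2f", build, jit_speedup(RESULTS, build))
end
```

Seen per build, --jit lowers throughput on trunk (ratio below 1.0) but raises it on both patched builds, which is consistent with the observation that the small-function patch helps about as much as the big-function one.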

So I am giving up on this ticket, at least until I can explain the results.
I'm sorry for the confusion.

Updated by k0kubun (Takashi Kokubun) 4 months ago

I see. Thank you for the experiment and for taking the time to investigate.

Updated by wanabe (_ wanabe) 2 months ago

The issue is almost gone on v2_6_0_preview3.
invokesuper on MJIT runs about as fast as on the normal VM.

The attached "benchmark-with-perf-on-preview3.log" is the benchmark result.
