Feature #15085
Decrease memory cache usage of MJIT (Closed)
Description
MJIT ordinarily makes Ruby methods faster, but I have observed some exceptional cases. I suspect one of them is caused by the invokesuper instruction, and I guess it is related to memory caching, especially the iTLB.
The attached "export-big-func.patch" makes the MJIT binary code for invokesuper smaller.
"super.rb" is a benchmark script for benchmark_driver.
"benchmark.log" is the result of running super.rb.
"benchmark-with-perf.log" is another result, taken with the PERF_STAT environment variable set.
The results are only from my environment and depend largely on the machine specs.
invokesuper gets faster when vm_search_super_method() is exported, but I do not think that is enough, because perf stat still shows many iTLB-load-misses.
I believe MJIT can get faster with better care for the CPU memory caches, not only the iTLB but also L1 / L2 and so on.
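For context, a micro-benchmark exercising invokesuper with benchmark_driver could be sketched as below. This is only an illustration of the kind of script meant by "super.rb", not the attached file itself; the class names and the rbenv build names ("trunk", "trunk,--jit") are assumptions.

require 'benchmark_driver'

Benchmark.driver do |x|
  x.prelude <<~RUBY
    class Base
      def foo
        nil
      end
    end

    class A < Base
      def foo
        super  # this call site compiles to the invokesuper instruction
      end
    end

    a = A.new
  RUBY

  # Measure the super-calling method.
  x.report 'a.foo', %{ a.foo }

  # Compare rbenv-managed builds, with and without MJIT (--jit).
  x.rbenv 'trunk', 'trunk,--jit'
end

A comparison like the one later in this ticket would come from running such a script across the listed builds; the iTLB numbers themselves come from wrapping the run in perf stat.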
Updated by k0kubun (Takashi Kokubun) over 6 years ago
As far as I can see from the benchmark result, the improved case looks good. But I would at least like to see micro benchmarks for opt_send_without_block and send. Because of _mjit_compile_send, they may not be affected so much, though. Also, what were the results for larger benchmarks (optcarrot, discourse, ...)?
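For illustration of what such micro benchmarks might look like (a hypothetical sketch, not a script attached to this ticket): a plain call with no block exercises opt_send_without_block, while a call that passes a block goes through the generic send instruction.

require 'benchmark_driver'

Benchmark.driver do |x|
  x.prelude <<~RUBY
    class A
      def foo
        nil
      end
    end
    a = A.new
  RUBY

  # A plain call with no block is compiled to opt_send_without_block.
  x.report 'no_block', %{ a.foo }

  # Passing a block makes the call site use the generic send instruction.
  x.report 'block', %{ a.foo { nil } }
end

Which instruction a given call site actually uses can be checked with RubyVM::InstructionSequence.compile("a.foo { nil }").disasm.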
> And I guess it is related to memory caching, especially iTLB.
> invokesuper can get faster with exported vm_search_super_method(), but I think it is not enough.
My assumption when exporting only rb_vm_search_method_slowpath was that we should inline things as much as possible to exploit compiler optimizations, but that compiling vm_search_method (specifically its rb_vm_search_method_slowpath part) was too slow to let many methods be compiled within the default Optcarrot measurement period. CPU cache was not the reason for not compiling it, and I assume we should inline everything if compilation finished in zero seconds.
Why do you think that not inlining vm_search_method is friendlier to the iTLB? Is the generated code for vm_search_method too big, or is loading instructions from vm_search_method more efficient when its code is shared with the VM?
Updated by wanabe (_ wanabe) over 6 years ago
- File export-vm_call_super_method.patch added
- Status changed from Open to Rejected
I am sorry, but I have decided to withdraw this ticket and its patch.
I tried to figure out what is going on and explain it, but ended up getting nowhere.
I also tried to explain how I arrived at vm_search_super_method and the iTLB, but I forgot to write a memo and cannot remember it now.
To make matters worse, my assumption that "a big function makes the iTLB load count worse" is totally wrong.
For example, the attached "export-vm_call_super_method.patch" shows almost the same result in my environment (note that its JIT compile time is the same as trunk), even though vm_call_super_method is a very small function.
Warming up --------------------------------------
a.foo 344.327k i/s
Calculating -------------------------------------
  a.foo, 1.033M iterations per configuration:
  trunk:                              353.285k i/s  (2.923926s)
  trunk,--jit:                        224.659k i/s  (4.597989s)
  export-big-func:                    340.091k i/s  (3.037365s)
  export-big-func,--jit:              368.875k i/s  (2.800350s)
  export-vm_call_super_method:        343.167k i/s  (3.010142s)
  export-vm_call_super_method,--jit:  386.849k i/s  (2.670239s)
Comparison:
a.foo
export-vm_call_super_method,--jit: 386849.3 i/s
export-big-func,--jit: 368875.3 i/s - 1.05x slower
trunk: 353285.2 i/s - 1.10x slower
export-vm_call_super_method: 343166.5 i/s - 1.13x slower
export-big-func: 340090.8 i/s - 1.14x slower
trunk,--jit: 224659.1 i/s - 1.72x slower
So I am giving up on this ticket, at least until I can explain what is happening.
I'm sorry for confusing you.
Updated by k0kubun (Takashi Kokubun) over 6 years ago
I see. Thank you for the experiment and for taking the time to investigate.
Updated by wanabe (_ wanabe) about 6 years ago
The issue is almost gone on v2_6_0_preview3: invokesuper with MJIT runs about as fast as on the normal VM.
The attached "benchmark-with-perf-on-preview3.log" is the benchmark result.