Feature #12589

VM performance improvement proposal

Added by vmakarov (Vladimir Makarov) 8 months ago. Updated 8 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:76382]

Description

Hello. I'd like to start a big MRI project, but I don't want to
disrupt somebody else's plans. Therefore I'd like to hear MRI
developers' opinions on the proposed project, or to learn whether
somebody is already working on an analogous project.

Basically I want to improve overall MRI VM performance:

  • First of all, I'd like to change the VM insns and move from
    stack-based insns to register transfer (RTL) ones. The idea behind
    this is to decrease VM dispatch overhead, as approximately half as
    many RTL insns are needed as stack-based insns for the same
    program (for Ruby the ratio is probably even lower, as a typical
    Ruby program contains a lot of method calls whose arguments are
    passed through the stack).

    But decreasing memory traffic is an even more important advantage
    of RTL insns, as an RTL insn can address temporaries (stack) and
    local variables in any combination. So there is no need to put an
    insn result on the stack and then move it to a local variable, or
    to put a variable's value on the stack and then use it as an insn
    operand. Insns that do more also provide a bigger scope for C
    compiler optimizations.

    The biggest changes will be in the files compile.c and insns.def
    (they will basically be rewritten). So the project is not a new
    VM; the MRI VM is much more than these 2 files.

    The disadvantage of RTL insns is a bigger insn memory footprint
    (up to 30% more), although, as I wrote, there are fewer RTL insns
    overall.

    Another disadvantage of RTL insns, specifically for Ruby, is that
    the insns for call sequences will be essentially the same as the
    stack-based ones, only bigger, as they address the stack
    explicitly.
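As a toy illustration of the dispatch-count difference (this is not MRI code; the insn names and variable-slot layout are made up for the sketch), a statement like a = b + c needs four dispatched stack insns but a single RTL insn:

```ruby
# Toy interpreters contrasting stack-based and RTL dispatch for `a = b + c`.
# Variable slots: 0 => a, 1 => b, 2 => c.

def run_stack(vars)
  stack = []
  dispatches = 0
  # getlocal b; getlocal c; plus; setlocal a -- four dispatched insns
  [[:getlocal, 1], [:getlocal, 2], [:plus], [:setlocal, 0]].each do |insn, arg|
    dispatches += 1
    case insn
    when :getlocal then stack.push(vars[arg])
    when :plus     then stack.push(stack.pop + stack.pop)
    when :setlocal then vars[arg] = stack.pop
    end
  end
  [vars[0], dispatches]
end

def run_rtl(vars)
  dispatches = 0
  # plus a, b, c -- one dispatched insn addressing locals directly
  [[:plus, 0, 1, 2]].each do |insn, dst, s1, s2|
    dispatches += 1
    vars[dst] = vars[s1] + vars[s2] if insn == :plus
  end
  [vars[0], dispatches]
end

p run_stack([nil, 3, 4])  # => [7, 4]
p run_rtl([nil, 3, 4])    # => [7, 1]
```

Both compute the same value; the stack version pays four dispatches and four stack moves, the RTL version one dispatch and none.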

  • Secondly, I'd like to combine some frequent insn sequences into
    bigger insns. Again this decreases insn dispatch overhead and
    memory traffic even more. It also permits removing some type
    checking.

    The first thing on my mind is a sequence of a compare insn and a
    branch, plus using immediate operands besides temporary (stack)
    and local variables. This is not a trivial task for Ruby, as the
    compare can be implemented as a method.
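A toy sketch of such a fusion (again, not MRI code): merging the compare and the branch drops one dispatch and the intermediate boolean value:

```ruby
# One loop-condition step, executed as two insns vs. one fused insn.

def step_separate(i, limit)
  dispatches = 0
  dispatches += 1; flag = i < limit   # lt: compare, push a boolean
  dispatches += 1; taken = flag       # branch_if: pop the boolean, decide
  [taken, dispatches]
end

def step_fused(i, limit)
  dispatches = 0
  dispatches += 1; taken = i < limit  # btlti-style: compare and decide at once
  [taken, dispatches]
end

p step_separate(5, 10)  # => [true, 2]
p step_fused(5, 10)     # => [true, 1]
```

Over a hot loop this halves the dispatches spent on the loop condition, which is where the micro-benchmark below gets much of its speedup.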

I already did some experiments. RTL insns and combined insns permit
speeding up the following micro-benchmark by more than 2 times:

i = 0
while i<30_000_000 # benchmark loop 1
  i += 1
end

The generated RTL insns for the benchmark are

== disasm: #<ISeq:<main>@while.rb>======================================
== catch table
| catch type: break  st: 0007 ed: 0020 sp: 0000 cont: 0020
| catch type: next   st: 0007 ed: 0020 sp: 0000 cont: 0005
| catch type: redo   st: 0007 ed: 0020 sp: 0000 cont: 0007
|------------------------------------------------------------------------
local table (size: 2, temp: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] i
0000 set_local_val    2, 0                                            (   1)
0003 jump             13                                              (   2)
0005 jump             13
0007 plusi            <callcache>, 2, 2, 1, -1                        (   3)
0013 btlti            7, <callcache>, -1, 2, 30000000, -1             (   2)
0020 local_ret        2, 0                                            (   3)

In this experiment I ignored trace insns (that is another story) and
the complication that an integer compare insn can be re-implemented
as a Ruby method. The insn btlti is a combination of an LT immediate
compare and a branch on true.

A modification of the fib benchmark is sped up by 1.35 times:

def fib_m n
  if n < 1
    1
  else
    fib_m(n-1) * fib_m(n-2)
  end
end

fib_m(40)

The RTL code of fib_m looks like

== disasm: #<ISeq:fib_m@fm.rb>==========================================
local table (size: 2, temp: 3, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] n<Arg>
0000 bflti            10, <callcache>, -1, 2, 1, -1                   (   2)
0007 val_ret          1, 16
0010 minusi           <callcache>, -2, 2, 1, -2                       (   5)
0016 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -1
0020 minusi           <callcache>, -3, 2, 2, -3
0026 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -2
0030 mult             <callcache>, -1, -1, -2, -1
0036 temp_ret         -1, 16

In reality, the improvement for most programs will probably be about
10%. That is because of the very dynamic nature of Ruby (a lot of
calls, checks for redefinition of basic type operations, overflow
checks to switch to GMP numbers). For example, integer addition
cannot take fewer than about 17 x86-64 insns out of the current ~50
insns on the fast path. So even if you make the remaining (33) insns
2 times faster, the improvement will be only about 30%.
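The arithmetic behind this estimate can be checked directly, treating machine-insn counts as a stand-in for time:

```ruby
# Amdahl-style estimate: ~50 x86-64 insns per integer addition on the
# fast path, of which ~17 are irreducible; only the rest can be halved.
total       = 50
irreducible = 17
reducible   = total - irreducible           # 33
new_total   = irreducible + reducible / 2.0 # 17 + 16.5 = 33.5
improvement = (total - new_total) / total
puts "new path length: #{new_total} insns"        # 33.5
puts "improvement: #{(improvement * 100).round}%" # 33%, i.e. roughly 30%
```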

A very important part of MRI performance improvement is making calls
fast, because there are a lot of them in Ruby. But as I read in some
of Koichi Sasada's presentations, he pays a lot of attention to this,
so I don't want to touch it.

  • Thirdly, I want to implement the insns as small inline functions
    for a future AOT compiler, of course only if the projects
    described above are successful. That will permit easy AOT
    generation of C code, which will basically be calls to those
    functions.

    I'd like to implement an AOT compiler which will generate code for
    a Ruby method, call a C compiler to produce a binary shared
    object, and load it into MRI for subsequent calls. The key is to
    minimize the compilation time. There are many approaches to doing
    this, but I don't want to discuss them right now.

    C generation is the easiest and most portable implementation of
    AOT, but in the future it would be possible to use the GCC JIT
    plugin or LLVM IR to decrease the overhead of the C
    scanner/parser.

    The C compiler will see a bigger scope (all of a method's insns)
    for its optimizations. I think using AOT can give another 10%
    improvement. It is not bigger, again because of the dynamic
    nature of Ruby, and because no C compiler is smart enough to
    figure out aliasing in a typical generated C program.

    Life, from the performance point of view, would be easier if Ruby
    did not permit redefining basic operations on basic types,
    e.g. plus for integers. In that case we could evaluate the types
    of operands and results using data-flow analysis and generate
    faster specialized insns. Still, gradual typing, if it is
    introduced in future versions of Ruby, would help generate such
    faster insns.
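A minimal sketch of what such generation could look like (all function names here -- vm_plus, vm_btlti, vm_local, struct frame -- are hypothetical; real MRI insns and call frames are far more involved): each RTL insn maps to a call of its inline C function, so a method body becomes a flat sequence of calls:

```ruby
# Hypothetical AOT step: turn an RTL insn sequence into C source that
# calls one small inline function per insn.
INSN_TO_C = {
  plus:  ->(a) { "vm_plus(frame, #{a.join(', ')});" },
  btlti: ->(a) { "if (vm_btlti(frame, #{a.drop(1).join(', ')})) goto l#{a[0]};" },
  ret:   ->(a) { "return vm_local(frame, #{a[0]});" },
}

def gen_c(name, insns)
  body = insns.map { |op, *args| "  " + INSN_TO_C[op].call(args) }
  (["VALUE #{name}(struct frame *frame) {"] + body + ["}"]).join("\n")
end

src = gen_c("aot_body", [[:plus, 2, 2, 1], [:btlti, 7, 2, 30_000_000], [:ret, 2]])
puts src
```

The resulting C is trivially inlinable, which is exactly what gives the compiler the bigger optimization scope mentioned above.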

Again, I wrote this proposal for discussion, as I don't want to be in
a position of competing with somebody else's ongoing big project.
That might be counterproductive for MRI development. I especially
want to avoid that because the project is big and long, will probably
face a lot of technical obstacles, and has a possibility of failure.

History

#1 [ruby-core:76383] Updated by vmakarov (Vladimir Makarov) 8 months ago

  • Tracker changed from Bug to Feature

#2 [ruby-core:76386] Updated by matz (Yukihiro Matsumoto) 8 months ago

As for superoperators, shyouhei is working on it.

In any case, I'd suggest you take a YARV-style step for a big change like your proposal.
In the early stage of the development of YARV, Koichi created his virtual machine as a C extension.
After he brushed it up to almost complete, we replaced the VM.

I think Koichi would help you.

Matz.

#3 [ruby-core:76396] Updated by vmakarov (Vladimir Makarov) 8 months ago

Yukihiro Matsumoto wrote:

As for superoperators, shyouhei is working on it.

In any way, I'd suggest you take a YARV step for a big change like your proposal.
When the early stage of the development of YARV, Koichi created his virtual machine as a C extension.
After he brushed it up to almost complete, we replaced the VM.

Thank you for the advice. I'll investigate how to implement it as a C extension. Right now I just have a modified, hackish version of compile.c/insns.def from a roughly year-old Ruby version, used to get the RTL code for the two cases I published. After getting some acceptable results, I think I need to start working more systematically, and I would like to have a working prototype without AOT by the end of the year.

I think Koichi would help you.

That would be great.

#4 [ruby-core:76405] Updated by shyouhei (Shyouhei Urabe) 8 months ago

FYI, in the current instruction set there does exist a bias between which instruction tends to follow which. A pre-experimental result linked below shows a clear tendency for a pop to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

#5 [ruby-core:76408] Updated by vmakarov (Vladimir Makarov) 8 months ago

Shyouhei Urabe wrote:

FYI in current instruction set, there do exist bias between which instruction tends to follow which. A preexperimental result linked below shows there is clear tendency that a pop tends to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

Thank you for the link. The results you got are interesting to me. You could add a new insn 'send_and_pop', but I suspect it would give only a tiny performance improvement. Pop is a low-cost insn, especially when it follows a send insn. The only performance improvement would be the savings on the pop insn's dispatch (only 2 x86-64 insns). Still, it would give a visible saving in insn memory.

RTL insns are a better fit for optimizations (RTL is the most frequent IR in optimizing compilers), including combining insns (code selection). But finding frequent combinations is more complicated, because the insns being combined should be dependent, e.g. the result of the first insn is used as an operand of the second (combining independent insns would again save VM insn dispatching and might improve fine-grain parallelism through the compiler's insn scheduler or the out-of-order execution logic of the CPU). It could be interesting research to determine which RTL insns should be combined for particular code. I don't remember any article about this.

#6 [ruby-core:76420] Updated by ko1 (Koichi Sasada) 8 months ago

Hi!

Do you have interest in visiting Japan to discuss with the Japanese Ruby committers?
If you do, I will ask someone to pay your travel fare.

Thanks,
Koichi

#7 [ruby-core:76450] Updated by vmakarov (Vladimir Makarov) 8 months ago

Shyouhei Urabe wrote:

FYI in current instruction set, there do exist bias between which instruction tends to follow which. A preexperimental result linked below shows there is clear tendency that a pop tends to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

Sorry, I realized that I jumped too quickly to a generalization about insn combining and did not give you all the details of how to implement send-and-pop. It needs a flag in the call frame, and the return insn should use it. But IMHO a send-and-pop implementation makes no sense, as the benefit of removing the dispatch machine insns is eaten by the insns dealing with the flag (it also slows down the regular send insn). Also, increasing the size of the call frame means decreasing the maximal recursion depth, although with some tricks you could add the flag (it is just one bit) without increasing the call frame size.

#8 [ruby-core:76475] Updated by naruse (Yui NARUSE) 8 months ago

Secondly, I'd like to combine some frequent insn sequences into
bigger insns. Again it decreases insn dispatch overhead and
memory traffic even more. Also it permits to remove some type checking.

The first thing on my mind is a sequence of a compare insn and a
branch and using immediate operands besides temporary (stack) and
local variables. Also it is not a trivial task for Ruby as the
compare can be implemented as a method.

I tried to unify "a sequence of a compare insn and a branch" as follows, and got a 1.2x speedup:
https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7

If it can be written more simply and cleanly, it's worth merging...

#9 [ruby-core:76481] Updated by vmakarov (Vladimir Makarov) 8 months ago

Yui NARUSE wrote:

Secondly, I'd like to combine some frequent insn sequences into
bigger insns. Again it decreases insn dispatch overhead and
memory traffic even more. Also it permits to remove some type checking.

The first thing on my mind is a sequence of a compare insn and a
branch and using immediate operands besides temporary (stack) and
local variables. Also it is not a trivial task for Ruby as the
compare can be implemented as a method.

I tried to unify "a sequence of a compare insn and a branch" as follows but 1.2x speed up:
https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7

If it can be written more simple and clean, it's worth to merge...

Thank you for the link. Yes, IMHO the code is worth merging. Although RTL insns can potentially give a bigger improvement, their ETA is unknown and even their success is not guaranteed (as I wrote, Ruby has a specific feature -- a lot of calls, and calls require working with parameters in stack order anyway).

Using compare-and-branch is a no-brainer; many modern processors contain such insns. Actually, CPUs can be an inspiring source for which insns to unify: some CPUs have branch-and-increment, madd (multiply-add), etc.
