
Feature #12589

VM performance improvement proposal

Added by vmakarov (Vladimir Makarov) 9 months ago. Updated 16 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:76382]

Description

Hello. I'd like to start a big MRI project, but I don't want to
disrupt somebody else's plans. Therefore I'd like to hear MRI
developers' opinions on the proposed project, or to know whether
somebody is already working on an analogous one.

Basically I want to improve overall MRI VM performance:

  • First of all, I'd like to change the VM insns and move from
    stack-based insns to register transfer (RTL) ones. The idea is to
    decrease VM dispatch overhead, as approximately half as many RTL
    insns are needed as stack-based insns for the same program (for
    Ruby probably even fewer, as a typical Ruby program contains a lot
    of method calls whose arguments are passed through the stack).

    But decreasing memory traffic is an even more important advantage
    of RTL insns, as an RTL insn can address temporaries (stack) and
    local variables in any combination. So there is no need to put an
    insn result on the stack and then move it to a local variable, or
    to put a variable value on the stack and then use it as an insn
    operand. Insns that do more also provide a bigger scope for C
    compiler optimizations.

    The biggest changes will be in the files compile.c and insns.def
    (they will be basically rewritten). So the project is not a new
    VM; the MRI VM is much more than these 2 files.

    The disadvantage of RTL insns is a bigger insn memory footprint
    (up to 30% more), although, as I wrote, there are fewer RTL
    insns.

    Another disadvantage of RTL insns, specifically for Ruby, is that
    the insns for call sequences will be basically the same
    stack-based ones, only bigger, as they address the stack
    explicitly.

  • Secondly, I'd like to combine some frequent insn sequences into
    bigger insns. Again, this decreases insn dispatch overhead and
    memory traffic even more. It also permits removing some type
    checking.

    The first thing on my mind is a sequence of a compare insn and a
    branch, plus using immediate operands besides temporary (stack)
    and local variables. This is not a trivial task for Ruby, as the
    compare can be implemented as a method.
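The dispatch-count argument above can be illustrated with a toy model of the two insn styles. All insn names below are invented for this sketch (they are not the real YARV or RTL encodings): executing `a = b + c` takes four stack insns but a single RTL insn.

```ruby
# Toy models of a stack VM and an RTL (register-transfer) VM executing the
# Ruby statement `a = b + c`. Insn names are made up for illustration.

def run_stack_vm(locals)
  stack = []
  insns = [
    [:getlocal, :b],   # push b
    [:getlocal, :c],   # push c
    [:plus],           # pop two operands, push their sum
    [:setlocal, :a]    # pop the sum into a
  ]
  dispatches = 0
  insns.each do |op, arg|
    dispatches += 1                     # one dispatch per insn
    case op
    when :getlocal then stack.push(locals[arg])
    when :plus     then stack.push(stack.pop + stack.pop)
    when :setlocal then locals[arg] = stack.pop
    end
  end
  dispatches
end

def run_rtl_vm(locals)
  insns = [
    [:plus, :a, :b, :c]  # a = b + c in one insn: operands addressed directly
  ]
  dispatches = 0
  insns.each do |op, dst, src1, src2|
    dispatches += 1
    locals[dst] = locals[src1] + locals[src2] if op == :plus
  end
  dispatches
end

s = { b: 2, c: 3 }
r = { b: 2, c: 3 }
puts "stack VM: #{run_stack_vm(s)} dispatches, a = #{s[:a]}"  # 4 dispatches
puts "RTL VM:   #{run_rtl_vm(r)} dispatches, a = #{r[:a]}"    # 1 dispatch
```

The stack VM also moves every operand through the stack (two pushes, three pops), while the RTL insn reads and writes the locals directly, which is the memory-traffic point above.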

I already did some experiments. RTL insns plus combined insns speed
up the following micro-benchmark by more than 2 times:

i = 0
while i<30_000_000 # benchmark loop 1
  i += 1
end

The generated RTL insns for the benchmark are

== disasm: #<ISeq:<main>@while.rb>======================================
== catch table
| catch type: break  st: 0007 ed: 0020 sp: 0000 cont: 0020
| catch type: next   st: 0007 ed: 0020 sp: 0000 cont: 0005
| catch type: redo   st: 0007 ed: 0020 sp: 0000 cont: 0007
|------------------------------------------------------------------------
local table (size: 2, temp: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] i
0000 set_local_val    2, 0                                            (   1)
0003 jump             13                                              (   2)
0005 jump             13
0007 plusi            <callcache>, 2, 2, 1, -1                        (   3)
0013 btlti            7, <callcache>, -1, 2, 30000000, -1             (   2)
0020 local_ret        2, 0                                            (   3)

In this experiment I ignored trace insns (that is another story) and
the complication that an integer compare insn can be re-implemented
as a Ruby method. The insn btlti is a combination of an immediate
less-than compare and a branch on true.
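The payoff of fusing the compare and the branch, as btlti does in the disasm above, can be sketched with a toy dispatch counter. The insn names are again invented, and the loop bound is scaled down from the benchmark:

```ruby
# Toy version of `i += 1 while i < LIMIT`, comparing a VM with separate
# compare and branch insns against one with a fused compare-and-branch.
# Insn names are made up; real RTL operand encodings differ.

LIMIT = 1_000

def run_unfused
  i = 0
  dispatches = 0
  loop do
    dispatches += 1          # lt insn: t = i < LIMIT
    t = i < LIMIT
    dispatches += 1          # btrue insn: branch on t
    break unless t
    dispatches += 1          # plusi insn: i += 1
    i += 1
  end
  [i, dispatches]
end

def run_fused
  i = 0
  dispatches = 0
  loop do
    dispatches += 1          # btlti insn: compare and branch in one dispatch
    break unless i < LIMIT
    dispatches += 1          # plusi insn: i += 1
    i += 1
  end
  [i, dispatches]
end

i1, d1 = run_unfused
i2, d2 = run_fused
puts "unfused: #{d1} dispatches"   # 3 per iteration plus the exit check
puts "fused:   #{d2} dispatches"   # 2 per iteration plus the exit check
```

Fusing removes one dispatch per loop iteration; for a loop body this small, that is a third of all dispatches.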

A modification of the fib benchmark is sped up 1.35 times:

def fib_m n
  if n < 1
    1
  else
    fib_m(n-1) * fib_m(n-2)
  end
end

fib_m(40)

The RTL code of fib_m looks like

== disasm: #<ISeq:fib_m@fm.rb>==========================================
local table (size: 2, temp: 3, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] n<Arg>
0000 bflti            10, <callcache>, -1, 2, 1, -1                   (   2)
0007 val_ret          1, 16
0010 minusi           <callcache>, -2, 2, 1, -2                       (   5)
0016 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -1
0020 minusi           <callcache>, -3, 2, 2, -3
0026 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -2
0030 mult             <callcache>, -1, -1, -2, -1
0036 temp_ret         -1, 16

In reality, the improvement for most programs will probably be about
10%. That is because of the very dynamic nature of Ruby (a lot of
calls, checks for redefinition of basic type operations, overflow
checks to switch to GMP numbers). For example, integer addition
cannot take fewer than about 17 x86-64 insns, out of the current 50,
on the fast path. So even if you make the remaining 33 insns 2 times
faster, the improvement will be only about 30%.
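The arithmetic behind that estimate is a simple Amdahl's-law calculation on the numbers just given (50 machine insns on the fast path, ~17 of them irreducible):

```ruby
# Amdahl's-law style estimate from the figures above: integer addition takes
# ~50 x86-64 insns on the fast path, of which ~17 cannot be removed.
total       = 50.0
irreducible = 17.0
reducible   = total - irreducible          # the 33 insns we can hope to halve

new_total = irreducible + reducible / 2.0  # 17 + 16.5 = 33.5
speedup   = total / new_total              # ~1.49x
time_cut  = 1.0 - new_total / total        # ~33% less time, i.e. roughly 30%

printf("speedup: %.2fx, time reduction: %.0f%%\n", speedup, time_cut * 100)
```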

A very important part of MRI performance improvement is making calls
fast, because there are a lot of them in Ruby, but as I read in some
of Koichi Sasada's presentations, he pays a lot of attention to this.
So I don't want to touch it.

  • Thirdly, I want to implement the insns as small inline functions
    for a future AOT compiler, of course only if the projects
    described above are successful. This will permit easy AOT
    generation of C code, which will basically be calls to these
    functions.

    I'd like to implement an AOT compiler which will generate code
    for a Ruby method, call a C compiler to produce a binary shared
    object, and load it into MRI for subsequent calls. The key is to
    minimize the compilation time. There are many approaches to doing
    this, but I don't want to discuss them right now.

    Generating C is the easiest and most portable AOT implementation,
    but in the future it would be possible to use the GCC JIT plugin
    or LLVM IR to decrease the overhead of the C scanner/parser.

    The C compiler will see a bigger scope (all of a method's insns)
    for optimizations. I think using AOT can give another 10%
    improvement. It is not that big, again because of the dynamic
    nature of Ruby, and because no C compiler is smart enough to
    figure out aliasing in a typical generated C program.

    Life from the performance point of view would be easy if Ruby did
    not permit redefining basic operations on basic types, e.g. plus
    for integers. In that case we could evaluate the types of
    operands and results using data-flow analysis and generate faster
    specialized insns. Still, gradual typing, if it is introduced in
    future versions of Ruby, would help to generate such faster
    insns.
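The redefinition problem is easy to demonstrate in plain Ruby, and it is exactly why the redefinition checks mentioned above are needed before any type specialization:

```ruby
# Ruby really does allow redefining arithmetic on basic types at run time,
# which is what blocks ahead-of-time type specialization. A deliberately
# wrong Integer#+ for demonstration:
class Integer
  alias_method :orig_plus, :+
  def +(other)
    orig_plus(other).orig_plus(1)  # off-by-one "+", on purpose
  end
end

puts 2 + 2  # => 5 once the redefinition is in effect
```

Any insn specialized on the assumption that `+` on integers is machine addition becomes wrong the moment such a class reopening is executed, so the VM must either check for redefinition on the fast path or deoptimize.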

Again, I wrote this proposal for discussion, as I don't want to be in
a position of competing with somebody else's ongoing big project.
That might be counterproductive for MRI development. I especially
want to avoid it because the project is big and long, will probably
hit a lot of technical obstacles, and could turn out to be a failure.

History

#1 [ruby-core:76383] Updated by vmakarov (Vladimir Makarov) 9 months ago

  • Tracker changed from Bug to Feature

#2 [ruby-core:76386] Updated by matz (Yukihiro Matsumoto) 9 months ago

As for superoperators, shyouhei is working on it.

In any case, I'd suggest you take the YARV path for a big change like your proposal.
In the early stages of YARV development, Koichi created his virtual machine as a C extension.
After he brushed it up to near completion, we replaced the VM.

I think Koichi would help you.

Matz.

#3 [ruby-core:76396] Updated by vmakarov (Vladimir Makarov) 9 months ago

Yukihiro Matsumoto wrote:

As for superoperators, shyouhei is working on it.

In any way, I'd suggest you take a YARV step for a big change like your proposal.
When the early stage of the development of YARV, Koichi created his virtual machine as a C extension.
After he brushed it up to almost complete, we replaced the VM.

Thank you for the advice. I will investigate how to implement it as a C extension. Right now I just have a modified and hackish version of compile.c/insns.def from a roughly year-old Ruby version, used to get the RTL code for the two cases I published. After getting some acceptable results, I think I need to start working more systematically, and I would like to have a working prototype (without AOT) by the end of the year.

I think Koichi would help you.

That would be great.

#4 [ruby-core:76405] Updated by shyouhei (Shyouhei Urabe) 9 months ago

FYI, in the current instruction set there does exist a bias between which instruction tends to follow which. A pre-experimental result linked below shows a clear tendency for a pop to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

#5 [ruby-core:76408] Updated by vmakarov (Vladimir Makarov) 9 months ago

Shyouhei Urabe wrote:

FYI in current instruction set, there do exist bias between which instruction tends to follow which. A preexperimental result linked below shows there is clear tendency that a pop tends to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

Thank you for the link. The results you got are interesting to me. You could add a new insn 'send_and_pop', but I suspect it would give only a tiny performance improvement. Pop is a low-cost insn, especially when it follows a send insn. The only performance gain would be saving the pop insn dispatch (just 2 x86-64 insns). Still, it would give a visible insn memory saving.
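For concreteness, the send-then-pop pattern is easy to observe on stock CRuby with RubyVM::InstructionSequence; the exact insn names (e.g. opt_send_without_block) vary between Ruby versions:

```ruby
# When a method call's result is discarded (here `'abc'.upcase` used as a
# bare statement), the compiler emits a send-family insn followed by a pop.
code = <<~RUBY
  'abc'.upcase
  nil
RUBY

disasm = RubyVM::InstructionSequence.compile(code).disasm
puts disasm
puts "send followed by pop: #{disasm =~ /send/ && disasm.include?('pop') ? 'yes' : 'no'}"
```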

RTL insns are a better fit for optimizations (RTL is the most frequent IR in optimizing compilers), including combining insns (code selection). But finding frequent combinations is more complicated, because the insns being combined should be dependent, e.g. the result of the first insn is used as an operand of the second one (combining independent insns would again save VM insn dispatching, and might improve fine-grain parallelism via the compiler's insn scheduler or the out-of-order execution logic of the CPU). Which RTL insns should be combined for particular code could be an interesting research topic; I don't remember any article about it.

#6 [ruby-core:76420] Updated by ko1 (Koichi Sasada) 9 months ago

Hi!

Are you interested in visiting Japan to have a discussion with the Japanese Ruby committers?
If you are, I will ask someone to pay your travel fare.

Thanks,
Koichi

#7 [ruby-core:76450] Updated by vmakarov (Vladimir Makarov) 9 months ago

Shyouhei Urabe wrote:

FYI in current instruction set, there do exist bias between which instruction tends to follow which. A preexperimental result linked below shows there is clear tendency that a pop tends to follow a send. Not sure how to "fix" this though.

https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e

Sorry, I realize that I jumped too quickly to generalizations about insn combining and did not give you all the details of how to implement send-and-pop. It needs a flag in the call frame, and the return insn should use it. But imho a send-and-pop implementation makes no sense, as the benefit of removing the dispatch machine insns is eaten by the insns dealing with the flag (it also slows down the regular send insn). Also, increasing the size of the call frame means decreasing the maximal recursion depth, although with some tricks you can add the flag (it is just one bit) without increasing the call frame size.
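The flag scheme, and its cost, can be sketched as a toy Ruby model; the insn names and the frame field here are invented for illustration, not MRI code:

```ruby
# Toy sketch of the call-frame-flag design described above: a send_and_pop
# insn sets a one-bit flag in the new frame, and the return insn consults
# it to decide whether to push the result.

Frame = Struct.new(:discard_result)

def vm_return(frames, stack, value)
  frame = frames.pop
  # This flag check is the cost that every return (and every send that
  # must initialize the flag) now pays, even for ordinary calls.
  stack.push(value) unless frame.discard_result
end

def vm_send(frames, stack, discard:, &callee)
  frames.push(Frame.new(discard))
  vm_return(frames, stack, callee.call)
end

stack  = []
frames = []

vm_send(frames, stack, discard: false) { 42 }  # plain send: result kept
vm_send(frames, stack, discard: true)  { 99 }  # send_and_pop: result dropped

p stack  # => [42]
```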

#8 [ruby-core:76475] Updated by naruse (Yui NARUSE) 9 months ago

Secondly, I'd like to combine some frequent insn sequences into
bigger insns. Again it decreases insn dispatch overhead and
memory traffic even more. Also it permits to remove some type checking.

The first thing on my mind is a sequence of a compare insn and a
branch and using immediate operands besides temporary (stack) and
local variables. Also it is not a trivial task for Ruby as the
compare can be implemented as a method.

I tried to unify "a sequence of a compare insn and a branch" as follows and got a 1.2x speedup:
https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7

If it can be written more simply and cleanly, it's worth merging...

#9 [ruby-core:76481] Updated by vmakarov (Vladimir Makarov) 9 months ago

Yui NARUSE wrote:

Secondly, I'd like to combine some frequent insn sequences into
bigger insns. Again it decreases insn dispatch overhead and
memory traffic even more. Also it permits to remove some type checking.

The first thing on my mind is a sequence of a compare insn and a
branch and using immediate operands besides temporary (stack) and
local variables. Also it is not a trivial task for Ruby as the
compare can be implemented as a method.

I tried to unify "a sequence of a compare insn and a branch" as follows but 1.2x speed up:
https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7

If it can be written more simple and clean, it's worth to merge...

Thank you for the link. Yes, imho the code is worth merging. Although RTL insns can potentially give a better improvement, their ETA is not known and even their success is not guaranteed (as I wrote, Ruby has a specific feature: a lot of calls, and calls require working with parameters in stack order anyway).

Using compare-and-branch is a no-brainer: many modern processors contain such insns. Actually, CPUs can be an inspiring source of ideas for which insns to unify. Some CPUs have branch-and-increment, madd (multiply-add), etc.

#10 [ruby-core:80415] Updated by vmakarov (Vladimir Makarov) about 1 month ago

I think the project has reached a state where I can make its current
code public. Most of the infrastructure for RTL insns and JIT has
been implemented.

Although I did a lot of performance experiments to choose the current
approach, I have not focused on performance yet. I wanted to get more
solid performance results before publishing. Unfortunately, I'll have
no time to work on the project until May because of the GCC 7
release, so to get some feedback I decided to publish it earlier. Any
comments, proposals, and questions are welcome.

You can find the code on
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch. Please, read
file README.md about the project first.

The HEAD of the branch
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base (currently
trunk as of January) is, and will always be, the last merge point of
branch rtl_mjit_branch with the trunk. To see all the changes (the
patch is big, more than 20K lines), you can use the following link

https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch

The project is still at a very early stage. I am planning to spend
half of my work time on it for at least a year, and in about a year
I'll decide what to do with the project depending on where it is
going.

#11 [ruby-core:80417] Updated by normalperson (Eric Wong) about 1 month ago

vmakarov@redhat.com wrote:

I think I've reached a state of the project to make its current
code public. Most of the infrastructure for RTL insns and JIT has
been implemented.

Thank you for the update! I was just rereading this thread
last night (or was it today? I can't tell :<). Anyways I will
try to look more deeply at this in a week or two.

#12 [ruby-core:80444] Updated by subtileos (Daniel Ferreira) about 1 month ago

I think I've reached a state of the project to make its current
code public. Most of the infrastructure for RTL insns and JIT has
been implemented.

Hi Vladimir,

Thank you very much for this post.
That README is priceless.
The kind of work you are doing, with such a degree of entry-level
detail, is wonderful.
I believe that ruby-core gets a lot from public posts like yours.
These sorts of posts and PRs are what I sometimes miss when trying to
understand in better detail the whys of doing something one way or
another in the ruby-core implementation.
In the README you explain very well the context around your choices
and the possibilities.
That makes me believe there may be room for collaboration from
someone who is willing to get deeper into the C-level code.
If there is any way I can be helpful, please say so.

Once again, thank you very much, and keep up your excellent
contribution, making this level of detail and conversation available
to the rest of us as much as possible.

Regards,

Daniel

P.S.

I was waiting a little bit to see the amount of reception this post
would have and surprisingly only Eric replied to you.
Why is that?

#13 [ruby-core:80445] Updated by subtileos (Daniel Ferreira) about 1 month ago

Hi Vladimir,

On Tue, Mar 28, 2017 at 4:26 AM, vmakarov@redhat.com wrote:

You can find the code on
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch. Please, read
file README.md about the project first.

Thank you very much for this post.
That README is priceless.
It is wonderful the kind of work you are doing with such a degree of
entry level details.
I believe that ruby core gets a lot from public posts like yours.
This sort of posts and PR's are the ones that I miss sometimes in
order to be able to understand in better detail the why's of doing
something in one way or another in terms of ruby core implementation.

The HEAD of the branch
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base (currently
trunk as of Jan) is and will be always the last merge point of branch
rtl_mjit_branch with the trunk. To see all changes (the patch is big,
more 20K lines), you can use the following link

https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch

What kind of feedback are you hoping to get?
Can I help in any way?
Is the goal to try to compile your branch and get specific
information from the generated Ruby? If so, what kind of information?

The project is still at very early stages. I am planning to spend
half of my work time on it at least for an year. I'll decide what to
do with the project in about year depending on where it is going to.

In the README you explain very well all the surroundings around your
choices and the possibilities.
That makes me believe there may be space for collaboration from
someone that is willing to get deeper into the C level code.
If there is anyway I can be helpful please say so.

Once again thank you very much and keep up with your excellent
contribution making available to the rest of us the same level of
detail and conversation as much as possible.

Regards,

Daniel

P.S.

I was waiting a little bit to see the amount of reception this post
would have and surprisingly only Eric replied to you.
Why is that?

#14 [ruby-core:80488] Updated by vmakarov (Vladimir Makarov) 30 days ago

subtileos (Daniel Ferreira) wrote:

Hi Vladimir,

On Tue, Mar 28, 2017 at 4:26 AM, vmakarov@redhat.com wrote:

You can find the code on
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch. Please, read
file README.md about the project first.

Thank you very much for this post.

You are welcomed.

That README is priceless.
It is wonderful the kind of work you are doing with such a degree of
entry level details.
I believe that ruby core gets a lot from public posts like yours.
This sort of posts and PR's are the ones that I miss sometimes in
order to be able to understand in better detail the why's of doing
something in one way or another in terms of ruby core implementation.

The HEAD of the branch
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base (currently
trunk as of Jan) is and will be always the last merge point of branch
rtl_mjit_branch with the trunk. To see all changes (the patch is big,
more 20K lines), you can use the following link

https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch

What kind of feedback are you looking forward to get?

My approach to JIT is not traditional. I believe that implementing a JIT in MRI should be evolutionary to be successful: the changes should be minimized, while other paths remain open. My second choice would be a specialized JIT with 3-4 times faster compilation speed, like LuaJIT, but that is an order of magnitude bigger project (probably even more) than the current approach, and it surely has a bigger chance of failing in the end. So discussing the current and other approaches would help me better understand how reasonable my current approach is.

Another goal is to avoid duplicated work. My very first post in this thread was to figure out whether somebody was already working on something analogous. I did not get any confirmation that I would be duplicating work, so I went ahead with the project.

For people who work on a Ruby JIT, openly or in secret, posting info about my project should be helpful. At least my investigation of Oracle Graal and IBM OMR was very helpful to me.

Also, I am pretty new to the MRI sources (I started to work on them only about a year ago). I found that MRI lacks documentation and comments; there is no document like GCC Internals which could help a newbie. So I might be doing stupid things that can be done more easily, and I might not be following some implicit source code policies.

Can I help in any way?
Is the goal to try to compile your branch and get specific information
from the generated ruby?
If so what kind of information?

Trying the branch and reporting what you like or don't like would be helpful. It could be anything, e.g. insn names. As I wrote, RTL insns should work for serious Ruby programs; I definitely cannot say the same about the JIT. Still, there is a chance that RTL breaks some code. Also, RTL code might be slower in places, because not all edge cases are implemented with the same level of optimization as the stack code (e.g. multiple assignment), and some Ruby code may simply be a better fit for the stack insns. It would be interesting to see such code.

MJIT is at a very early stage of development. I think it will have a good chance of being successful if I achieve inlining on the path Ruby->C->Ruby in a reasonable compilation time. But even implementing this will not speed up some Ruby code considerably (e.g. floating point benchmarks cannot be sped up without changing the representation of double/VALUE in MRI).

The project is still at very early stages. I am planning to spend
half of my work time on it at least for an year. I'll decide what to
do with the project in about year depending on where it is going to.

In the README you explain very well all the surroundings around your
choices and the possibilities.

I omitted a few other pros and cons of the choices.

That makes me believe there may be space for collaboration from
someone that is willing to get deeper into the C level code.
If there is anyway I can be helpful please say so.

Thank you. I guess that if I get some considerable performance improvement, help could be useful for more or less independent work. But unfortunately I am not at that stage yet. I hope to get the performance improvements I expect in half a year.

Once again thank you very much and keep up with your excellent
contribution making available to the rest of us the same level of
detail and conversation as much as possible.

Thank you for kind words, Daniel and Eric.

I was waiting a little bit to see the amount of reception this post
would have and surprisingly only Eric replied to you.
Why is that?

I think people need some time to evaluate the current state of the project and its prospects; it is not a traditional approach to JIT. At least, that is what I would do myself: there are a lot of details in the new code, and I would spend time reading the sources to understand the approach better. And usually the people concerned are very busy. So it might take a few weeks.

#15 [ruby-core:80495] Updated by vmakarov (Vladimir Makarov) 29 days ago

Sorry, Matthew. I cannot find your message on
https://bugs.ruby-lang.org/issues/12589, so I am sending this message
through email.

On 03/29/2017 04:36 PM, Matthew Gaudet wrote:

Hi Vladimir,

First and foremost, let me join in with others in thanking you for
opening up your experimentation. I suspect that you'd be one of the
'secret' Ruby JITs Chris Seaton was talking about [1]. One more secret
JIT to go :)

Thank you. I would not call it secret; I wrote about it publicly a
couple of times. But it was a quiet development. This is my first
major update about the project.

I believe that implementing JIT in MRI should be more evolutional to
be successful.

[...]

Another thing is to avoid a work duplication.

So far, evolutionary approaches have heavily dominated the work we've
done with Ruby+OMR as well. I also recently wrote an article about what
needs to happen with Ruby+OMR [2]. One thing in that article I want to
call out is my belief that those of us working on JIT compilers for MRI
have many opportunities to share ideas, implementation and features.
My hope is that we can all keep each other in mind when working on
things.

I read your article. It was helpful. And I agree with you about
sharing ideas.

I haven't had a huge amount of time to go through your patches, though
I have gone through some of them. One comment I would make is that it
seems you've got two very separate projects here: One is a re-design of
YARV as an RTL machine, and the other is MJIT, your JIT that takes
advantage of the structure of the RTL instructions. In my opinion, it is
worth considering these two projects separately. My (offhand) guess
would be that I could adapt Ruby+OMR to consume the RTL instructions in
a couple of weeks, and other (secret) JITs may be in a similar place.

Yes, maybe you are right about separating the projects. For me it is
just one project: I don't see MJIT development without RTL. I'll need
program analysis, and RTL is a more adequate approach for this than
stack insns.

Your approach to MJIT certainly seems interesting. I was quite
impressed with the compile times you mentioned -- when I was first
thinking about your approach I had thought they would be quite a bit
higher.

One question I have (and this is largely for the Ruby community to
answer) is about how to measure impacts from JITs on non-performance
metrics. In this case for example, should MJIT's dynamic memory
footprint be computed as the total of the Ruby process and GCC, or
can we maybe ignore the GCC costs -- at a cost to compilation time you
could do the compiles elsewhere, and you have a clear path to
Ahead-of-Time compilation in your mind.

Yes, we should measure memory footprint too when comparing different
JITs.

The MJIT code itself is currently very small, about 40KB. The GCC
code is pretty big, about 20MB (the LLVM library is even bigger), but
the code of multiple running instances of GCC (even a hundred of
them) will share the same 20MB in memory, at least on Linux.

The data created by GCC is more important. GCC is not monstrous. Like
any optimizing compiler, it works in passes (GCC has more than 300 of
them): a pass gets the IR, allocates its data, transforms the IR, and
frees the data. So the peak consumption is not big. I'd say the peak
consumption for a typical ISEQ with its compiled environment would be
about a couple of megabytes.

GCC developers really care about data consumption and compiler speed.

There are some passes (GCSE and RA) which consume a lot of data
(sometimes the data consumption is quadratic in the IR size). Still,
GCC is very tunable, and such behaviour can be avoided with
particular options and parameters. I suspect other JIT
implementations will have an analogous memory footprint for the data
if they do inlining.

My recollection is that one of the reasons rujit was abandoned was
because its memory footprint was considered unacceptable, but, I don't
don't know how that conclusion was drawn.

It would be interesting to know all the reasons why rujit was
abandoned. I suspect it was more than just the data consumption.

You cannot implement a JIT without consuming additional memory. Maybe
for some MRI environments, like Heroku, the additional memory
consumption is critical, and for such environments it might be better
not to use a JIT at all. Still, there are other Ruby environments
where people can spare the memory for faster code.

At least my investigation of Oracle Graal and IBM OMR was very helpful.

Glad we could help. One small note: The project is called Eclipse OMR,
not IBM OMR. While IBM is the largest contributor right now, we're
trying to build a community around the project, and it is run through
the Eclipse foundation.

Thanks for the clarification.

I can also share my findings about Ruby OMR. I found that Ruby OMR is
a single-threaded program, so MRI waits for OMR to produce machine
code, and that hurts performance. I think the compilation should be
done in parallel with code execution in the interpreter, as in Graal
or the JVM.

[1]: https://twitter.com/ChrisGSeaton/status/811303853488488448
[2]:
https://developer.ibm.com/open/2017/03/01/ruby-omr-jit-compiler-whats-next/

#16 [ruby-core:80497] Updated by subtileos (Daniel Ferreira) 29 days ago

Hi Matthew,

https://developer.ibm.com/open/2017/03/01/ruby-omr-jit-compiler-whats-next/

I was reading your article, and I would like to say that what you
present there is just fantastic from my point of view.
Why fantastic? Because having IBM embracing Ruby in that way can only
give Ruby a brilliant future.
We have IBM and Oracle and Heroku and Red Hat, besides Japan (which
also should be better exposed). How many more companies? It is not
just some developers. This is a powerful message for the world
community, and in my opinion Ruby needs to clearly present it to a
wider audience.

This pleases me because I'm totally Ruby-biased (for better and for worse).
(For me, Ruby should be used everywhere, even as a replacement for
JavaScript. Opal needs more emphasis. I just love it.)

Ever since I heard about Ruby 3x3 in Matz's announcement, I clearly
saw it would be a major opportunity for Ruby to stand out from the
crowd: a genius marketing move that, well coordinated, could have a
very important impact on the currently competitive dynamic-language
ecosystem in the coming years.

I want to be part of it and have been trying to find a way to do so.
This is the reason I asked Vladimir what help he could use from me.
I even asked Eric about Ruby 3x3 regarding my symbols thread, which
is not dead.

It is also great that you agree there is much room for collaboration.
I'm a newbie in terms of compilers and JITs and all that jazz, but
I'm willing to dig in, learn as much as possible, and contribute as
best I can.

For me it doesn't matter in which project.
What is important for me is a collaborative environment where we can
communicate and learn things step by step along the way, which seems
to be what you have in mind to offer.

Very glad you are creating the eclipse community.

You ask there what would be the best way to build that community.
I have a suggestion: consider doing it by sharing the discussions
with ruby-core, like Vladimir is doing.
I would have been totally unaware of your current work if not for
this thread (I thought OMR was still closed code).
Anyone who cares about Ruby development subscribes to ruby-core.

I believe I can also help in terms of organisation.
I have clear ideas on how to improve Ruby regarding communication and
documentation.
And I'm very focused on architecture logic, speaking of web
development and DevOps, but on software design as a whole too.
I'm pretty sure I will learn tons working with you and being part of
this endeavour, but I can bring some added value in that regard.

Like Vladimir said, Ruby lacks a way for new people to come on board
easily. When I develop code I always pay a lot of attention to file
organisation, the design patterns being put in place, and the tests
and documentation, so that it is always easy to understand the
architecture and the reasons certain options were chosen.

Ruby 3x3 is for me a big opportunity to look at that problem and try
to put some architecture documents in place.

This implies that, for me, each one of these projects should work
closely with the ruby-core developers. Again, a reason to have OMR
directly linked to the ruby-core issue tracker.

You mention as well that the existence of multiple JIT projects, and
the competition between them, can only bring good things to Ruby itself.
I couldn't agree more. What is important for me is not to let these
projects die.
One of the great things the Ruby community has is its ability to
make each developer feel at home.
Matz was able to build that over time.

Let me hear your thoughts on the matter.
If you are ready to bring me on board I'm ready to step in.

A note in that regard is that, for now, all my contributions will need
to be made outside of work hours.
But in the future maybe I can convince my company to sponsor me.
No promises, as I haven't spoken with them yet.

Regards,

Daniel

P.S.

(This text is pretty much some scattered thoughts, but I will send it
as it is anyway. I have so many things to say that I'm afraid that if
I start structuring the text better it will become too big for anyone
to read.)

P.S.2

Sorry, Vladimir, for replying to Matthew on your thread, but I'm doing
it to emphasise how much I think we should work together on this
matter. (I could have sent a private email, but I think it is much
better this way.)

#17 [ruby-core:80513] Updated by magaudet (Matthew Gaudet) 28 days ago

vmakarov (Vladimir Makarov) wrote:

Sorry, Matthew. I can not find your message on
https://bugs.ruby-lang.org/issues/12589. So I am sending this message
through email.

Very curious! I don't quite know what went wrong... so here I am writing
a reply in Redmine to make sure it shows up for future searchers :)

I read your article. It was helpful. And I agree with you about
sharing ideas.

Glad to hear it. Let me know if there's any feature you'd like to see implemented
that you'd like collaboration on. I've already submitted a patch for one feature
we expect to be useful in the future (https://bugs.ruby-lang.org/issues/13265),
and would be interested in helping to do more if desired.

Yes, maybe you are right about separating the projects. For me it is
just one project. I don't see MJIT development without RTL. I'll need
program analysis, and RTL is a more adequate approach for this than
stack insns.

I totally understand. Especially for you, I can see how RTL feels like almost
a means-to-an-end; I would just encourage you (and others in the Ruby community)
to think of them separately, as if RTL is superior, it would be a shame to lose
that progress if MJIT doesn't satisfy all its goals.

Yes we should measure memory footprint too to compare different JITs.

The MJIT code itself is currently very small, about 40KB. The GCC code
is pretty big, about 20MB (the LLVM library is even bigger), but the
code of multiple running GCC instances (even hundreds of them) will
share the same 20MB in memory, at least on Linux.

The data created in GCC is more important. GCC is not monstrous. Like
any optimizing compiler, it works by passes (in GCC there are more than
300 of them): a pass gets the IR, allocates its data, transforms the
IR, and frees the data. So the peak consumption is not big. I'd say the
peak consumption for a typical ISEQ with its compilation environment
would be about a couple of megabytes.

Kudos to the GCC developers (yourself included). That seems eminently reasonable.

You cannot implement a JIT without consuming additional memory. Maybe
for some MRI environments, like Heroku, the additional memory
consumption is critical, and for such environments it might be better
not to use a JIT at all. Still, there are other Ruby environments where
people can spare the memory for faster code.

Indeed. I spoke at Ruby Kaigi 2016 [1] trying very hard to encourage
thinking about exactly what it is that 3x3 should accomplish, and how
to measure it. As I am sure you are aware, the selection of benchmarks
and benchmarking methodology is key to making sure you actually achieve
your aims.

I can also share my findings about Ruby OMR. I found that Ruby OMR is
a single-threaded program, so MRI waits for OMR to produce machine code
and that hurts performance. I think the compilation should be done in
parallel with code execution in the interpreter, as in Graal or the JVM.

Absolutely agree. It's an item we've opened [2], but just haven't
gotten around to implementing.

#18 [ruby-core:80514] Updated by vmakarov (Vladimir Makarov) 28 days ago

magaudet (Matthew Gaudet) wrote:

You cannot implement a JIT without consuming additional memory. Maybe
for some MRI environments, like Heroku, the additional memory
consumption is critical, and for such environments it might be better
not to use a JIT at all. Still, there are other Ruby environments where
people can spare the memory for faster code.

Indeed. I spoke at Ruby Kaigi 2016 [1] trying very hard to encourage
thinking about exactly what it is that 3x3 should accomplish, and how
to measure it. As I am sure you are aware, the selection of benchmarks
and benchmarking methodology is key to making sure you actually achieve
your aims.

[1]: http://rubykaigi.org/2016/presentations/MattStudies.html

By the way, I did some memory consumption measurements using the size
of the max (peak) resident area for a small Ruby program (about 15
lines) and its sub-processes, on an otherwise free x86-64 machine with
32GB of memory, using (j)ruby --disable-gems. Here are the numbers:

Ruby trunk:                                 6.4MB
RTL:                                        6.5MB
RTL+GCC JIT:                               26.9MB
RTL+LLVM JIT:                              52.1MB
OMR:                                        6.5MB
OMR+JIT:                                   18.0MB
jruby:                                    244.5MB
Graal:                                    771.0MB

These can give a rough idea of what the JIT memory consumption costs are.

The numbers should be taken with a grain of salt. They include all the
code size too. As I wrote, multiple running copies of a program share
the code, and in the case of GCC (cc1) that code is about 20MB (so with
20 GCC instances running on a server, the average size of the max
resident area could be about 7.9MB).

I have no idea what the code sizes of jruby and Graal are, as they use
sub-processes and I know nothing about them.
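[Editorial note: as a rough sketch of this kind of measurement, on Linux a
process's peak resident area can be read from /proc/self/status (the VmHWM
field). This only covers the current process, not sub-processes, so it is
only an approximation of the methodology above; the helper name
peak_rss_kb is made up for illustration.]

```ruby
# Minimal sketch: read this process's peak resident set size on Linux.
# Measures only the current process, not sub-processes, so it only
# approximates the peak-resident-area numbers quoted above.
def peak_rss_kb
  status = File.read("/proc/self/status")
  status[/^VmHWM:\s+(\d+)\s+kB/, 1].to_i
end

puts "peak RSS: #{peak_rss_kb} kB"
```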

#19 [ruby-core:80616] Updated by normalperson (Eric Wong) 20 days ago

vmakarov@redhat.com wrote:

https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch

I've only taken a light look at it; but I think RTL shows
interesting promise. I needed the following patch to remove
"restrict" to compile on Debian stable:

https://80x24.org/spew/20170408231647.8664-1-e@80x24.org/raw

I also noted some rubyspec failures around break/while loops which
might be RTL related (make update-rubyspec && make test-rubyspec):

https://80x24.org/spew/20170408231930.GA11999@starla/

(The Random.urandom can be ignored since you're on an old version)

I haven't tried JIT, yet, as I'm already unhappy with current
Ruby memory usage; but if RTL alone can provide small speed
improvements without significant footprint I can deal with it.

I'm currently running dtas-player with RTL to play music and it
seems fine https://80x24.org/dtas/

Thanks.

Disclaimer: I do not use proprietary software (including JS) or
GUI browsers; so I use "git fetch vnmakarov" and other
normal git commands to fetch your changes after having the following
entry in .git/config:

[remote "vnmakarov"]
    fetch = +refs/heads/*:refs/remotes/vnmakarov/*
    url = https://github.com/vnmakarov/ruby.git

#20 [ruby-core:80620] Updated by vmakarov (Vladimir Makarov) 19 days ago

normalperson (Eric Wong) wrote:

vmakarov@redhat.com wrote:

https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch

I've only taken a light look at it; but I think RTL shows
interesting promise. I needed the following patch to remove
"restrict" to compile on Debian stable:

https://80x24.org/spew/20170408231647.8664-1-e@80x24.org/raw

I also noted some rubyspec failures around break/while loops which
might be RTL related (make update-rubyspec && make test-rubyspec):

https://80x24.org/spew/20170408231930.GA11999@starla/

(The Random.urandom can be ignored since you're on an old version)

Thank you for your feedback, Eric. I'll work on the issues you found.

So far I have spent about 80% of my MRI effort on RTL, but that was probably because of the learning curve. I have not tried RTL on serious Ruby applications yet. On small benchmarks I got from 0% to 100% (for a simple while loop) improvement; I'd say the average improvement could be 10%. MRI has too many calls, on which the majority of time is spent, so the savings from less insn dispatching and memory traffic have a small impact. In some cases RTL can even be worse. For example, o.m(a1, a2, a3) compiles to the following stack insns and RTL insns:

  Stack insns:

    push <o index>
    push <a1 index>
    push <a2 index>
    push <a3 index>
    send <callinfo> <cache>

  RTL insns:

    loc2temp -2, <a1 index>
    loc2temp -3, <a2 index>
    loc2temp -4, <a3 index>
    call_recv <call data>, <o index>, -1

The RTL insns are 18% longer for this example. I am going to investigate the overall length of executed stack insns vs. RTL insns when I resume my work on the project.
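[Editorial note: the stack insn side of such a comparison can be inspected
in stock (stack-based) MRI with the standard RubyVM::InstructionSequence
API; a minimal sketch, where the method names f and m are placeholders:]

```ruby
# Disassemble a 3-argument method call in stock, stack-based MRI.
# The names f, m, o, a1, a2, a3 are placeholders for illustration;
# the output shows the getlocal/send-style stack insns discussed above.
code = <<~RUBY
  def f(o, a1, a2, a3)
    o.m(a1, a2, a3)
  end
RUBY

puts RubyVM::InstructionSequence.compile(code).disasm
```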

I haven't tried JIT, yet, as I'm already unhappy with current
Ruby memory usage; but if RTL alone can provide small speed
improvements without significant footprint I can deal with it.

I believe there would be no additional footprint for RTL insns, or there would be an insignificant increase (1-2%).

The JIT is ready only for small benchmarks right now. My big worry is the use of an exec wrapper when we go from JITted code execution to interpreted code execution to another piece of JITted code, and so on. It might increase stack usage. But I am going to work on removing the exec wrapper usage in some cases.

If you are not happy with the current MRI memory footprint, you will definitely be unhappy with any JIT, because its work will require much more peak memory (at least an order of magnitude more) than the current MRI footprint.

But I think with my approach I can use much less memory and CPU (JITs might require more CPU usage because of the compilations) than jruby or Graal. My JIT will also have no startup delay, which is huge for jruby and Graal. Still, achieving better performance (wall-clock execution time) should be the first priority of my JIT project.

By the way, I forgot to mention that my approach also opens the possibility, in the future, of distributing gems as C code without binaries, which might help gem portability.

I'm currently running dtas-player with RTL to play music and it
seems fine https://80x24.org/dtas/

Great! Thank you for sharing this.

#21 [ruby-core:80626] Updated by normalperson (Eric Wong) 19 days ago

vmakarov@redhat.com wrote:

**stack-based** insns to **register transfer** ones.  The idea behind
it is to decrease VM dispatch overhead as approximately 2 times
less RTL insns are necessary than stack based insns for the same
program (for Ruby it is probably even less as a typical Ruby program
contains a lot of method calls and the arguments are passed through
the stack).

But *decreasing memory traffic* is even more important advantage
of RTL insns as an RTL insn can address temporaries (stack) and
local variables in any combination.  So there is no necessity to
put an insn result on the stack and then move it to a local
variable or put variable value on the stack and then use it as an
insn operand.  Insns doing more also provide a bigger scope for C
compiler optimizations.

One optimization I'd like to add while remaining 100% compatible
with existing code is to add a way to annotate read-only args for
methods (at least those defined in C-API). That will allow
delaying putstring instructions and giving them the same effect
as putobject.

This would require having visibility into the resolved method
at runtime; before putting its args on the stack.

One trivial example would be the following, where
String#start_with? has been annotated(*) with the args being
read-only:

foo.start_with?("/")

Instead of executing the 'putstring "/"' first,
the method "start_with?" is resolved.

If start_with? is String#start_with? with a constant
annotation(*) for the arg(s), the 'putstring "/"'
instruction returns the string w/o resurrecting it,
avoiding the allocation.
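[Editorial note: the allocation being avoided here can be observed in stock
Ruby today: a bare string literal ('putstring') resurrects a new String on
every execution, while a frozen literal is shared. A minimal sketch, with
made-up method names:]

```ruby
# A bare literal is resurrected (newly allocated) on every call ...
def bare_prefix
  "/"
end

# ... while a frozen literal compiles to a single shared frozen String.
def frozen_prefix
  "/".freeze
end

p bare_prefix.equal?(bare_prefix)     # distinct objects
p frozen_prefix.equal?(frozen_prefix) # the same object each time
```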

This would be a more generic way of doing things like
opt_aref_with/opt_aset_with; but without adding more global
redefinition flags.

(*) Defining a method may change from:

rb_define_method(rb_cString, "start_with?", rb_str_start_with, -1);

To something like:

rb_define_method2(rb_cString, "start_with?", rb_str_start_with,
"RO(*prefixes)");

rb_define_method should continue to work as-is for old code; but
having a new rb_define_method2 would also allow us to fix
current inefficiencies in rb_scan_args and rb_get_kwargs.

#22 [ruby-core:80661] Updated by vmakarov (Vladimir Makarov) 16 days ago

normalperson (Eric Wong) wrote:

One optimization I'd like to add while remaining 100% compatible
with existing code is to add a way to annotate read-only args for
methods (at least those defined in C-API). That will allow
delaying putstring instructions and giving them the same effect
as putobject.

Your idea is interesting. I guess the optimization would be very useful and would help the MRI memory system.

I'll think too how to implement it with RTL insns.

I wanted to try new call insns where the call arguments are parameters of the call insn itself, e.g. call2 recv, arg1, arg2, where recv, arg1, and arg2 are location indexes or even values. If this works out from a performance point of view, implementing the optimization would be pretty straightforward.
