Ruby Issue Tracking System: Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596312016-07-18T03:25:37Zvmakarov (Vladimir Makarov)
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Feature</i></li></ul> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596342016-07-18T04:04:14Zmatz (Yukihiro Matsumoto)matz@ruby.or.jp
<ul></ul><p>As for superoperators, shyouhei is working on it.</p>
<p>In any case, I'd suggest you take the YARV approach for a big change like your proposal.<br>
During the early stage of the development of YARV, Koichi created his virtual machine as a C extension.<br>
After he brushed it up until it was almost complete, we replaced the VM.</p>
<p>I think Koichi would help you.</p>
<p>Matz.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596402016-07-18T14:24:35Zvmakarov (Vladimir Makarov)
<ul></ul><p>Yukihiro Matsumoto wrote:</p>
<blockquote>
<p>As for superoperators, shyouhei is working on it.</p>
<p>In any case, I'd suggest you take the YARV approach for a big change like your proposal.<br>
During the early stage of the development of YARV, Koichi created his virtual machine as a C extension.<br>
After he brushed it up until it was almost complete, we replaced the VM.</p>
</blockquote>
<p>Thank you for the advice. I will investigate how to implement it as a C extension. Right now I just have a modified and hackish version of compile.c/insns.def from a year-old Ruby version, used to generate RTL code for the two cases I published. After getting some acceptable results, I think I need to start working more systematically, and I would like to have a working prototype without AOT by the end of the year.</p>
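<p>For readers unfamiliar with the difference being discussed, here is a minimal sketch of how to inspect the stack insns CRuby emits today. <code>RubyVM::InstructionSequence</code> is a real CRuby API; the RTL mnemonics in the trailing comments are purely hypothetical illustrations, not Vladimir's actual instruction names.</p>

```ruby
# Inspect the stack-based bytecode CRuby currently generates.
iseq = RubyVM::InstructionSequence.compile("a = 1; b = 2; a + b")
puts iseq.disasm
# The stack form loads operands separately, roughly:
#   getlocal a; getlocal b; opt_plus
# A hypothetical RTL form could use one three-address insn instead:
#   plus <temp>, <local a>, <local b>
```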
<blockquote>
<p>I think Koichi would help you.</p>
</blockquote>
<p>That would be great.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596462016-07-19T01:17:29Zshyouhei (Shyouhei Urabe)shyouhei@ruby-lang.org
<ul></ul><p>FYI, in the current instruction set there does exist a bias in which instruction tends to follow which. A pre-experimental result linked below shows a clear tendency for a pop to follow a send. Not sure how to "fix" this though.</p>
<p><a href="https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e" class="external">https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e</a></p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596482016-07-19T02:26:56Zvmakarov (Vladimir Makarov)
<ul></ul><p>Shyouhei Urabe wrote:</p>
<blockquote>
<p>FYI, in the current instruction set there does exist a bias in which instruction tends to follow which. A pre-experimental result linked below shows a clear tendency for a pop to follow a send. Not sure how to "fix" this though.</p>
<p><a href="https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e" class="external">https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e</a></p>
</blockquote>
<p>Thank you for the link. The results you got are interesting to me. You could add a new insn 'send_and_pop', but I suspect it would give only a tiny performance improvement. Pop is a low-cost insn, especially when it follows a send insn. The only gain would be saving the pop insn dispatch (just 2 x86-64 insns). Still, it would give a visible saving in insn memory.</p>
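<p>Shyouhei's bigram measurement can be reproduced roughly like this. <code>RubyVM::InstructionSequence#to_a</code> is a real CRuby API; the sample code string and the simple filtering are just an illustration, not the methodology of the linked gist.</p>

```ruby
# Count which opcode tends to follow which (opcode bigrams).
body  = RubyVM::InstructionSequence.compile("puts 1; puts 2").to_a.last
ops   = body.select { |e| e.is_a?(Array) }.map(&:first) # keep insns only
pairs = ops.each_cons(2).tally                          # bigram counts
```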
<p>RTL insns are a better fit for optimizations (RTL is the most common IR in optimizing compilers), including combining insns (code selection), but finding frequent combinations is more complicated because the insns being combined should be dependent, e.g. the result of the first insn is used as an operand of the second (combining independent insns will still save VM insn dispatching and may improve fine-grain parallelism through the compiler insn scheduler or the out-of-order execution logic of the CPU). It could be interesting research to find which RTL insns should be combined for particular code. I don't remember any article about this.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596612016-07-19T07:12:16Zko1 (Koichi Sasada)
<ul></ul><p>Hi!</p>
<p>Do you have interest in visiting Japan to talk with Japanese Ruby committers?<br>
If you are interested, I will ask someone to pay your travel fare.</p>
<p>Thanks,<br>
Koichi</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=596952016-07-19T14:18:09Zvmakarov (Vladimir Makarov)
<ul></ul><p>Shyouhei Urabe wrote:</p>
<blockquote>
<p>FYI, in the current instruction set there does exist a bias in which instruction tends to follow which. A pre-experimental result linked below shows a clear tendency for a pop to follow a send. Not sure how to "fix" this though.</p>
<p><a href="https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e" class="external">https://gist.github.com/anonymous/7ce9cb03b5bc6cfe6f96ec6c4940602e</a></p>
</blockquote>
<p>Sorry, I realized that I quickly jump to generalization about insn combining and did not write you all details how to implement send-and-pop. It needs a flag in a call frame and the return insn should use it. But imho, send-and-pop implementation has no sense as benefit of removing dispatch machine insns is eaten by insns dealing with the flag (it also slows down the regular send insn). Also increasing size of the call frame means decreasing maximal recursion depth although with some tricks you can add the flag (it is just one bit) w/o increasing call frame size.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=597192016-07-20T11:22:59Znaruse (Yui NARUSE)naruse@airemix.jp
<ul></ul><blockquote>
<p>Secondly, I'd like to combine some frequent insn sequences into<br>
bigger insns. Again, this decreases insn dispatch overhead and<br>
memory traffic even more. It also permits removing some type checking.</p>
<p>The first thing on my mind is a sequence of a compare insn and a<br>
branch, and using immediate operands besides temporary (stack) and<br>
local variables. Also, it is not a trivial task for Ruby, as the<br>
compare can be implemented as a method.</p>
</blockquote>
<p>I tried to unify "a sequence of a compare insn and a branch" as follows and got a 1.2x speedup:<br>
<a href="https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7" class="external">https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7</a></p>
<p>If it can be written more simply and cleanly, it's worth merging...</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=597212016-07-20T23:24:08Zvmakarov (Vladimir Makarov)
<ul></ul><p>Yui NARUSE wrote:</p>
<blockquote>
<blockquote>
<p>Secondly, I'd like to combine some frequent insn sequences into<br>
bigger insns. Again, this decreases insn dispatch overhead and<br>
memory traffic even more. It also permits removing some type checking.</p>
<p>The first thing on my mind is a sequence of a compare insn and a<br>
branch, and using immediate operands besides temporary (stack) and<br>
local variables. Also, it is not a trivial task for Ruby, as the<br>
compare can be implemented as a method.</p>
</blockquote>
<p>I tried to unify "a sequence of a compare insn and a branch" as follows and got a 1.2x speedup:<br>
<a href="https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7" class="external">https://github.com/nurse/ruby/commit/a0e8fe14652dbc0a9b830fe84c5db85378accfb7</a></p>
<p>If it can be written more simply and cleanly, it's worth merging...</p>
</blockquote>
<p>Thank you for the link. Yes, IMHO the code is worth merging. Although RTL insns can potentially give a bigger improvement, the ETA is unknown and even their success is not guaranteed (as I wrote, Ruby has a specific feature -- a lot of calls, and calls require working with parameters in stack order anyway).</p>
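<p>The stack sequence being unified can be seen directly in the disassembly. This is a sketch using the real <code>RubyVM::InstructionSequence</code> API; the sample code string is only illustrative.</p>

```ruby
# Show the compare-and-branch pair in today's stack code.
src = "a = 1; b = 2; if a < b then a end"
asm = RubyVM::InstructionSequence.compile(src).disasm
puts asm # opt_lt is immediately followed by branchunless
```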
<p>Using compare-and-branch is a no-brainer. Many modern processors contain such insns. Actually, CPUs can be an inspiring source for which insns to unify. Some CPUs have branch-and-increment, madd (multiply-add), etc.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639102017-03-28T03:26:17Zvmakarov (Vladimir Makarov)
<ul></ul><p>I think I've reached a state of the project where I can make its<br>
current code public. Most of the infrastructure for RTL insns and JIT<br>
has been implemented.</p>
<p>Although I did a lot of performance experiments to choose the<br>
current approach for the project, I have not focused on performance<br>
yet. I wanted to get more solid performance results before publishing<br>
it. Unfortunately, I'll have no time to work on the project until<br>
May because of the GCC 7 release. So to get some feedback I decided to<br>
publish it earlier. Any comments, proposals, and questions are<br>
welcome.</p>
<p>You can find the code on<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch</a>. Please, read<br>
file README.md about the project first.</p>
<p>The HEAD of the branch<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base</a> (currently<br>
trunk as of January) is, and will always be, the last merge point of branch<br>
rtl_mjit_branch with the trunk. To see all changes (the patch is big,<br>
more than 20K lines), you can use the following link:</p>
<p><a href="https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch</a></p>
<p>The project is still at a very early stage. I am planning to spend<br>
half of my work time on it for at least a year. I'll decide what to<br>
do with the project in about a year, depending on where it is going.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639112017-03-28T04:21:40Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>I think I've reached a state of the project where I can make its<br>
current code public. Most of the infrastructure for RTL insns and JIT<br>
has been implemented.</p>
</blockquote>
<p>Thank you for the update! I was just rereading this thread<br>
last night (or was it today? I can't tell :<). Anyways I will<br>
try to look more deeply at this in a week or two.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639352017-03-29T04:12:05Zsubtileos (Daniel Ferreira)
<ul></ul><blockquote>
<p>I think I've reached a state of the project where I can make its<br>
current code public. Most of the infrastructure for RTL insns and JIT<br>
has been implemented.</p>
</blockquote>
<p>Hi Vladimir,</p>
<p>Thank you very much for this post.<br>
That README is priceless.<br>
The kind of work you are doing, with such a degree of<br>
entry-level detail, is wonderful.<br>
I believe that ruby core gains a lot from public posts like yours.<br>
These sorts of posts and PRs are what I sometimes miss when<br>
trying to understand in better detail the whys of doing<br>
something one way or another in the ruby core implementation.<br>
In the README you explain very well the context around your<br>
choices and the possibilities.<br>
That makes me believe there may be room for collaboration from<br>
someone who is willing to get deeper into the C-level code.<br>
If there is any way I can be helpful, please say so.</p>
<p>Once again, thank you very much, and keep up your excellent<br>
contribution, making available to the rest of us the same level of<br>
detail and conversation as much as possible.</p>
<p>Regards,</p>
<p>Daniel</p>
<p>P.S.</p>
<p>I waited a little to see how much reception this post<br>
would get, and surprisingly only Eric replied to you.<br>
Why is that?</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639362017-03-29T04:32:08Zsubtileos (Daniel Ferreira)
<ul></ul><p>Hi Vladimir,</p>
<p>On Tue, Mar 28, 2017 at 4:26 AM, <a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>You can find the code on<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch</a>. Please, read<br>
file README.md about the project first.</p>
</blockquote>
<p>Thank you very much for this post.<br>
That README is priceless.<br>
The kind of work you are doing, with such a degree of<br>
entry-level detail, is wonderful.<br>
I believe that ruby core gains a lot from public posts like yours.<br>
These sorts of posts and PRs are what I sometimes miss when<br>
trying to understand in better detail the whys of doing<br>
something one way or another in the ruby core implementation.</p>
<blockquote>
<p>The HEAD of the branch<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base</a> (currently<br>
trunk as of January) is, and will always be, the last merge point of branch<br>
rtl_mjit_branch with the trunk. To see all changes (the patch is big,<br>
more than 20K lines), you can use the following link:</p>
<p><a href="https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch</a></p>
</blockquote>
<p>What kind of feedback are you looking to get?<br>
Can I help in any way?<br>
Is the goal to try to compile your branch and collect specific information<br>
from the generated ruby?<br>
If so, what kind of information?</p>
<blockquote>
<p>The project is still at a very early stage. I am planning to spend<br>
half of my work time on it for at least a year. I'll decide what to<br>
do with the project in about a year, depending on where it is going.</p>
</blockquote>
<p>In the README you explain very well the context around your<br>
choices and the possibilities.<br>
That makes me believe there may be room for collaboration from<br>
someone who is willing to get deeper into the C-level code.<br>
If there is any way I can be helpful, please say so.</p>
<p>Once again, thank you very much, and keep up your excellent<br>
contribution, making available to the rest of us the same level of<br>
detail and conversation as much as possible.</p>
<p>Regards,</p>
<p>Daniel</p>
<p>P.S.</p>
<p>I waited a little to see how much reception this post<br>
would get, and surprisingly only Eric replied to you.<br>
Why is that?</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639892017-03-29T17:06:32Zvmakarov (Vladimir Makarov)
<ul></ul><p>subtileos (Daniel Ferreira) wrote:</p>
<blockquote>
<p>Hi Vladimir,</p>
<p>On Tue, Mar 28, 2017 at 4:26 AM, <a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>You can find the code on<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch</a>. Please, read<br>
file README.md about the project first.</p>
</blockquote>
<p>Thank you very much for this post.</p>
</blockquote>
<p>You are welcomed.</p>
<blockquote>
<p>That README is priceless.<br>
The kind of work you are doing, with such a degree of<br>
entry-level detail, is wonderful.<br>
I believe that ruby core gains a lot from public posts like yours.<br>
These sorts of posts and PRs are what I sometimes miss when<br>
trying to understand in better detail the whys of doing<br>
something one way or another in the ruby core implementation.</p>
<blockquote>
<p>The HEAD of the branch<br>
<a href="https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base" class="external">https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch_base</a> (currently<br>
trunk as of January) is, and will always be, the last merge point of branch<br>
rtl_mjit_branch with the trunk. To see all changes (the patch is big,<br>
more than 20K lines), you can use the following link:</p>
<p><a href="https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch</a></p>
</blockquote>
<p>What kind of feedback are you looking to get?</p>
</blockquote>
<p>My approach to JIT is not traditional. I believe that implementing a JIT in MRI should be more evolutionary to be successful. The changes should be minimized, but other paths should still stay open. My second choice would be a specialized JIT with 3-4x faster compilation speed, like LuaJIT, but that is an order of magnitude bigger project (probably even more) than the current approach and for sure has a bigger chance of failing in the end. So discussion of the current and other approaches would help me better understand how reasonable my current approach is.</p>
<p>Another thing is avoiding work duplication. My very first post in this thread was to figure out whether somebody was already working on something analogous. I did not get any confirmation that I was doing duplicative work, so I went ahead with the project.</p>
<p>For people who work on a Ruby JIT, openly or in secret, posting info about my project might be helpful. At least my investigation of Oracle Graal and IBM OMR was very helpful.</p>
<p>Also, I am pretty new to the MRI sources (I started working on them just about a year ago). I found that MRI lacks documentation and comments. There is no document like the GCC internals manual that could help a newbie. So I might be doing stupid things that could be done more easily, and I might not be following some implicit source-code policies.</p>
<blockquote>
<p>Can I help in any way?<br>
Is the goal to try to compile your branch and collect specific information<br>
from the generated ruby?<br>
If so, what kind of information?</p>
</blockquote>
<p>Trying the branch and telling me what you like or don't like would be helpful. It could be anything, e.g. insn names. As I wrote, the RTL insns should work for serious Ruby programs. I definitely cannot say the same about the JIT. Still, there is a chance that RTL breaks some code. Also, RTL code might be slower because not all edge cases are implemented with the same level of optimization as the stack code (e.g. multiple assignment), and some Ruby code may simply be a better fit for the stack insns. It would be interesting to see such code.</p>
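<p>As a sketch of one such edge case: multiple assignment from a method call goes through extra stack-shuffling insns in today's bytecode. <code>RubyVM::InstructionSequence</code> is a real CRuby API; the method name <code>foo</code> below is a placeholder, not anything from the branch.</p>

```ruby
# Multiple assignment compiles through expandarray in stack bytecode.
asm = RubyVM::InstructionSequence.compile("a, b = foo").disasm
puts asm # look for expandarray splitting foo's result
```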
<p>MJIT is at a very early stage of development. I think it will have a big chance of being successful if I achieve inlining on the path <code>Ruby->C->Ruby</code> in a reasonable compilation time. But even implementing this will not speed up some Ruby code considerably (e.g. floating point benchmarks cannot be sped up without changing the representation of double/VALUE in MRI).</p>
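<p>A sketch of the kind of floating-point loop meant here: every step works through Ruby's tagged <code>VALUE</code> representation of <code>Float</code>, which is what limits JIT speedups without changing that representation. The numbers and loop bound are arbitrary.</p>

```ruby
# A float-heavy loop: each iteration does Float math via VALUEs.
sum = 0.0
1_000.times { |i| sum += i * 0.5 }
sum # => 249750.0
```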
<blockquote>
<blockquote>
<p>The project is still at a very early stage. I am planning to spend<br>
half of my work time on it for at least a year. I'll decide what to<br>
do with the project in about a year, depending on where it is going.</p>
</blockquote>
<p>In the README you explain very well the context around your<br>
choices and the possibilities.</p>
</blockquote>
<p>I omitted a few other pros and cons of the choices.</p>
<blockquote>
<p>That makes me believe there may be room for collaboration from<br>
someone who is willing to get deeper into the C-level code.<br>
If there is any way I can be helpful, please say so.</p>
</blockquote>
<p>Thank you. I guess if I get some considerable performance improvement, some help would be useful for more or less independent work. But unfortunately, I am not at that stage yet. I hope to get the performance improvements I expect within half a year.</p>
<blockquote>
<p>Once again, thank you very much, and keep up your excellent<br>
contribution, making available to the rest of us the same level of<br>
detail and conversation as much as possible.</p>
</blockquote>
<p>Thank you for kind words, Daniel and Eric.</p>
<blockquote>
<p>I waited a little to see how much reception this post<br>
would get, and surprisingly only Eric replied to you.<br>
Why is that?</p>
</blockquote>
<p>I think people need some time to evaluate the current state of the project and its perspectives. It is not a traditional approach to JIT. This is at least what I would do myself: there are a lot of details in the new code, and I would spend time reading the sources to understand the approach better. And usually the people concerned are very busy. So it might take a few weeks.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639952017-03-30T03:12:13Zvmakarov (Vladimir Makarov)
<ul></ul><p>Sorry, Matthew. I cannot find your message on<br>
<a href="https://bugs.ruby-lang.org/issues/12589" class="external">https://bugs.ruby-lang.org/issues/12589</a>. So I am sending this message<br>
through email.</p>
<p>On 03/29/2017 04:36 PM, Matthew Gaudet wrote:</p>
<blockquote>
<p>Hi Vladimir,</p>
<p>First and foremost, let me join in with others in thanking you for<br>
opening up your experimentation. I suspect that you'd be one of the<br>
'secret' Ruby JITs Chris Seaton was talking about <a href="https://twitter.com/ChrisGSeaton/status/811303853488488448" class="external">1</a>. One more secret<br>
JIT to go :)</p>
</blockquote>
<p>Thank you. I would not call it a secret. I wrote about it a couple of times<br>
publicly. But it was quiet development. This is my first major update<br>
about the project.</p>
<blockquote>
<blockquote>
<p>I believe that implementing a JIT in MRI should be more evolutionary to<br>
be successful.</p>
<p>[...]</p>
<p>Another thing is avoiding work duplication.</p>
</blockquote>
<p>So far, evolutionary approaches have heavily dominated the work we've<br>
done with Ruby+OMR as well. I also recently wrote an article about what<br>
needs to happen with Ruby+OMR <a href="https://developer.ibm.com/open/2017/03/01/ruby-omr-jit-compiler-whats-next/" class="external">2</a>. One thing in that article I want to<br>
call out is my belief that those of us working on JIT compilers for MRI<br>
have many opportunities to share ideas, implementation and features.<br>
My hope is that we can all keep each other in mind when working on<br>
things.</p>
</blockquote>
<p>I read your article. It was helpful. And I agree with you about<br>
sharing ideas.</p>
<blockquote>
<p>I haven't had a huge amount of time to go through your patches, though,<br>
I have gone through some of it. One comment I would make is that it<br>
seems you've got two very separate projects here: One is a re-design of<br>
YARV as an RTL machine, and the other is MJIT, your JIT that takes<br>
advantage of the structure of the RTL instructions. In my opinion, it is<br>
worth considering these two projects separately. My (offhand) guess<br>
would be that I could adapt Ruby+OMR to consume the RTL instructions in<br>
a couple of weeks, and other (secret) JITs may be in a similar place.</p>
</blockquote>
<p>Yes, maybe you are right about separating the projects. For me it is<br>
just one project. I don't see MJIT development without RTL. I'll need<br>
program analysis, and RTL is a more adequate approach for this than the<br>
stack insns.</p>
<blockquote>
<p>Your approach to MJIT certainly seems interesting. I was quite<br>
impressed with the compile times you mentioned -- when I was first<br>
thinking about your approach I had thought they would be quite a bit<br>
higher.</p>
<p>One question I have (and this is largely for the Ruby community to<br>
answer) is about how to measure impacts from JITs on non-performance<br>
metrics. In this case for example, should MJIT's dynamic memory<br>
footprint be computed as the total of the Ruby process and GCC, or<br>
can we maybe ignore the GCC costs -- at a cost to compilation time you<br>
could do the compiles elsewhere, and you have a clear path to<br>
Ahead-of-Time compilation in your mind.</p>
</blockquote>
<p>Yes, we should also measure memory footprint to compare different JITs.</p>
<p>MJIT's own code is currently very small, about 40KB. GCC's code is<br>
pretty big, about 20MB (the LLVM library is even bigger), but the code of<br>
multiple running instances of GCC (even a hundred of them) will share the<br>
same 20MB in memory, at least on Linux.</p>
<p>The data created by GCC is more important. GCC is not monstrous. Like any<br>
optimizing compiler, it works in passes (GCC has more than 300 of<br>
them): a pass gets the IR, allocates the pass data, transforms the IR, and<br>
frees the data. So the peak consumption is not big. I'd say the peak<br>
consumption for a typical ISEQ with the compiled environment would be<br>
about a couple of megabytes.</p>
<p>GCC developers really care about data consumption and compiler speed.<br>
There are some passes (GCSE and RA) that consume a lot of data<br>
(sometimes the data consumption is quadratic in the IR size). Still, GCC is<br>
very tunable, and such behaviour can be avoided with particular options<br>
and parameters. I suspect other JIT implementations will have an<br>
analogous memory footprint for their data if they do inlining.</p>
<blockquote>
<p>My recollection is that one of the reasons rujit was abandoned was<br>
because its memory footprint was considered unacceptable, but I don't<br>
know how that conclusion was drawn.</p>
</blockquote>
<p>It would be interesting to know all the reasons why rujit was abandoned. I<br>
suspect it was more than just the data consumption.</p>
<p>You cannot implement a JIT without consuming additional memory. Maybe<br>
for some MRI environments, like Heroku, the additional memory consumption<br>
is critical, and for such environments it might be better not to use a<br>
JIT at all. Still, there are other Ruby environments where people can<br>
accept extra memory consumption in exchange for faster code.</p>
<blockquote>
<blockquote>
<p>At least my investigation of Oracle Graal and IBM OMR was very helpful.</p>
</blockquote>
<p>Glad we could help. One small note: The project is called Eclipse OMR,<br>
not IBM OMR. While IBM is the largest contributor right now, we're<br>
trying to build a community around the project, and it is run through<br>
the Eclipse foundation.</p>
</blockquote>
<p>Thanks for the clarification.</p>
<p>I can also share my finding about Ruby OMR. I found that Ruby OMR is<br>
a single-threaded program, so MRI waits for OMR to produce machine code,<br>
and that hurts performance. I think the compilation should be done in<br>
parallel with code execution in the interpreter, as in Graal or the JVM.</p>
Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=639982017-03-30T04:41:30Zsubtileos (Daniel Ferreira)
<ul></ul><p>Hi Matthew,</p>
<blockquote>
<p><a href="https://developer.ibm.com/open/2017/03/01/ruby-omr-jit-compiler-whats-next/" class="external">https://developer.ibm.com/open/2017/03/01/ruby-omr-jit-compiler-whats-next/</a></p>
</blockquote>
<p>I was reading your article, and I would like to say that what you<br>
present there is just fantastic from my point of view.<br>
Why fantastic? Because having IBM embrace Ruby in that way can only<br>
give Ruby a brilliant future.<br>
We have IBM and Oracle and Heroku and Red Hat. How many more companies,<br>
besides those in Japan (which should also be better exposed)? It is not just<br>
some developers. This is a powerful message for the world community,<br>
and in my opinion Ruby needs to clearly present it to a wider<br>
audience.</p>
<p>This pleases me because I'm totally Ruby-biased (for better and for worse).<br>
(For me, Ruby should be used everywhere, even as a replacement for<br>
JavaScript. Opal needs more emphasis. I just love it.)</p>
<p>Ever since I heard about Ruby 3x3 in Matz's announcement, I clearly<br>
saw it would be a major opportunity for Ruby to stand out from the<br>
crowd. A genius marketing move that, well coordinated, could have a very<br>
important impact in the coming years on the competitive ecosystem of<br>
dynamic languages.</p>
<p>I want to be part of it and have been trying to find a way to do that.<br>
This is the reason I asked Vladimir what help he could use from me.<br>
I even asked Eric about Ruby 3x3 regarding my symbols thread, which<br>
is not dead.</p>
<p>It is also great that you agree there is much room for collaboration.<br>
I'm a newbie in terms of compilers and JITs and all that jazz, but I'm<br>
willing to dig in, learn as much as possible, and contribute as best<br>
I can.</p>
<p>For me it doesn't matter on which project.<br>
What is important to me is a collaborative environment where we can<br>
communicate and learn things step by step along the way, which<br>
seems to be what you have in mind to offer.</p>
<p>Very glad you are creating the eclipse community.</p>
<p>You ask there what would be the best way to build that community.<br>
I have a suggestion: consider doing it by sharing the discussions with<br>
ruby-core, like Vladimir is doing.<br>
I would have been totally unaware of your current work if not for this thread<br>
(I thought OMR was still closed code).<br>
Anyone who cares about Ruby development subscribes to ruby-core.</p>
<p>I believe I can also help in terms of organisation.<br>
I have clear ideas on how to improve Ruby regarding communication and<br>
documentation.<br>
And I'm very focused on architecture, speaking of web<br>
development and DevOps, but on software design as a whole.<br>
I'm pretty sure I will learn tons working with you and being part of<br>
this endeavour, but I can bring some added value in that regard.</p>
<p>Like Vladimir said, Ruby lacks a way for new people to come on board<br>
easily. When I develop code I always put a lot of emphasis on the file<br>
organisation and the design patterns being put in place, the tests,<br>
and the documentation, so that it is always easy to understand the<br>
architecture and the reasons certain choices have been made.</p>
<p>Ruby 3x3 is for me a big opportunity to look at that problem and try<br>
to put some architecture documents in place.</p>
<p>This implies that, for me, each one of these projects should work<br>
closely with the ruby core developers. Again, a reason to have OMR<br>
directly linked to the ruby core issue tracker.</p>
<p>You mention as well that the existence of multiple JIT projects, and<br>
the competition between them, can only bring good things to Ruby itself.<br>
Couldn't agree more. What is important to me is not letting these projects die.<br>
One of the great things the Ruby community has is the ability to<br>
make each developer feel at home.<br>
Matz was able to build that over time.</p>
<p>Let me hear your thoughts on the matter.<br>
If you are ready to bring me on board I'm ready to step in.</p>
<p>A note in that regard: all my contributions for now will need to<br>
be out of work hours.<br>
But in the future maybe I can convince my company to sponsor me.<br>
No promises, as I haven't spoken with them yet.</p>
<p>Regards,</p>
<p>Daniel</p>
<p>P.S.</p>
<p>(This text is pretty much scattered thoughts, but I will send it<br>
as it is anyway. I have so many things to say that I'm afraid that if I<br>
start to structure the text better it will become too big for anyone to<br>
read.)</p>
<p>P.S.2</p>
<p>Sorry, Vladimir, for replying to Matthew on your thread. But I'm doing<br>
it to emphasise how much I think we should work together on this<br>
matter. (I could have sent a private email, but I think it is much better<br>
this way.)</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=640162017-03-31T17:27:51Zmagaudet (Matthew Gaudet)
<ul></ul><p>vmakarov (Vladimir Makarov) wrote:</p>
<blockquote>
<p>Sorry, Matthew. I can not find your message on<br>
<a href="https://bugs.ruby-lang.org/issues/12589" class="external">https://bugs.ruby-lang.org/issues/12589</a>. So I am sending this message<br>
through email.</p>
</blockquote>
<p>Very curious! I don't quite know what went wrong... so here I am writing<br>
a reply in Redmine to make sure it shows up for future searchers :)</p>
<blockquote>
<p>I read your article. It was helpful. And I agree with you about<br>
sharing the ideas.</p>
</blockquote>
<p>Glad to hear it. Let me know if there's any feature you'd like to see implemented<br>
and would like collaboration on. I've already submitted a patch for one feature<br>
we expect to be useful in the future (<a href="https://bugs.ruby-lang.org/issues/13265" class="external">https://bugs.ruby-lang.org/issues/13265</a>),<br>
and would be interested in helping to do more if desired.</p>
<blockquote>
<p>Yes, maybe you are right about separating the projects. For me it is<br>
just one project. I don't see MJIT development without RTL. I'll need<br>
program analysis, and RTL is a more adequate approach for this than stack<br>
insns.</p>
</blockquote>
<p>I totally understand. Especially for you, I can see how RTL feels almost<br>
like a means to an end; I would just encourage you (and others in the Ruby community)<br>
to think of them separately: if RTL is superior, it would be a shame to lose<br>
that progress if MJIT doesn't satisfy all its goals.</p>
<blockquote>
<p>Yes we should measure memory footprint too to compare different JITs.</p>
<p>MJIT code itself is currently very small, about 40KB. GCC code is<br>
pretty big, about 20MB (the LLVM library is even bigger), but the code of multiple<br>
running instances of GCC (even hundreds of them) will occupy the same 20MB<br>
in memory, at least on Linux.</p>
<p>The data created in GCC is more important. GCC is not monstrous. As any<br>
optimizing compiler, it works by passes (in GCC there are more than 300 of<br>
them): a pass gets IR, allocates the pass data, transforms IR, and frees<br>
the data. So the peak consumption is not big. I'd say the peak<br>
consumption for a typical ISEQ with the compiled environment would be<br>
about a couple of megabytes.</p>
</blockquote>
<p>Kudos to the GCC developers (yourself included). That seems eminently reasonable.</p>
<blockquote>
<p>You cannot implement a JIT without consuming additional memory. Maybe<br>
for some MRI environments like Heroku the additional memory consumption<br>
is critical, and for such environments it might be better not to use a<br>
JIT at all. Still, there are other Ruby environments where people can<br>
spare the memory consumption for faster code.</p>
</blockquote>
<p>Indeed. I spoke at Ruby Kaigi 2016 <a href="http://rubykaigi.org/2016/presentations/MattStudies.html" class="external">1</a> trying very hard to encourage thinking<br>
about exactly what it is that 3x3 should accomplish, and how to measure it. As<br>
I am sure you are aware, the selection of benchmark and benchmarking methodology<br>
is key to making sure you actually achieve your aims.</p>
<blockquote>
<p>I can also share my findings about Ruby OMR. I found that Ruby OMR is<br>
a single-threaded program, so MRI waits for OMR to produce machine code, and that<br>
hurts performance. I think the compilation should be done in<br>
parallel with code execution in the interpreter, as in Graal or the JVM.</p>
</blockquote>
<p>Absolutely agree. It's an item we've opened <a href="https://github.com/rubyomr-preview/ruby/issues/30" class="external">2</a>, but just haven't gotten around<br>
to implementing.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=640172017-03-31T19:14:20Zvmakarov (Vladimir Makarov)
<ul></ul><p>magaudet (Matthew Gaudet) wrote:</p>
<blockquote>
<blockquote>
<p>You cannot implement a JIT without consuming additional memory. Maybe<br>
for some MRI environments like Heroku the additional memory consumption<br>
is critical, and for such environments it might be better not to use a<br>
JIT at all. Still, there are other Ruby environments where people can<br>
spare the memory consumption for faster code.</p>
</blockquote>
<p>Indeed. I spoke at Ruby Kaigi 2016 <a href="http://rubykaigi.org/2016/presentations/MattStudies.html" class="external">1</a> trying very hard to encourage thinking<br>
about exactly what it is that 3x3 should accomplish, and how to measure it. As<br>
I am sure you are aware, the selection of benchmark and benchmarking methodology<br>
is key to making sure you actually achieve your aims.</p>
</blockquote>
<p>By the way, I did some memory consumption measurements using the size of the<br>
maximum (peak) resident area for a small Ruby program (about 15 lines) and<br>
its sub-processes, on an otherwise idle x86-64 machine with 32GB of memory, using<br>
<code>(j)ruby --disable-gems</code>. Here are the numbers:</p>
<pre><code>Ruby trunk: 6.4MB
RTL: 6.5MB
RTL+GCC JIT: 26.9MB
RTL+LLVM JIT: 52.1MB
OMR: 6.5MB
OMR+JIT: 18.0MB
jruby: 244.5MB
Graal: 771.0MB
</code></pre>
<p>It gives a rough idea of what the JIT memory consumption costs are.</p>
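<p>For anyone wanting to reproduce such measurements: on Linux the peak resident area is reported as VmHWM in /proc/self/status, which a small helper can read (a sketch of one way to do it; the helper name is mine, and it is Linux-specific):</p>

```ruby
# Read the peak resident set size (VmHWM) of the current process, in kB.
# Linux-specific: /proc/self/status does not exist elsewhere, so return nil.
def peak_rss_kb
  return nil unless File.readable?("/proc/self/status")
  File.foreach("/proc/self/status") do |line|
    return Integer(line.split[1]) if line.start_with?("VmHWM:")
  end
  nil
end
```

<p>Running the whole program under /usr/bin/time -v gives a comparable "Maximum resident set size" figure without modifying the program.</p>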
<p>The numbers should be taken with a grain of salt. They include<br>
all the code size too. As I wrote, multiple running copies of a program<br>
share the code, and in the case of GCC (cc1) it is about 20MB (so<br>
with 20 GCC instances running on a server, the average size of the peak<br>
resident area could be about 7.9MB).</p>
<p>I have no idea what the code size of jruby and Graal is, as they<br>
use sub-processes and I know nothing about them.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=641222017-04-08T23:41:38Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p><a href="https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch</a></p>
</blockquote>
<p>I've only taken a light look at it; but I think RTL shows<br>
interesting promise. I needed the following patch, which removes<br>
"restrict", to compile on Debian stable:</p>
<pre><code>https://80x24.org/spew/20170408231647.8664-1-e@80x24.org/raw
</code></pre>
<p>I also noted some rubyspec failures around break/while loops which<br>
might be RTL related (make update-rubyspec && make test-rubyspec):</p>
<pre><code>https://80x24.org/spew/20170408231930.GA11999@starla/
</code></pre>
<p>(The Random.urandom failures can be ignored since you're on an old version)</p>
<p>I haven't tried JIT, yet, as I'm already unhappy with current<br>
Ruby memory usage; but if RTL alone can provide small speed<br>
improvements without significant footprint I can deal with it.</p>
<p>I'm currently running dtas-player with RTL to play music and it<br>
seems fine <a href="https://80x24.org/dtas/" class="external">https://80x24.org/dtas/</a></p>
<p>Thanks.</p>
<p>Disclaimer: I do not use proprietary software (including JS) or<br>
GUI browsers; so I use "git fetch vnmakarov" and other<br>
normal git commands to fetch your changes after having the following<br>
entry in .git/config:</p>
<pre><code>[remote "vnmakarov"]
fetch = +refs/heads/*:refs/remotes/vnmakarov/*
url = https://github.com/vnmakarov/ruby.git
</code></pre> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=641282017-04-09T15:17:08Zvmakarov (Vladimir Makarov)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p><a href="https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch" class="external">https://github.com/vnmakarov/ruby/compare/rtl_mjit_branch_base...rtl_mjit_branch</a></p>
</blockquote>
<p>I've only taken a light look at it; but I think RTL shows<br>
interesting promise. I needed the following patch, which removes<br>
"restrict", to compile on Debian stable:</p>
<p><a href="https://80x24.org/spew/20170408231647.8664-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20170408231647.8664-1-e@80x24.org/raw</a></p>
<p>I also noted some rubyspec failures around break/while loops which<br>
might be RTL related (make update-rubyspec && make test-rubyspec):</p>
<p><a href="https://80x24.org/spew/20170408231930.GA11999@starla/" class="external">https://80x24.org/spew/20170408231930.GA11999@starla/</a></p>
<p>(The Random.urandom failures can be ignored since you're on an old version)</p>
</blockquote>
<p>Thank you for your feedback, Eric. I'll work on the issues you found.</p>
<p>So far I have spent about 80% of my MRI efforts on RTL, but probably that was because of the learning curve. I have not tried RTL on serious Ruby applications yet. On small benchmarks, I got from 0% to 100% (for a simple while loop) improvement. I'd say the average improvement could be 10%. MRI makes too many calls, on which the majority of time is spent, so the savings from less insn dispatching and memory traffic have a small impact. In some cases RTL can be even worse. For example, <code>o.m(a1, a2, a3)</code> produces the following stack insns and RTL insns:</p>
<pre><code> push <o index>
push <a1 index>
push <a2 index>
push <a3 index>
send <callinfo> <cache>
</code></pre>
<pre><code> loc2temp -2, <a1 index>
loc2temp -3, <a2 index>
loc2temp -4, <a3 index>
call_recv <call data>, <o index>, -1
</code></pre>
<p>RTL insns are 18% longer for this example. I am going to investigate what the overall length of executed stack insns vs RTL insns is when I resume my work on the project.</p>
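<p>The stack-insn half of this comparison can be inspected in stock MRI (a small sketch; the exact opcode names vary between MRI versions):</p>

```ruby
# Compile a three-argument call and dump the stack-based bytecode YARV
# generates: a series of pushes followed by a send-style instruction.
iseq = RubyVM::InstructionSequence.compile("o.m(a1, a2, a3)")
puts iseq.disasm
```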
<blockquote>
<p>I haven't tried JIT, yet, as I'm already unhappy with current<br>
Ruby memory usage; but if RTL alone can provide small speed<br>
improvements without significant footprint I can deal with it.</p>
</blockquote>
<p>I believe there would be no additional footprint for RTL insns, or an insignificant increase (1-2%).</p>
<p>The JIT is ready only for small benchmarks right now. My big worry is the use of the exec wrapper when we go from JITed code execution to interpreted code execution to other JITed code and so on. It might increase stack usage. But I am going to work on removing the exec wrapper usage in some cases.</p>
<p>If you are not happy with the current MRI memory footprint, you will definitely be unhappy with any JIT, because its work will require much more peak memory (at least an order of magnitude more) than the current MRI footprint.</p>
<p>But I think with my approach I can use much less memory and CPU (JITs might require more CPU usage because of the compilations) than jruby or Graal. My JIT will also have no startup delay, which is huge for jruby and Graal. Still, achieving better performance (wall clock execution time) should be the first priority of my JIT project.</p>
<p>By the way, I forgot to mention that my approach also opens a possibility in the future to distribute gems as C code without binaries, and it might help gem portability.</p>
<blockquote>
<p>I'm currently running dtas-player with RTL to play music and it<br>
seems fine <a href="https://80x24.org/dtas/" class="external">https://80x24.org/dtas/</a></p>
</blockquote>
<p>Great! Thank you for sharing this.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=641362017-04-09T21:22:19Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<pre><code>**stack-based** insns to **register transfer** ones. The idea behind
it is to decrease VM dispatch overhead as approximately 2 times
less RTL insns are necessary than stack based insns for the same
program (for Ruby it is probably even less as a typical Ruby program
contains a lot of method calls and the arguments are passed through
the stack).
But *decreasing memory traffic* is even more important advantage
of RTL insns as an RTL insn can address temporaries (stack) and
local variables in any combination. So there is no necessity to
put an insn result on the stack and then move it to a local
variable or put variable value on the stack and then use it as an
insn operand. Insns doing more also provide a bigger scope for C
compiler optimizations.
</code></pre>
</blockquote>
<p>One optimization I'd like to make while remaining 100% compatible<br>
with existing code is to add a way to annotate read-only args for<br>
methods (at least those defined via the C API). That would allow<br>
delaying putstring instructions, giving them the same effect<br>
as putobject.</p>
<p>This would require having visibility into the resolved method<br>
at runtime, before putting its args on the stack.</p>
<p>One trivial example would be the following, where<br>
String#start_with? has been annotated(*) with the args being<br>
read-only:</p>
<pre><code>foo.start_with?("/")
</code></pre>
<p>Instead of resolving the 'putstring "/"' first,<br>
the method "start_with?" is resolved.</p>
<p>If start_with? is String#start_with? with a constant<br>
annotation(*) for the arg(s); the 'putstring "/"'<br>
instruction returns the string w/o resurrecting it<br>
to avoid the allocation.</p>
<p>This would be a more generic way of doing things like<br>
opt_aref_with/opt_aset_with; but without adding more global<br>
redefinition flags.</p>
<p>(*) Defining a method may change from:</p>
<p>rb_define_method(rb_cString, "start_with?", rb_str_start_with, -1);</p>
<p>To something like:</p>
<p>rb_define_method2(rb_cString, "start_with?", rb_str_start_with,<br>
"RO(*prefixes)");</p>
<p>rb_define_method should continue to work as-is for old code,<br>
but having a new rb_define_method2 would also allow us to fix<br>
current inefficiencies in rb_scan_args and rb_get_kwargs.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=641842017-04-12T02:17:58Zvmakarov (Vladimir Makarov)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>One optimization I'd like to add while remaining 100% compatible<br>
with existing code is to add a way to annotate read-only args for<br>
methods (at least those defined in C-API). That will allow<br>
delaying putstring instructions and giving them the same effect<br>
as putobject.</p>
</blockquote>
<p>Your idea is interesting. I guess the optimization would be very useful and would help the MRI memory system.</p>
<p>I'll think too how to implement it with RTL insns.</p>
<p>I wanted to try new call insns where the call arguments are parameters of the call insn, e.g. <code>call2 recv, arg1, arg2</code>, where recv, arg1, and arg2 are location indexes or even values. If it works out from the performance point of view, the optimization implementation would be pretty straightforward.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=652042017-06-01T00:56:07Zvmakarov (Vladimir Makarov)
<ul></ul><p>I've updated README.md of the project. I added performance (wall, CPU time, memory consumption) comparison of the current state of MJIT with some other MRI versions (v2.0, base) and implementations (JRuby, Graal, OMR) on different benchmarks including OPTCARROT.</p>
<p>I hope it will be interesting and save time if somebody decides to evaluate MJIT.</p>
<p>You can find the performance section on <a href="https://github.com/vnmakarov/ruby#update-31-may-2017" class="external">https://github.com/vnmakarov/ruby#update-31-may-2017</a></p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=652532017-06-03T01:41:27Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>I've updated README.md of the project. I added performance (wall, CPU time, memory consumption) comparison of the current state of MJIT with some other MRI versions (v2.0, base) and implementations (JRuby, Graal, OMR) on different benchmarks including OPTCARROT.</p>
</blockquote>
<p>Thanks.</p>
<p>Btw, have you explored the GNU lightning JIT at all?<br>
<a href="http://www.gnu.org/software/lightning/" class="external">http://www.gnu.org/software/lightning/</a><br>
I'm on the mailing list and it doesn't seem very active, though...</p>
<blockquote>
<p>I hope it will be interesting and save time if somebody decides to evaluate MJIT.</p>
<p>You can find the performance section on <a href="https://github.com/vnmakarov/ruby#update-31-may-2017" class="external">https://github.com/vnmakarov/ruby#update-31-may-2017</a></p>
</blockquote>
<p>I encountered a new compatibility problem with gcc 4.9 on<br>
Debian stable with -Werror=incompatible-pointer-types not<br>
being supported.</p>
<p>Also, my previous comment about C99 "restrict" not working on<br>
my setup still applies.</p>
<p>Sorry, I haven't had more time to look at your work; but I guess<br>
it's mostly ko1's job since I'm not a compiler/VM expert,<br>
just a *nix plumber.</p>
<p>Thanks again.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=652712017-06-04T17:06:00Zvmakarov (Vladimir Makarov)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>Btw, have you explored the GNU lightning JIT at all?<br>
<a href="http://www.gnu.org/software/lightning/" class="external">http://www.gnu.org/software/lightning/</a><br>
I'm on the mailing list and it doesn't seem very active, though...</p>
</blockquote>
<p>Yes, I know about GNU Lightning, Eric. It is an old project. It is<br>
just a portable assembler.</p>
<p>Using it for a JIT is like building a car when you have only a wheel. To<br>
get a good performance result from a JIT, you still need to write a lot<br>
of optimizations. To get at least 50% of the performance of GCC or<br>
LLVM, somebody would have to spend many years implementing the 10-20 most<br>
useful optimizations. Using a tracing JIT could simplify the work, as the<br>
compiled code has a very simple control flow graph (extended basic<br>
blocks). But it is still a lot of work (at least 10 man-years),<br>
especially if you need to achieve good reliability and portability.</p>
<p>It would be possible to port GCC to GNU Lightning to use GCC's<br>
optimizations, but it makes no sense, as GCC can directly generate code<br>
for the targets supported by GNU Lightning, and for many more.</p>
<p>I've been studying JITs for many years and have done some research on them,<br>
and I don't see a better approach than using GCC (which turned 30<br>
this year) or LLVM. A huge amount of effort by hundreds of<br>
developers has been spent on these compilers to make them reliable,<br>
portable, and highly optimizing.</p>
<p>There is a myth that a JVM JIT creates better code than GCC/LLVM. I<br>
saw reports saying that the JVM JIT server compiler achieves only 40% of<br>
the performance of GCC/LLVM on some code (e.g. rendering code) in<br>
statically typed languages. That is why Azul (LLVM-based Java) exists<br>
despite the legal issues.</p>
<p>I think Graal performance, based on some articles I read, is somewhere<br>
in the middle between the JVM client and server JIT compilers. Probably<br>
OMR is approximately the same (although I know it less well than Graal).</p>
<p>So while using GCC/LLVM is the best option in my opinion, there<br>
is still an open question of how to use them. GCC has libjit and LLVM<br>
has MCJIT. Their APIs might change in the future. It is better to use C<br>
as the input because the C definition is <em>stable</em>.</p>
<p>It looks like libjit and MCJIT are a shortcut in comparison with C<br>
compilation. But the shortcut is not that big, as the biggest CPU consumers<br>
in GCC and LLVM are optimizations, not lexical analysis or parsing. I<br>
minimize this difference even more with a few techniques, e.g. using<br>
precompiled headers for the environment (the declarations and definitions<br>
needed for the C code compiled from a Ruby method by the JIT). By the way,<br>
the JVM uses an analogous approach (class data sharing) for faster startup.</p>
<p>Using C as the JIT input makes it easy to switch from GCC to LLVM and<br>
vice versa. It makes JIT debugging easier. It simplifies the<br>
environment creation; e.g. libjit would need a huge number of tedious<br>
API calls to do the same. Libjit also has no ability to do inlining,<br>
which would prevent inlining on the Ruby-&gt;C-&gt;Ruby path.</p>
<p>So I see more upsides than downsides to my approach. The current<br>
performance is also encouraging -- <strong>I get better performance on many<br>
tests than JRuby or Graal Ruby while using much less computer resources</strong>,<br>
although I have not yet started to work on Ruby-&gt;Ruby inlining and<br>
Ruby-&gt;C-&gt;Ruby inlining.</p>
<blockquote>
<p>I encountered a new compatibility problem with gcc 4.9 on<br>
Debian stable with -Werror=incompatible-pointer-types not<br>
being supported.</p>
<p>Also, my previous comment about C99 "restrict" not working on<br>
my setup still applies.</p>
</blockquote>
<p>My project is just at the initial stages. There are a lot of things<br>
to do. When I implement inlining I will focus on JIT reliability and<br>
stability. I don't think MJIT can be used right now for more serious<br>
programs.</p>
<p>I should remove -Werror=incompatible-pointer-types from the script and<br>
the restrict qualifiers I added. They are not important.</p>
<p>The code is currently tuned for my main environment (FC25 Linux). I<br>
check OS X very rarely. Some work should be done on configuring MRI<br>
to use the right options depending on the environment.</p>
<p>Eric, thank you for trying my code and giving feedback. I really<br>
appreciate it.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=652912017-06-06T01:51:37Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:<br>
</p>
<p>Thanks for detailed response.</p>
<blockquote>
<p>I should remove -Werror=incompatible-pointer-types from the script and<br>
restrict added by me. They are not important.</p>
</blockquote>
<p>Actually, I've discovered AC_C_RESTRICT is convenient to add to<br>
configure.in and I would like us to be able to take advantage of<br>
useful C99 (and C1x) features as they become available:</p>
<p><a href="https://80x24.org/spew/20170606012921.26806-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20170606012921.26806-1-e@80x24.org/raw</a></p>
<p>Perhaps -Werror=incompatible-pointer-types can be made a<br>
standard warning flag for building Ruby, too...</p>
<blockquote>
<p>The code is currently tuned for my major environment (FC25 Linux). I<br>
very rarely check OSX. Some work should be done for configuring MRI<br>
to use right options depending on the environment.</p>
</blockquote>
<p>Heh, I never run non-Free systems like OSX. Anyways I've been<br>
using FreeBSD (via QEMU) sometimes and found Wshorten-64-to-32<br>
errors in clang:</p>
<p><a href="https://80x24.org/spew/20170606012944.26869-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20170606012944.26869-1-e@80x24.org/raw</a></p>
<p>I guess that will help clang testers on other systems, too.</p>
<blockquote>
<p>Eric, thank you for trying my code and giving a feedback. I really<br>
appreciate it.</p>
</blockquote>
<p>No problem! I'm still learning VM and compiler stuff from all<br>
this and will do what I can to keep things running on the<br>
ancient crap I have :)</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=653502017-06-12T21:51:38Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>I should remove -Werror=incompatible-pointer-types from the script and<br>
restrict added by me. They are not important.</p>
</blockquote>
<p>Actually, I've discovered AC_C_RESTRICT is convenient to add to<br>
configure.in and I would like us to be able to take advantage of<br>
useful C99 (and C1x) features as they become available:</p>
<p><a href="https://80x24.org/spew/20170606012921.26806-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20170606012921.26806-1-e@80x24.org/raw</a></p>
</blockquote>
<p>Ah, I noticed you've removed "restrict" from your branch.<br>
Technically, wouldn't that be a regression from an optimization<br>
standpoint? (of course you know far more about compiler<br>
optimization than I).</p>
<blockquote>
<p>Perhaps -Werror=incompatible-pointer-types can be made a<br>
standard warning flag for building Ruby, too...</p>
</blockquote>
<p>That removal was fine by me.</p>
<p>Not a particularly focused review, just random stuff I'm<br>
spotting while taking breaks from other projects.</p>
<p>Mostly just mundane systems stuff, nothing about the actual<br>
mjit changes.</p>
<ul>
<li>I noticed mjit.c uses its own custom doubly-linked list for<br>
rb_mjit_batch_list. For me, that adds a little extra burden<br>
in having extra code to review. Any particular reason ccan/list<br>
isn't used?</li>
</ul>
<p>Fwiw, the doubly linked list implementation in compile.c<br>
predated ccan/list; and I didn't want to:</p>
<p>a) risk throwing away known-working code</p>
<p>b) introduce a teeny performance regression for loop-heavy<br>
code:</p>
<p>ccan/list is faster for insert/delete, but slightly<br>
slower at loop iteration from what I could tell.</p>
<ul>
<li>
<p>The pthread_* stuff can probably use portable stuff defined in<br>
thread.c and thread_*.h. (Unfortunately for me) Ruby needs to<br>
support non-Free platforms :<</p>
</li>
<li>
<p>fopen should probably be replaced by something which sets<br>
cloexec; since the "e" flag of fopen is non-portable.</p>
</li>
</ul>
<p>Perhaps rb_cloexec_open() + fdopen().</p>
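<p>At the Ruby level this is already the default: IOs opened through Ruby's IO layer get O_CLOEXEC since Ruby 2.0, which is the behaviour rb_cloexec_open() provides to C code (a small sketch):</p>

```ruby
require "tempfile"

# Descriptors opened via Ruby's IO layer are close-on-exec by default,
# so they are not leaked into exec'd child processes.
Tempfile.create("cloexec-demo") do |f|
  f.close_on_exec?          # true on Ruby >= 2.0
  f.close_on_exec = false   # opt out per-IO when a child should inherit it
end
```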
<ul>
<li>
<p>It looks like you meant to use fflush instead of fsync; fflush is<br>
all that's needed to ensure other processes can see the file<br>
changes (and it's done transparently by fclose). fsync is to<br>
ensure the file is committed to stable storage, and some folks<br>
still use stable storage for /tmp. fsync before the final<br>
fflush is even wrong, as the kernel may not have all the<br>
data from userspace.</p>
</li>
<li>
<p>get_uniq_fname should respect alternate tmpdirs like Dir.tmpdir does<br>
(in lib/tmpdir.rb)</p>
</li>
<li>
<p>we can use vfork + execve instead of fork to speed up process<br>
creation; we just need to move the fopen (which can call malloc)<br>
into the parent. We've already used vfork for Process.spawn,<br>
system(), ``, and IO.popen for a few years.</p>
</li>
</ul>
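<p>The flush-versus-fsync distinction looks the same from Ruby: IO#flush moves userspace buffers into the kernel, which is all another process needs in order to see the data, while IO#fsync additionally commits the file to stable storage (a minimal sketch):</p>

```ruby
require "tempfile"

Tempfile.create("flush-demo") do |w|
  w.write("hello")
  w.flush            # userspace buffer -> kernel; other readers see it now
  # w.fsync          # only needed for durability across a crash/power loss
  File.read(w.path)  # an independent reader sees the flushed data
end
```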
<p>None of these are super important; and I can eventually take<br>
take some time to make send you patches or pull requests (via<br>
email/redmine)</p>
<p>rb_mjit_min_header-2.5.0.h takes forever to build...</p>
<p>Thanks again for taking your time to work on Ruby!</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=653592017-06-13T15:04:28Zvmakarov (Vladimir Makarov)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<p>Ah, I noticed you've removed "restrict" from your branch.<br>
Technically, wouldn't that be a regression from an optimization<br>
standpoint? (of course you know far more about compiler<br>
optimization than I).</p>
</blockquote>
<p>It was just an attempt to achieve the desired aliasing, but that is hard<br>
to do. There are too many VALUE * pointers in the MRI VM. Removing the<br>
restrict I added does not worsen the code. Aliasing is a weak point of C;<br>
therefore many HPC developers still prefer Fortran in many cases.</p>
<p>I think changing the type of pc might be more productive for achieving<br>
the necessary aliasing.</p>
<blockquote>
<blockquote>
<p>Perhaps -Werror=incompatible-pointer-types can be made a<br>
standard warning flag for building Ruby, too...</p>
</blockquote>
<p>That removal was fine by me.</p>
<p>Not a particularly focused review, just random stuff I'm<br>
spotting while taking breaks from other projects.</p>
<p>Mostly just mundane systems stuff, nothing about the actual<br>
mjit changes.</p>
</blockquote>
<p>Although it is random, it still took your time to do this and<br>
it is valuable to me. Thank you.</p>
<blockquote>
<ul>
<li>I noticed mjit.c uses its own custom doubly-linked list for<br>
rb_mjit_batch_list. For me, that adds a little extra burden<br>
in having extra code to review. Any particular reason ccan/list<br>
isn't used?</li>
</ul>
<p>Fwiw, the doubly linked list implementation in compile.c<br>
predated ccan/list; and I didn't want to:</p>
</blockquote>
<p>I remember MRI lists when I worked on changing compile.c. Uniformity<br>
of the code is important. I'll put it on my TODO list.</p>
<blockquote>
<p>a) risk throwing away known-working code</p>
<p>b) introduce a teeny performance regression for loop-heavy<br>
code:</p>
<p>ccan/list is faster for insert/delete, but slightly<br>
slower at loop iteration from what I could tell.</p>
<ul>
<li>
<p>The pthread_* stuff can probably use portable stuff defined in<br>
thread.c and thread_*.h. (Unfortunately for me) Ruby needs to<br>
support non-Free platforms :<</p>
</li>
<li>
<p>fopen should probably be replaced by something which sets<br>
cloexec; since the "e" flag of fopen is non-portable.</p>
</li>
</ul>
<p>Perhaps rb_cloexec_open() + fdopen().</p>
<ul>
<li>It looks like you meant to use fflush instead of fsync; fflush is<br>
all that's needed to ensure other processes can see the file<br>
changes (and it's done transparently by fclose). fsync is to<br>
ensure the file is committed to stable storage, and some folks<br>
still use stable storage for /tmp. fsync before the final<br>
fflush is even wrong, as the kernel may not have all the<br>
data from userspace.</li>
</ul>
</blockquote>
<p>Yes, my mistake. I'll correct this. fsync is also worse from a<br>
performance point of view.</p>
<blockquote>
<ul>
<li>get_uniq_fname should respect alternate tmpdirs like Dir.tmpdir does<br>
(in lib/tmpdir.rb)</li>
</ul>
</blockquote>
<p>I'll investigate this. For JIT performance the temp files used should be<br>
in a memory-backed FS. If alternative tmpdirs provide this, I should switch to them.</p>
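<p>For reference, Dir.tmpdir (lib/tmpdir.rb) picks the first usable directory among ENV["TMPDIR"], ENV["TMP"], ENV["TEMP"], the system temporary path, and /tmp, so pointing TMPDIR at a memory-backed filesystem such as /dev/shm is enough (a sketch; the fast_tmp path is made up for illustration):</p>

```ruby
require "tmpdir"
require "fileutils"

# Redirect temp files to a custom directory, e.g. one on a memory-backed FS.
custom = File.join(Dir.pwd, "fast_tmp")
FileUtils.mkdir_p(custom)
saved = ENV["TMPDIR"]
ENV["TMPDIR"] = custom
Dir.tmpdir              # now resolves to the custom directory
ENV["TMPDIR"] = saved   # restore for the rest of the process
```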
<blockquote>
<ul>
<li>we can use vfork + execve instead of fork to speed up process<br>
creation; we just need to move the fopen (which can call malloc)<br>
into the parent. We've already used vfork for Process.spawn,<br>
system(), ``, and IO.popen for a few years.</li>
</ul>
</blockquote>
<p>Yes, it can be a performance win, although probably a small one.</p>
<blockquote>
<p>None of these are super important; and I can eventually take<br>
take some time to make send you patches or pull requests (via<br>
email/redmine)</p>
</blockquote>
<p>Only if it is not a burden for you. You already gave a fresh look<br>
at the code and proposed valuable improvements.</p>
<p>I focused just on Linux, and a bit on macOS. I ignored other OSes,<br>
e.g. Windows. My main goal was to justify the approach from the<br>
performance point of view and then work more on MJIT portability.</p>
<p>Now I can say it works, although a lot of performance improvements still can and should be done. I think the portability work could already start.</p>
<blockquote>
<p>rb_mjit_min_header-2.5.0.h takes forever to build...</p>
</blockquote>
<p>Yes, it is slow (about 75 sec on an i3-7100). It is a Ruby script trying to remove unnecessary C definitions/declarations. After removing some C code, it calls the C compiler to check that the remaining code is still valid.</p>
<p>I tried many things to speed it up, e.g. checking that the header will be the same, removing several declarations at once, and using special C compiler options to speed up the check. But I take your point that it is still slow.</p>
<p>I'll think about further speedups. Maybe I'll try running a few C compilations in parallel, or generating a bigger header, as loading/reading a pre-compiled header takes only a tiny part of even a small method's compilation.</p>
<blockquote>
<p>Thanks again for taking the time to work on Ruby!</p>
</blockquote>
<p>Eric, thank you for your time reviewing my code.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=653602017-06-13T22:11:57Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:vmakarov@redhat.com" class="email">vmakarov@redhat.com</a> wrote:</p>
<blockquote>
<p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>None of these are super important; and I can eventually take some time to send you patches or pull requests (via email/redmine)</p>
</blockquote>
<p>Only if it is not a burden for you. You already gave a fresh look<br>
at the code and proposed valuable improvements.</p>
</blockquote>
<p>Alright; I've actually got plenty on my plate, but...</p>
<blockquote>
<blockquote>
<p>rb_mjit_min_header-2.5.0.h takes forever to build...</p>
</blockquote>
<p>Yes, it is slow (about 75 sec on an i3-7100). It is a Ruby script trying to remove unnecessary C definitions/declarations. After removing some C code, it calls the C compiler to check that the remaining code is still valid.</p>
</blockquote>
<p>Yeah, that could actually be a blocker to potential contributors.</p>
<p>I force myself to work on ancient hardware to notice slow things<br>
before others do; and sometimes I'm less inclined or get<br>
distracted by other projects when builds take a long time.</p>
<p>Good to know you're also bothered by it :)</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=673032017-10-18T18:45:21Zk0kubun (Takashi Kokubun)takashikkbn@gmail.com
<ul></ul><p>Hi Vladimir. I was happy to talk with you about JIT at RubyKaigi.</p>
<p>To help introduce RTL and MJIT into the upstream Ruby core safely, I'm wondering if we might experimentally introduce optional (switchable by the -j option) JIT infrastructure first and then separately introduce the mandatory RTL instruction changes.</p>
<p>My experimental attempt at that goal, YARV-MJIT, is here: <a href="https://github.com/k0kubun/yarv-mjit" class="external">https://github.com/k0kubun/yarv-mjit</a><br>
It's basically a fork of your project, but the VM part is unchanged from current Ruby and the compiler is different. It's much slower than MJIT but meaningfully faster than Ruby 2.5.</p>
<p>If we take YARV-MJIT as an intermediate step toward RTL-MJIT, we can keep the major part of MJIT upstream without introducing breaking instruction changes (as you know, VM instructions are already compiled by gems like bootsnap, so replacing instructions would be a breaking change even if it had no bugs), possibly in 2.x. And I believe this approach will make it easier to maintain MJIT and make all Rubyists happy with the final RTL+MJIT introduction in Ruby 3.</p>
<p>This is work in progress and this comment is just for sharing my plan. My project's current quality is much worse than yours (during development of YARV-MJIT I found that every part of MJIT is well considered and really great), so I need to improve mine before I really propose introducing it to core.</p>
<p>I want to hear your thoughts about this direction.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=673242017-10-19T04:19:40Zvmakarov (Vladimir Makarov)
<ul></ul><p>k0kubun (Takashi Kokubun) wrote:</p>
<blockquote>
<p>Hi Vladimir. I was happy to talk with you about JIT at RubyKaigi.</p>
</blockquote>
<p>Hi. I am also glad that I visited RubyKaigi and I got a lot of feedback from my discussions with Koichi, Matz, and you.</p>
<blockquote>
<p>To help introduce RTL and MJIT into the upstream Ruby core safely, I'm wondering if we might experimentally introduce optional (switchable by the -j option) JIT infrastructure first and then separately introduce the mandatory RTL instruction changes.</p>
</blockquote>
<p>Yes. I guess it is possible. RTL and MJIT probably can be separated as projects. Matthew Gaudet already proposed separating RTL; as I understand it, this could help the OMR implementation in some way.</p>
<blockquote>
<p>My experimental attempt for that goal, YARV-MJIT, is here: <a href="https://github.com/k0kubun/yarv-mjit" class="external">https://github.com/k0kubun/yarv-mjit</a><br>
It's basically a fork of your project, but the VM part is unchanged from current Ruby and the compiler is different. It's much slower than MJIT but meaningfully faster than Ruby 2.5.</p>
</blockquote>
<p>I'll look at this. Thank you for pointing this out.</p>
<blockquote>
<p>If we take YARV-MJIT as an intermediate step toward RTL-MJIT, we can keep the major part of MJIT upstream without introducing breaking instruction changes (as you know, VM instructions are already compiled by gems like bootsnap, so replacing instructions would be a breaking change even if it had no bugs), possibly in 2.x. And I believe this approach will make it easier to maintain MJIT and make all Rubyists happy with the final RTL+MJIT introduction in Ruby 3.</p>
</blockquote>
<p>Keeping stack insns is what we discussed with Koichi at RubyKaigi and afterwards. I promised to investigate another way of generating RTL, through stack insns. I am now realizing that it might be a better way than generating RTL directly from MRI nodes because</p>
<ul>
<li>it will not break existing applications working with stack insns</li>
<li>stack insns could remain a stable, existing interface to the VM. RTL will definitely keep changing as new optimizations are implemented, so RTL will be unstable for a long time, and there is probably no sense in exposing RTL to Ruby programmers at all. Actually, a similar approach is used by the JVM: bytecode as the interface to the JVM, and another internal IR for the JIT which is not visible to JVM users.</li>
<li>it will make merging the trunk into the rtl-mjit branch much easier, because the current RTL code generation means completely rewriting the big compile.c file, and any change to compile.c on the trunk becomes a merge problem (the rtl-mjit branch is now almost 1 year behind the trunk).</li>
</ul>
<p>The slowdown of the nodes -> stack insns -> RTL path might be negligible in comparison with the nodes -> RTL path, and slowdown is the major potential disadvantage of the stack insn -> RTL path.</p>
<p>So about a week ago, I started to work on generation of RTL from stack insns. When the implementation starts working, I'll make it public (it will be a separate branch). I hope it will happen in about a month, but it might be delayed if I am distracted by GCC work.</p>
<blockquote>
<p>This is work in progress and this comment is just for sharing my plan. My project's current quality is much worse than yours (I found every part of MJIT is well considered and really great during development of YARV-MJIT), so I need to improve mine before I really propose to introduce it to core.</p>
<p>I want to hear your thoughts about this direction.</p>
</blockquote>
<p>I am not against your plan. An alternative approach can be useful, though it might turn out to be a waste of your time in the end. But any performance work requires a lot of alternative implementations (e.g. the current global RA in GCC was actually one of my seven different RA implementations), and some temporary solutions might become permanent; who knows.</p>
<p>I still believe that RTL should exist at the end because GCC/LLVM optimizations will not solve all optimization problems.</p>
<p>For example, GCC/LLVM optimizes the int->fixnum->int->... conversions well, but they cannot optimize the double->flonum->double->... conversions well because the tagged double representation of Ruby values is too complicated. Therefore floating-point benchmarks are not improved significantly by MJIT. Optimizing would not be a problem for non-tagged representations of values (e.g. (mode, int) or (mode, double)), but switching to another value representation is practically impossible, as the current representation is already reflected in Ruby (objectid) and the MRI C interface.</p>
<p>So the solution would be implementing an analysis on RTL to use raw double values in the JITted code of a method, avoiding double->flonum and flonum->double conversions. RTL is a good fit for this.</p>
<p>Basic type inference could be another example of why RTL is necessary. I could find other examples.</p>
<p>MJIT itself is currently not stable. And I'd like to work on its stabilization after trying RTL generation from stack insns.</p>
<p>That is my major thoughts about your proposal. Thank you for asking.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=673462017-10-19T08:23:29Zk0kubun (Takashi Kokubun)takashikkbn@gmail.com
<ul></ul><blockquote>
<p>Keeping stack insns is what we discussed with Koichi at RubyKaigi and afterwards. I promised to investigate another way of generating RTL, through stack insns. I am now realizing that it might be a better way than generating RTL directly from MRI nodes because</p>
<p>it will not break existing applications working with stack insns</p>
</blockquote>
<p>Oh, I didn't know that plan. I like that approach.</p>
<blockquote>
<p>stack insns could remain a stable, existing interface to the VM. RTL will definitely keep changing as new optimizations are implemented, so RTL will be unstable for a long time, and there is probably no sense in exposing RTL to Ruby programmers at all. Actually, a similar approach is used by the JVM: bytecode as the interface to the JVM, and another internal IR for the JIT which is not visible to JVM users.</p>
</blockquote>
<p>Good to know. That sounds like a good way to introduce RTL insns for easy maintenance.</p>
<blockquote>
<p>The slowdown of the nodes -> stack insns -> RTL path might be negligible in comparison with the nodes -> RTL path, and slowdown is the major potential disadvantage of the stack insn -> RTL path.</p>
</blockquote>
<p>I agree that the slowdown is negligible. As for major disadvantages: compared to YARV-MJIT, implementation and debugging would be more complex.<br>
So maintainability and performance would be a trade-off. Thus I think we need to decide on an approach by comparing the differences in implementation complexity against the difference in performance.</p>
<blockquote>
<p>An alternative approach can be useful, though it might turn out to be a waste of your time in the end. But any performance work requires a lot of alternative implementations (e.g. the current global RA in GCC was actually one of my seven different RA implementations), and some temporary solutions might become permanent; who knows.</p>
</blockquote>
<p>As I'm hacking on Ruby not as work but just as a hobby, to enjoy improving my understanding of the Ruby core, it wouldn't be a waste of time even if I end up developing seven different JIT implementations :)</p>
<blockquote>
<p>So the solution would be implementing an analysis on RTL to use raw double values in the JITted code of a method, avoiding double->flonum and flonum->double conversions. RTL is a good fit for this.</p>
</blockquote>
<p>Question for my better understanding: do you mean GCC and Clang can't optimize the double<->flonum conversion well even if all necessary code is inlined? If so, making a special effort to optimize it in the Ruby core makes sense. I'm not sure why we can't do that with stack-based instructions, or just in the JIT-ed C code generation process. Can't we introduce instruction specialization (to avoid the double<->flonum conversion; I'm not sure of its details) without making all instructions register-based?</p>
<blockquote>
<p>Basic type inference could be another example of why RTL is necessary. I could find other examples.</p>
</blockquote>
<p>Type inference on RTL instructions is an interesting topic which I couldn't understand well from our discussion at RubyKaigi. I'm looking forward to seeing the example!</p>
<ul></ul><p>k0kubun (Takashi Kokubun) wrote:</p>
<blockquote>
<blockquote>
<p>An alternative approach can be useful, though it might turn out to be a waste of your time in the end. But any performance work requires a lot of alternative implementations (e.g. the current global RA in GCC was actually one of my seven different RA implementations), and some temporary solutions might become permanent; who knows.</p>
</blockquote>
<p>As I'm hacking on Ruby not as work but just as a hobby, to enjoy improving my understanding of the Ruby core, it wouldn't be a waste of time even if I end up developing seven different JIT implementations :)</p>
</blockquote>
<p>Sorry, Takashi, I was inaccurate. I agree. Any serious problem solving (even if it does not result in MRI code changes) makes anyone a better, more experienced MRI developer.</p>
<blockquote>
<blockquote>
<p>So the solution would be implementing an analysis on RTL to use raw double values in the JITted code of a method, avoiding double->flonum and flonum->double conversions. RTL is a good fit for this.</p>
</blockquote>
<p>Question for my better understanding: Do you mean GCC and Clang can't optimize double<->flonum conversion well even if all necessary code is inlined?</p>
</blockquote>
<p>Yes. It is too complicated for them. Tagging doubles manipulates the exponent and mantissa, constraining the exponent range and using part of the exponent field to store a few of the less significant bits of the mantissa. Even worse, handling 0.0 makes it more complicated. Optimizing compilers are not smart enough to see that untagging doubles is the reverse operation of tagging, and vice versa.</p>
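<p>As a rough illustration of why the conversions are opaque to compilers, here is a simplified model of the flonum encoding (based on my reading of CRuby's USE_FLONUM scheme; the heap-allocation fallback for out-of-range exponents is omitted and the helper names are mine). The round trip is a bit rotation plus masking in each direction, which GCC/LLVM do not recognize as inverse operations:</p>

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch of a flonum-style tagging scheme: the double's
 * bits are rotated left by 3 so the 2-bit tag lands in the low bits,
 * with a special case for +0.0 and a range check on the exponent
 * (values outside the range would go to the heap; omitted here). */
typedef uint64_t VALUE;

#define ROTL(v, n) (((v) << (n)) | ((v) >> (64 - (n))))
#define ROTR(v, n) (((v) >> (n)) | ((v) << (64 - (n))))

static VALUE
double_to_flonum(double d)
{
    VALUE b;
    memcpy(&b, &d, sizeof b);
    if (b == 0) return 0x8000000000000002ULL;        /* +0.0 */
    int e = (int)((b >> 60) & 0x7);
    if (e != 3 && e != 4) return 0;                  /* would go to heap */
    return (ROTL(b, 3) & ~(VALUE)0x01) | 0x02;
}

static double
flonum_to_double(VALUE v)
{
    double d;
    if (v == 0x8000000000000002ULL) return 0.0;      /* +0.0 special case */
    VALUE b63 = v >> 63;
    /* restore the rotated-out low bits, then rotate back */
    VALUE b = ROTR((2 - b63) | (v & ~(VALUE)0x03), 3);
    memcpy(&d, &b, sizeof d);
    return d;
}
```

A human can see tag-then-untag is the identity; for a compiler, proving that through the rotations, masks, and the 0.0 branch is beyond current value-range analyses, so the conversions survive in JITted loops.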
<blockquote>
<p>If so, having special effort to optimize it in Ruby core makes sense. I'm not sure why we can't do that with stack-based instructions or just in JIT-ed C code generation process. Can't we introduce instruction specialization (to avoid double<->flonum conversion, not sure its details) without having all instructions as register-based?</p>
</blockquote>
<p>You can do the optimization with stack insns. You need to analyze all the method's code and see where operand values come from. It is easier to do with RTL.</p>
<p>But actually the worst part of using stack insns for optimizations is that you cannot easily transform a program on them (e.g. to move an invariant expression out of a loop you need to introduce new local vars), because they process values only in a stack manner, while optimized code can process values in any order.</p>
<p>In any case, if we are going to do some optimizations ourselves (and I see such a necessity in the future), not only with GCC/LLVM, we need a convenient IR for this. I tried to explain this in my presentation at RubyKaigi.</p>
<p>One simple case where we can avoid untagging is an RTL insn with an immediate operand (we can use a double, not a VALUE, for the immediate operand). It is actually on my TODO list.</p>
<blockquote>
<blockquote>
<p>Basic type inference could be another example of why RTL is necessary. I could find other examples.</p>
</blockquote>
<p>Type inference on RTL instructions is an interesting topic which I couldn't understand well from our discussion at RubyKaigi. I'm looking forward to seeing the example!</p>
</blockquote>
<p><a href="https://github.com/dino-lang/dino/blob/master/DINO/d_inference.c" class="external">https://github.com/dino-lang/dino/blob/master/DINO/d_inference.c</a> is an example of how basic type inference can be implemented on an RTL-like language. It is a different approach from algorithm W in the Hindley–Milner type system. The algorithm consists of the following steps:</p>
<ol>
<li>Building a control flow graph (CFG) consisting of basic blocks and control flow edges connecting them.</li>
<li>Calculating available results of RTL instructions – this is a forward data-flow problem on the CFG.</li>
<li>Using the availability information, building def-use chains connecting possible operands and results of RTL instructions and variables.</li>
<li>Calculating the types of RTL instruction operands and results – this is a forward data-flow problem on the def-use graph.</li>
</ol>
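<p>Step 2, the forward data-flow problem, can be sketched with a toy fixed-point solver (my illustration, not code from d_inference.c; the CFG and the GEN/KILL bitsets are hard-coded for a small diamond-shaped CFG):</p>

```c
#include <stdint.h>

/* Toy forward data-flow fixed point: which instruction results are
 * "available" at each basic block.  in[b] is the AND over predecessors
 * of out[p]; out[b] = (in[b] & ~kill[b]) | gen[b].  Real compilers
 * derive gen/kill from the RTL; here they are hard-coded bitsets. */
enum { NBLOCKS = 4 };

static const uint32_t gen[NBLOCKS]   = { 0x3, 0x4, 0x8, 0x0 };
static const uint32_t kill_[NBLOCKS] = { 0x0, 0x8, 0x4, 0x0 };
/* edges: 0->1, 0->2, 1->3, 2->3 (a diamond) */
static const int npred[NBLOCKS]   = { 0, 1, 1, 2 };
static const int pred[NBLOCKS][2] = { {0,0}, {0,0}, {0,0}, {1,2} };

void
solve_available(uint32_t in[NBLOCKS], uint32_t out[NBLOCKS])
{
    for (int b = 0; b < NBLOCKS; b++) { in[b] = 0; out[b] = gen[b]; }
    int changed = 1;
    while (changed) {          /* iterate until a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            uint32_t i = npred[b] ? ~(uint32_t)0 : 0;
            for (int p = 0; p < npred[b]; p++) i &= out[pred[b][p]];
            uint32_t o = (i & ~kill_[b]) | gen[b];
            if (i != in[b] || o != out[b]) { in[b] = i; out[b] = o; changed = 1; }
        }
    }
}
```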
<p>The definition of availability and def-use chains can be found in practically any book about optimizing compilers.</p>
<ul></ul><blockquote>
<p>In any case, if we are going to do some optimizations by ourself (and I see such necessity in the future) not only by GCC/LLVM, we need a convenient IR for this.</p>
</blockquote>
<p>Yeah, I saw rtl_exec.c transforms ISeq dynamically and allows MJIT to have insn that can be inlined easily. I can imagine the same idea will work on stack->RTL->JIT version of your MJIT. For our shared goal (stack->JIT), I agree that having any IR might be a helpful tool.</p>
<blockquote>
<p>Tagging doubles manipulates the exponent and mantissa, constraining the exponent range and using part of the exponent field to store a few of the less significant bits of the mantissa. Even worse, handling 0.0 makes it more complicated. Optimizing compilers are not smart enough to see that untagging doubles is the reverse operation of tagging, and vice versa.</p>
<p>One simple case where we can avoid untagging is an RTL insn with an immediate operand (we can use a double, not a VALUE, for the immediate operand).</p>
</blockquote>
<p>That makes sense. If the compiler can't do that with inlined code, we need to do it at the MJIT level, and it would take somewhat less effort if the insns are RTL.</p>
<blockquote>
<p><a href="https://github.com/dino-lang/dino/blob/master/DINO/d_inference.c" class="external">https://github.com/dino-lang/dino/blob/master/DINO/d_inference.c</a> is an example of how basic type inference can be implemented on an RTL-like language. It is a different approach from algorithm W in the Hindley–Milner type system. The algorithm consists of the following steps:</p>
<ol>
<li>Building a control flow graph (CFG) consisting of basic blocks and control flow edges connecting them.</li>
<li>Calculating available results of RTL instructions – this is a forward data-flow problem on the CFG.</li>
<li>Using the availability information, building def-use chains connecting possible operands and results of RTL instructions and variables.</li>
<li>Calculating the types of RTL instruction operands and results – this is a forward data-flow problem on the def-use graph.</li>
</ol>
<p>The definition of availability and def-use chains can be found in practically any book about optimizing compilers.</p>
</blockquote>
<p>Thank you for pointing it out and summary. I'll take a look.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=689522017-12-26T00:20:08Zk0kubun (Takashi Kokubun)takashikkbn@gmail.com
<ul></ul><p>Happy Holidays, Vladimir. As our work duplicates many things, I'm proposing to partially merge your work into upstream at <a href="https://bugs.ruby-lang.org/issues/14235" class="external">https://bugs.ruby-lang.org/issues/14235</a>. I would like your opinion on it.</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/14235">Feature #14235</a>: Merge MJIT infrastructure with conservative JIT compiler</i> added</li></ul> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=690022017-12-26T20:19:22Zvmakarov (Vladimir Makarov)
<ul></ul><p>k0kubun (Takashi Kokubun) wrote:</p>
<blockquote>
<p>Happy Holidays, Vladimir. As our work duplicates many things, I'm proposing to partially merge your work into upstream at <a href="https://bugs.ruby-lang.org/issues/14235" class="external">https://bugs.ruby-lang.org/issues/14235</a>. I would like your opinion on it.</p>
</blockquote>
<p>Thank you, Takashi. Happy holidays to you and your family too.</p>
<p>Thank you very much for working on MJIT and trying alternative ways to use it. You have made great progress with this project.</p>
<p>At first, I thought the YARV-MJIT project was not worth working on, but it is already working, and a working intermediate solution makes a lot of sense because it gives performance improvements and permits debugging and improving the MJIT engine on real applications. Working step by step is a good engineering approach.</p>
<p>So <strong>I support your proposal</strong>, but I guess you should get other Ruby developers' opinions, especially Koichi's. I did not check your code. I am not sure that MJIT in your pull request will work for all platforms (I used pthreads, but to make it more portable the MRI thread implementation should be used; also, how MJIT should work on Windows is a question for me). I guess the portability issues can be solved later, during 2018.</p>
<p>Moreover, YARV-MJIT might become the final solution. Although C code generated from RTL is better optimized by a C compiler, and RTL is also more convenient for future optimizations in MRI itself, stack insns can still be optimized too, although with more effort and in a less effective way. So if I find that the stack insns -> RTL translation has some serious problems, your approach can become the mainstream and I might switch to working on it too.</p>
<p>Right now I see that the potential problems with the stack insn -> RTL approach are long compilation time and interpretation speed.</p>
<p>I am still working on RTL generation from stack insns. It is already the second version. It became a multipass algorithm because I need to provide the same state of the emulated stack (depth and locations of the insn operands) on different CFG paths (a typical forward dataflow problem in compilers). So the translator might be slower than I originally thought.</p>
<p>Also, a lot of things need to be done for RTL to provide the same or better interpretation speed.</p>
<p>In your pull request you wondered about the state of my stack insn -> RTL generator. I planned to have a working generator before the end of the year, but it is taking more time than I thought. Now I hope to publish it sometime in February.</p>
<ul></ul><p>Thank you for sharing your thoughts and support.</p>
<blockquote>
<p>So I support your proposal, but I guess you should get other Ruby developers' opinions, especially Koichi's. I did not check your code.</p>
</blockquote>
<p>Today I got a code review from Koichi-san and mame-san. We found potential bugs in exception handling and TracePoint support, but in our discussion they were considered not so hard to fix.</p>
<p>And Koichi said: after fixing them and confirming that all tests pass in a mode that forces synchronous compilation of all ISeqs and allows compiling an unlimited number of ISeqs, we can merge it.</p>
<blockquote>
<p>I am not sure that MJIT in your pull request will work for all platforms (I used pthreads, but to make it more portable the MRI thread implementation should be used; also, how MJIT should work on Windows is a question for me).</p>
</blockquote>
<p>I fixed the pthread part to use native Windows threads on Windows, so it can be compiled on mswin64. At the initial merge I'm not going to support cl.exe (this is acknowledged by the mswin64 maintainer). I'm going to fix mjit_init to disable MJIT if a compiler is not found, especially for mswin64, but that will be the only platform change before the merge. I have an idea for supporting cl.exe, and it'll be done at an early stage after the merge.</p>
<p>I understand the mswin64 support allows us to cover all platforms that had been considered tier 1 in <a href="https://bugs.ruby-lang.org/projects/ruby-trunk/wiki/SupportedPlatforms" class="external">https://bugs.ruby-lang.org/projects/ruby-trunk/wiki/SupportedPlatforms</a> (though the tier information was dropped recently). I believe the tier 2 ones work too (at least I confirmed it works on MinGW through the <a href="https://github.com/vnmakarov/ruby/pull/4" class="external">https://github.com/vnmakarov/ruby/pull/4</a> work). So I think it should be sufficient for now.</p>
<blockquote>
<p>Moreover, YARV-MJIT might become the final solution. Although C code generated from RTL is better optimized by a C compiler, and RTL is also more convenient for future optimizations in MRI itself, stack insns can still be optimized too, although with more effort and in a less effective way. So if I find that the stack insns -> RTL translation has some serious problems, your approach can become the mainstream and I might switch to working on it too.</p>
</blockquote>
<p>I see. I'll continue to improve YARV-MJIT for the case that stack insns -> RTL has serious problems, but I'm looking forward to seeing your new version of the JIT compiler, as I (and probably many Ruby users) want a faster Ruby and I'm interested in the technical differences between stack and RTL. I'll keep the mjit_compile function easy to replace.</p>
<blockquote>
<p>Right now I see that the potential problems with the stack insn -> RTL approach are long compilation time and interpretation speed.</p>
<p>I am still working on RTL generation from stack insns. It is already the second version. It became a multipass algorithm because I need to provide the same state of the emulated stack (depth and locations of the insn operands) on different CFG paths (a typical forward dataflow problem in compilers). So the translator might be slower than I originally thought.</p>
</blockquote>
<p>Interesting. Understanding your ideas and strategies from your code is always a valuable experience, so I want to read it when it becomes ready to publish. Thank you for sharing the state.</p>
<p>I hope merging the patch will help your MJIT development by reducing the cost of rebasing against trunk. Once it's merged, let's use the same MJIT infrastructure, and please send patches you want included upstream, "working step by step", anytime.</p>
<ul></ul><p>This is great news for Ruby and its community. Thank you both for your great work.<br>
Things can only get better! Long live Ruby!</p>
<ul></ul><p>vmakarov (Vladimir Makarov) wrote:</p>
<blockquote>
<p>For example, GCC/LLVM optimizes the int->fixnum->int->... conversions well, but they cannot optimize the double->flonum->double->... conversions well because the tagged double representation of Ruby values is too complicated. […] switching to another value representation is practically impossible, as the current representation is already reflected in Ruby (objectid) and the MRI C interface.</p>
</blockquote>
<p>I don't think objectid should be a stopper. There are exactly three things that objectid guarantees:</p>
<ul>
<li>Every object has exactly one objectid.</li>
<li>An object has the same objectid for its entire lifetime.</li>
<li>No two objects have the same objectid at the same time (but may have the same objectid at different times).</li>
</ul>
<p>Any code (Ruby or C) that assumes anything more about objectids (such as specific values) is simply broken.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=704482018-02-19T21:32:10Zvmakarov (Vladimir Makarov)
<ul></ul><p>For the last 4 months I've been working on generation of RTL from stack insns. The reason for this is that stack insns are already a part of CRuby; the current generation of RTL directly from the nodes would actually remove this interface.</p>
<p>Another reason for this work is to simplify future merges of the RTL and MJIT branches with the trunk.</p>
<p>I think I've reached a project state where I can make the branch public, but there are still a lot of things to do for this project.</p>
<p>Generation of RTL from stack insns is a harder task than generation from the nodes. When we generate RTL from the nodes, we have a lot of context. When we generate RTL from stack insns, we need to reconstruct this context (from different CFG paths in a stack insn sequence).</p>
<p>To reconstruct the context, we emulate the VM stack and may pass over a stack insn sequence several times. First we calculate the possible stack values at each label; this is a typical forward data-flow problem in compilers (the final fixed point is only the temporaries on the emulated stack). Then, using this info, we actually generate the RTL insns on the last pass.</p>
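<p>The emulated-stack idea can be illustrated with a toy straight-line translator (my sketch, not the actual rtl_gen.c; all names are invented): stack insns push symbolic operand locations instead of values, and an operator insn pops its operands off the emulated stack and emits one three-address RTL insn.</p>

```c
#include <stdio.h>
#include <string.h>

/* Toy illustration of the emulated-stack translation: GETLOCAL pushes
 * a symbolic operand location; ADD pops two locations and emits one
 * three-address RTL insn with a fresh temporary as the result. */
enum op { GETLOCAL, ADD };
struct stack_insn { enum op op; const char *operand; };

/* Translate insns[0..n-1]; append the emitted RTL text to out. */
void
stack_to_rtl(const struct stack_insn *insns, int n, char *out, size_t outsz)
{
    const char *stack[16];      /* emulated VM stack: operand locations */
    char tmps[16][8];           /* names of generated temporaries */
    int sp = 0, tmp = 0;
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        switch (insns[i].op) {
          case GETLOCAL:        /* no RTL emitted: just record the source */
            stack[sp++] = insns[i].operand;
            break;
          case ADD: {           /* pop two locations, emit one RTL insn */
            const char *b = stack[--sp], *a = stack[--sp];
            snprintf(tmps[tmp], sizeof tmps[tmp], "t%d", tmp);
            char buf[64];
            snprintf(buf, sizeof buf, "add %s, %s, %s\n", tmps[tmp], a, b);
            strncat(out, buf, outsz - strlen(out) - 1);
            stack[sp++] = tmps[tmp++];
            break;
          }
        }
    }
}
```

The hard part Vladimir describes is absent from this sketch: at join points of the CFG, the emulated stack must have the same depth and operand locations on every incoming path, which is what requires the multipass data-flow solution.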
<p>I was afraid that the stack insn -> RTL generation might considerably slow down CRuby. Fortunately, that is not the case. This generation for optcarrot with one frame (which means practically no execution) takes about 0.2% of the whole CRuby run time. Running an empty script takes about 2% longer in comparison with direct generation of RTL from the nodes.</p>
<p>I created a new branch in my repository for this project. The branch name is <code>stack_rtl_mjit</code> (<a href="https://github.com/vnmakarov/ruby/tree/stack-rtl-mjit-base" class="external">https://github.com/vnmakarov/ruby/tree/stack-rtl-mjit-base</a>). All my work, including MJIT, will be done on this branch. The previous branch, <code>rtl_mjit_branch</code>, is frozen.</p>
<p>The major code generating RTL from stack insns is in a new file, rtl_gen.c.</p>
<p>I am going to continue work on this branch. My next plans are a merge with the trunk and bug fixing. It is a big job, as the branch is based on a more than one year old trunk.</p>
<p>There have been a lot of changes since then which affect the code I am working on. The biggest one is Takashi Kokubun's work on MJIT for YARV. Another is the trace insns removal by Koichi Sasada.</p>
<p>I am planning to work on merging with the trunk, unifying the MJIT code on the trunk and the branch, and fixing bugs until April/May. Sorry for the slow pace, but I do not have much time for this work until the GCC 8 release (probably the middle of April).</p>
<p>After that I am going to work on MJIT optimizations, including method inlining.</p>
<ul></ul><p>I just measured your branch using Discourse bench at: <a href="https://github.com/discourse/discourse/blob/master/script/bench.rb" class="external">https://github.com/discourse/discourse/blob/master/script/bench.rb</a></p>
<p>Looks like it is a bit slower than master:</p>
<p>RTL:</p>
<pre><code>---
categories:
50: 53
75: 59
90: 65
99: 104
home:
50: 59
75: 69
90: 82
99: 130
topic:
50: 60
75: 67
90: 78
99: 108
categories_admin:
50: 96
75: 103
90: 114
99: 174
home_admin:
50: 94
75: 106
90: 141
99: 181
topic_admin:
50: 110
75: 119
90: 136
99: 197
timings:
load_rails: 3749
ruby-version: 2.5.0-p-1
rss_kb: 247952
pss_kb: 236618
memorysize: 5.88 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 1
kernelversion: 4.4.0
</code></pre>
<p>Master</p>
<pre><code>---
categories:
50: 48
75: 56
90: 59
99: 101
home:
50: 56
75: 66
90: 105
99: 127
topic:
50: 54
75: 65
90: 80
99: 122
categories_admin:
50: 101
75: 110
90: 141
99: 207
home_admin:
50: 90
75: 100
90: 106
99: 134
topic_admin:
50: 101
75: 108
90: 118
99: 172
timings:
load_rails: 3789
ruby-version: 2.6.0-p-1
rss_kb: 276588
pss_kb: 265237
memorysize: 5.88 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 1
kernelversion: 4.4.0
</code></pre>
<p>Very interesting to see the significant memory improvement; is that expected? The only env var I am running is <code>RUBY_GLOBAL_METHOD_CACHE_SIZE: 131072</code>.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=704512018-02-20T00:40:29Zk0kubun (Takashi Kokubun)takashikkbn@gmail.com
<ul></ul><p>Great work on rtl_gen, Vladimir! Keeping both stack insns and RTL insns would be good for safe migration.</p>
<blockquote>
<p>There were a lot of changes since then which will affect the code I am<br>
working on. The biggest one is Takashi Kokubun's work on MJIT for<br>
YARV. Another one is trace insns removal by Koichi Sasada.</p>
</blockquote>
<p>I hope your work on merging trunk into stack-rtl-mjit will be easy; that was the aim of Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Merge MJIT infrastructure with conservative JIT compiler (Closed)" href="https://bugs.ruby-lang.org/issues/14235">#14235</a>. mjit_compile takes rb_iseq_constant_body, and you should be able to read rtl_encoded from it. Merging it would make your work much more portable, since the merged MJIT infrastructure is already running on many RubyCIs.</p>
<p>After merging it, which includes some fixes for JIT in the test cases, could you try running "make test-all RUN_OPTS='--jit-wait --jit-min-calls=1'", and also "make test-all RUN_OPTS='--jit-wait --jit-min-calls=5'" if you're using some cache of method calls? This testing strategy was used for merging Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Merge MJIT infrastructure with conservative JIT compiler (Closed)" href="https://bugs.ruby-lang.org/issues/14235">#14235</a>, and it can make sure that JIT-ed code and RTL insns translated from stack insns work in many cases (some tests would fail by timeout, though).<br>
You're calling abort() for "Not implemented" insns like run_once, but I think it should just skip compiling that ISeq and continue, like the current trunk does. At least that would be needed to pass the tests.</p>
<ul></ul><p>On 02/19/2018 07:40 PM, <a href="mailto:takashikkbn@gmail.com" class="email">takashikkbn@gmail.com</a> wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: VM performance improvement proposal (Open)" href="https://bugs.ruby-lang.org/issues/12589">#12589</a> has been updated by k0kubun (Takashi Kokubun).</p>
<p>Great work on rtl_gen, Vladimir! Keeping both stack insns and RTL insns would be good for safe migration.</p>
</blockquote>
<p>Thank you, Takashi.</p>
<blockquote>
<blockquote>
<p>There were a lot of changes since then which will affect the code I am<br>
working on. The biggest one is Takashi Kokubun's work on MJIT for<br>
YARV. Another one is trace insns removal by Koichi Sasada.</p>
</blockquote>
<p>I hope your work on merging trunk into stack-rtl-mjit will be easy; that was the aim of Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Merge MJIT infrastructure with conservative JIT compiler (Closed)" href="https://bugs.ruby-lang.org/issues/14235">#14235</a>. mjit_compile takes rb_iseq_constant_body, and you should be able to read rtl_encoded from it. Merging it would make your work much more portable, since the merged MJIT infrastructure is already running on many RubyCIs.</p>
<p>After merging it, which includes some fixes for JIT in the test cases, could you try running "make test-all RUN_OPTS='--jit-wait --jit-min-calls=1'", and also "make test-all RUN_OPTS='--jit-wait --jit-min-calls=5'" if you're using some cache of method calls? This testing strategy was used for merging Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Merge MJIT infrastructure with conservative JIT compiler (Closed)" href="https://bugs.ruby-lang.org/issues/14235">#14235</a>, and it can make sure that JIT-ed code and RTL insns translated from stack insns work in many cases.<br>
You're calling abort() for "Not implemented" insns like run_once, but I think it should just skip compiling that ISeq and continue, like the current trunk does. At least that would be needed to pass the tests.</p>
</blockquote>
<p>Thank you for the tips. I am planning to start merging the trunk into the<br>
stack-rtl-mjit branch in a week or two.</p> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=704572018-02-20T04:42:00Zvmakarov (Vladimir Makarov)
<ul></ul><p>On 02/19/2018 05:17 PM, <a href="mailto:sam.saffron@gmail.com" class="email">sam.saffron@gmail.com</a> wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: VM performance improvement proposal (Open)" href="https://bugs.ruby-lang.org/issues/12589">#12589</a> has been updated by sam.saffron (Sam Saffron).</p>
<p>I just measured your branch using Discourse bench at: <a href="https://github.com/discourse/discourse/blob/master/script/bench.rb" class="external">https://github.com/discourse/discourse/blob/master/script/bench.rb</a></p>
<p>Looks like it is a bit slower than master:</p>
</blockquote>
<p>Trace insns are still generated on the branch. The current trunk does<br>
not generate them. The removal of trace insns by Koichi improved performance<br>
by about 10%. I believe the branch will be faster once the trace insns<br>
are also removed there, but it is hard to predict what the actual<br>
improvement will be after that.</p>
<blockquote>
<p>RTL:</p>
<pre><code>timings:
load_rails: 3749
ruby-version: 2.5.0-p-1
rss_kb: 247952
pss_kb: 236618
memorysize: 5.88 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 1
kernelversion: 4.4.0
</code></pre>
<p>Master</p>
<pre><code>timings:
load_rails: 3789
ruby-version: 2.6.0-p-1
rss_kb: 276588
pss_kb: 265237
memorysize: 5.88 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 1
kernelversion: 4.4.0
</code></pre>
<p>Very interesting to see the significant memory improvement, is that expected?</p>
</blockquote>
<p>No. Actually, I expected higher memory usage for RTL. It is hard for me to<br>
give a reason for the memory improvement right now. Once the current trunk<br>
is merged into the branch, I will be able to speculate more. At the moment the branch<br>
code is far behind (about 13 months behind) the current trunk.</p>
<blockquote>
<p>only env var I am running is: <code>RUBY_GLOBAL_METHOD_CACHE_SIZE: 131072</code></p>
</blockquote> Ruby master - Feature #12589: VM performance improvement proposalhttps://bugs.ruby-lang.org/issues/12589?journal_id=704582018-02-20T05:00:26Zsam.saffron (Sam Saffron)sam.saffron@gmail.com
<ul></ul><p>No problems, thank you Vladimir, let me know when you are ready for me to test again!</p>