<p>Ruby Issue Tracking System: Ruby master - Feature #14723: [WIP] sleepy GC<br>
<a href="https://bugs.ruby-lang.org/issues/14723" class="external">https://bugs.ruby-lang.org/issues/14723</a></p>
<hr>
<p>Updated by ko1 (Koichi Sasada) on 2018-04-29 04:52:45 UTC (journal 71709):</p>
<p>Could you give us a more detailed algorithm?</p>
<p>On 2018/04/29 12:57, <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: [WIP] sleepy GC (Open)" href="https://bugs.ruby-lang.org/issues/14723">#14723</a> has been reported by normalperson (Eric Wong).</p>
<hr>
<p>Feature <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: [WIP] sleepy GC (Open)" href="https://bugs.ruby-lang.org/issues/14723">#14723</a>: [WIP] sleepy GC<br>
<a href="https://bugs.ruby-lang.org/issues/14723" class="external">https://bugs.ruby-lang.org/issues/14723</a></p>
<ul>
<li>Author: normalperson (Eric Wong)</li>
<li>Status: Open</li>
<li>Priority: Normal</li>
<li>Assignee:</li>
<li>Target version:</li>
</ul>
<hr>
<p>The idea is to use "idle time", when the process is otherwise sleeping<br>
and using no CPU time, to perform GC. It makes sense because real-world<br>
traffic sees idle time due to network latency and waiting<br>
for user input.</p>
<p>Right now, it's Linux-only. Future patches will affect other sleeping<br>
functions:</p>
<p>IO.select, Kernel#sleep, Thread#join, Process.waitpid, etc...</p>
<p>I don't know if this patch can be implemented for win32; right<br>
now it's just dummy functions and that will be somebody else's<br>
job. But all pthreads platforms should eventually benefit.</p>
<p>Before this patch, the entropy-dependent script below takes 95MB<br>
consistently on my system. Now, depending on the amount of<br>
entropy on my system, it takes anywhere from 43MB to 75MB.</p>
<p>I'm using /dev/urandom to simulate real-world network latency<br>
variations. There is no improvement when using /dev/zero<br>
because the process is never idle.</p>
<pre><code>require 'net/http'
require 'digest/md5'
Thread.abort_on_exception = true
s = TCPServer.new('127.0.0.1', 0)
len = 1024 * 1024 * 1024
th = Thread.new do
  c = s.accept
  c.readpartial(16384)
  c.write("HTTP/1.0 200 OK\r\nContent-Length: #{len}\r\n\r\n")
  IO.copy_stream('/dev/urandom', c, len)
  c.close
end

addr = s.addr
Net::HTTP.start(addr[3], addr[1]) do |http|
  http.request_get('/') do |res|
    dig = Digest::MD5.new
    res.read_body { |buf| dig.update(buf) }
    puts dig.hexdigest
  end
end
</code></pre>
<p>The above script also depends on net/protocol using<br>
read_nonblock. Ordinary IO objects will need IO#nonblock=true<br>
to see benefits (because they never hit rb_wait_for_single_fd).</p>
<ul>
<li>gc.c (rb_gc_inprogress): new function<br>
(rb_gc_step): ditto</li>
<li>internal.h: declare prototypes for new gc.c functions</li>
<li>thread_pthread.c (gvl_contended_p): new function</li>
<li>thread_win32.c (gvl_contended_p): ditto (dummy)</li>
<li>thread.c (rb_wait_for_single_fd w/ ppoll):<br>
use new functions to perform GC while GVL is uncontended<br>
and GC is lazy sweeping or incremental marking<br>
<a href="https://blade.ruby-lang.org/ruby-core/86265">[ruby-core:86265]</a></li>
</ul>
<pre><code>
2 part patch broken out
https://80x24.org/spew/20180429035007.6499-2-e@80x24.org/raw
https://80x24.org/spew/20180429035007.6499-3-e@80x24.org/raw
Also on my "sleepy-gc" git branch @ git://80x24.org/ruby.git
---Files--------------------------------
sleepy-gc-wip-v1.diff (5.37 KB)
</code></pre>
</blockquote>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-04-29 06:04:42 UTC (journal 71711):</p>
<p>"Atdot.net" <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>Could you give us a more detailed algorithm?</p>
</blockquote>
<p>Pretty simple; I thought the patch was easy to read.</p>
<p>Background: we can use ppoll with a zero timeout<br>
({ .tv_sec = 0, .tv_nsec = 0 }) to check an FD and return immediately<br>
without releasing the GVL. This means we can quickly check an FD for<br>
readiness.</p>
<p>This is a quick check and even optimized inside the Linux kernel[1].</p>
<p>thread_pthread.c also tracks GVL contention using .waiting field.</p>
<p>I define GC-in-progress as (is_lazy_sweeping || is_incremental_marking)<br>
(the same condition as the gc.c:gc_rest() function).</p>
<p>Therefore, if the GVL is uncontended and GC has work to do, we use<br>
zero-timeout ppoll and do incremental GC work (incremental mark +<br>
lazy sweep) as long as we need to wait on the FD.</p>
<p>If GC is done or there is GVL contention, we fall back to the<br>
old code path and release the GVL.</p>
<p>For the do_select case, it might be more expensive because select()<br>
is inefficient for high FD numbers; but if a process is otherwise not<br>
doing anything, I think it's OK to burn extra cycles to perform<br>
GC sooner.</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-04-29 07:42:53 UTC (journal 71712):</p>
<p>Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>This is a quick check and even optimized inside the Linux kernel[1].</p>
</blockquote>
<p>Sorry, forgot link:<br>
[1] <a href="https://bogomips.org/mirrors/linux.git/tree/fs/select.c?h=v4.16#n851" class="external">https://bogomips.org/mirrors/linux.git/tree/fs/select.c?h=v4.16#n851</a><br>
/* Optimise the no-wait case */</p>
<p>epoll also optimizes for timeout == 0:<br>
<a href="https://bogomips.org/mirrors/linux.git/tree/fs/eventpoll.c?h=v4.16#n1754" class="external">https://bogomips.org/mirrors/linux.git/tree/fs/eventpoll.c?h=v4.16#n1754</a></p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-04-29 09:22:40 UTC (journal 71714):</p>
<p>Also, added "thread.c (do_select): perform GC if idle":</p>
<p><a href="https://80x24.org/spew/20180429090250.GA15634@dcvr/raw" class="external">https://80x24.org/spew/20180429090250.GA15634@dcvr/raw</a></p>
<p>And updated "sleepy-gc" git branch @ git://80x24.org/ruby.git<br>
to 10bcc1908601e6f35ebef5ff66476b5cea6da96c.</p>
<p>I'm not sure if native_sleep() is worth doing GC on in most<br>
cases (Mutex#lock, Queue#pop, ...) because that's waiting<br>
on local resources from other threads within our process.</p>
<p>Typical callers of rb_wait_for_single_fd and do_select wait<br>
on external events, so that means our own (Ruby) process is<br>
idle.</p>
<p>I guess Kernel#sleep can do GC work, too (not sure how common it<br>
is to use)</p>
<p>Process.waitpid, File#flock, and IO#fcntl(F_SETLKW) are the next<br>
targets, where we can try non-blocking operations and GC before<br>
trying the blocking equivalents.</p>
<p>Then, the next question is: do we start making all connected<br>
SOCK_STREAM sockets non-blocking by default again? (as in Ruby<br>
1.8)</p>
<p>I'm not sure about nonblock-by-default for pipes,<br>
SOCK_SEQPACKET, and listen sockets; because they have<br>
round-robin behavior which allows fair load distribution across<br>
forked processes.</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-04-29 11:03:54 UTC (journal 71715):</p>
<p>Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>I'm not sure if native_sleep() is worth doing GC on in most<br>
cases (Mutex#lock, Queue#pop, ...) because that's waiting<br>
on local resources from other threads within our process.</p>
</blockquote>
<p>Never mind: native_sleep benefits from this because local threads<br>
may release the GVL in ways which cannot trigger GC from select/ppoll.</p>
<p>Thus we need to rely on their dependent threads (using<br>
Queue#pop, ConditionVariable#wait or similar) to sleep and<br>
trigger GC:</p>
<p><a href="https://80x24.org/spew/20180429105029.GA23412@dcvr/raw" class="external">https://80x24.org/spew/20180429105029.GA23412@dcvr/raw</a></p>
<hr>
<p>Updated by sam.saffron (Sam Saffron) on 2018-04-29 23:50:19 UTC (journal 71722):</p>
<p>I really, really like this; it's a free performance boost with almost no downsides.</p>
<p>I guess the simplest way of measuring it would be to run something like the Discourse bench with and without the patch. In theory we should get better timings after the patch, because it decreases the odds that the various GC processes will run when the interpreter wants to run Ruby.</p>
<p>Implementation wise it seems like you only have it on rb_wait_for_single_fd, is there any way you can make this work with the pg gem? It just builds on libpq per: <a href="https://www.postgresql.org/docs/8.3/static/libpq-async.html" class="external">https://www.postgresql.org/docs/8.3/static/libpq-async.html</a> so maybe you would need to expose an end point for libpq to "trigger" partial gc processes just when you send a query?</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717262018-04-30T04:42:25Znormalperson (Eric Wong)normalperson@yhbt.net
<p><a href="mailto:sam.saffron@gmail.com" class="email">sam.saffron@gmail.com</a> wrote:</p>
<blockquote>
<p>I really, really like this; it's a free performance boost with<br>
almost no downsides.</p>
</blockquote>
<p>Almost... I need to revisit [PATCH 4/2] (native_sleep) due to<br>
th->status changes and finalizers running causing compatibility<br>
problems.</p>
<blockquote>
<p>I guess the simplest way of measuring it would be to run<br>
something like the Discourse bench with and without the patch. In<br>
theory we should get better timings after the patch, because it<br>
decreases the odds that the various GC processes will run when the<br>
interpreter wants to run Ruby.</p>
</blockquote>
<p>It depends on the benchmark: if a benchmark pins things at 100%<br>
CPU usage, then I expect no improvement. But I don't think<br>
real-world network servers are often at 100% CPU use.</p>
<blockquote>
<p>Implementation-wise, it seems you only have it on<br>
rb_wait_for_single_fd</p>
</blockquote>
<p>PATCH 3/2 added select() support, too</p>
<blockquote>
<p>, is there any way you can make this work<br>
with the pg gem? It just builds on libpq per:<br>
<a href="https://www.postgresql.org/docs/8.3/static/libpq-async.html" class="external">https://www.postgresql.org/docs/8.3/static/libpq-async.html</a> so<br>
maybe you would need to expose an end point for libpq to<br>
"trigger" partial gc processes just when you send a query?</p>
</blockquote>
<p>I'd need to look more deeply, but I recall 'pg' being one of the<br>
few gems which worked well with 1.8 threads because FDs were<br>
exposed for Ruby to select() on.</p>
<p>So I'm not sure what they're doing these days that gives<br>
the Ruby VM no way to distinguish between waiting on an external<br>
resource (FD) and doing something CPU-intensive locally.</p>
<p>I guess you can cheat for now and do:</p>
<pre><code>require 'io/wait'

Thread.new do
  r, w = IO.pipe
  loop { r.wait_readable(0.01) }
end
</code></pre>
<p>This will constantly do incremental mark + lazy sweep. But<br>
cross-thread free() is probably still bad on most malloc<br>
implementations...</p>
<p>If 4/2 worked reliably (tests pass, though...):</p>
<pre><code>Thread.new { loop { sleep(0.01) } }
</code></pre>
<p>(gotta run, back later-ish)</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-05-01 01:22:44 UTC (journal 71742):</p>
<p><a href="mailto:sam.saffron@gmail.com" class="email">sam.saffron@gmail.com</a> wrote:</p>
<blockquote>
<p>Implementation-wise, it seems you only have it on rb_wait_for_single_fd; is there any way you can make this work with the pg gem? It just builds on libpq per <a href="https://www.postgresql.org/docs/8.3/static/libpq-async.html" class="external">https://www.postgresql.org/docs/8.3/static/libpq-async.html</a>, so maybe you would need to expose an endpoint for libpq to "trigger" partial GC processes just when you send a query?</p>
</blockquote>
<p>Actually, it seems pg is using rb_thread_fd_select in<br>
some places, which will benefit from sleep detection here:</p>
<p>pgconn_block -> wait_socket_readable -> rb_thread_fd_select</p>
<p>The PG::Connection#async_exec/async_query/block<br>
methods will all hit that. So it looks like PG users can<br>
automatically benefit from this work (as well as some of the<br>
auto-fiber stuff).</p>
<p>That said, it looks like they're using rb_thread_fd_select on<br>
a single FD, and Linux users would be better off if they used<br>
rb_wait_for_single_fd instead. The latter has been optimized<br>
for Linux since 1.9.3 to avoid malloc and O(n) behavior based<br>
on FD number.</p>
<hr>
<p>Updated by ko1 (Koichi Sasada) on 2018-05-01 02:31:14 UTC (journal 71746):</p>
<p>My concerns are:</p>
<p>(1) Should we do a full GC (like GC.start) or a step of incremental marking/sweeping (to guarantee, or at least reduce, the worst-case stop time due to GC on every I/O operation)?<br>
(2) How do we know GC is required? (If we invoke GC.start on every (blocking) I/O operation, it will be harmful.)</p>
<p>Looking at your comment #2:</p>
<blockquote>
<p>Therefore, if the GVL is uncontended and GC has work to do, we use<br>
zero-timeout ppoll and do incremental GC work (incremental mark +<br>
lazy sweep) as long as we need to wait on the FD.</p>
</blockquote>
<p>the answers should be:</p>
<p>(1) do a step of incremental marking/sweeping<br>
(2) only when incrementally marking or sweeping</p>
<p>These are very reasonable to me.</p>
<p>My understanding of your proposal, in pseudocode, is (please correct me if it is wrong):</p>
<pre><code>def io_operation
  while true
    if !GVL.contended? &amp;&amp; GC.has_incremental_task?
      result = try_io(timeout: 0) # timeout = 0 means return immediately
      if result &gt; 0
        return result             # the FD was ready; we have a result
      else
        GC.do_step                # idle: do one increment of GC work
      end
    else
      GVL.release
      result = try_io(timeout: LONG_TIMEOUT)
      GVL.acquire
      return result
    end
  end
end
</code></pre>
<p>No problem for me.</p>
<p>However, your code <a href="https://80x24.org/spew/20180429035007.6499-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180429035007.6499-3-e@80x24.org/raw</a></p>
<pre><code>+int
+rb_gc_step(const rb_execution_context_t *ec)
+{
+ rb_objspace_t *objspace = rb_ec_vm_ptr(ec)->objspace;
+
+ gc_rest(objspace);
+
+ return rb_gc_inprogress(ec);
+}
</code></pre>
<p><code>gc_rest()</code> does all of the remaining steps. Is that intentional?</p>
<p>Another tiny comment:</p>
<blockquote>
<ul>
<li>static const struct timespec zero;</li>
</ul>
</blockquote>
<p><code>zero</code> doesn't seem to be initialized. Is that intentional?</p>
<hr>
<p>Note:</p>
<p>After introducing Guilds, getting the <code>contended</code> status will be high-cost (we need to use a lock to see this info).<br>
However, we can eliminate this check if we shrink the target: only one Guild (== current MRI).</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-05-01 03:22:34 UTC (journal 71751):</p>
<p><a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>My understanding of your proposal, in pseudocode, is (please correct me if it is wrong):</p>
</blockquote>
<p>Correct.</p>
<blockquote>
<p><code>gc_rest()</code> does all of the remaining steps. Is that intentional?</p>
</blockquote>
<p>I thought about that myself. I haven't measured the impact much and<br>
decided to have less code.</p>
<p>We can also try the following to favor sweep before mark:</p>
<pre><code>--- a/gc.c
+++ b/gc.c
@@ -6534,7 +6534,14 @@ rb_gc_step(const rb_execution_context_t *ec)
{
rb_objspace_t *objspace = rb_ec_vm_ptr(ec)->objspace;
- gc_rest(objspace);
+ if (is_lazy_sweeping(heap_eden)) {
+ gc_sweep_rest(objspace);
+ }
+ else if (is_incremental_marking(objspace)) {
+ PUSH_MARK_FUNC_DATA(NULL);
+ gc_marks_rest(objspace);
+ POP_MARK_FUNC_DATA();
+ }
return rb_gc_inprogress(ec);
}
</code></pre>
<blockquote>
<p>Another tiny comment:</p>
<blockquote>
<ul>
<li>static const struct timespec zero;</li>
</ul>
</blockquote>
<p><code>zero</code> doesn't seem to be initialized. Is that intentional?</p>
</blockquote>
<p>Yes, static and global variables are auto-initialized to zero.<br>
AFAIK this is true of all C compilers.</p>
<blockquote>
<p>Note:</p>
<p>After introducing Guilds, getting the <code>contended</code> status will be high-cost (we need to use a lock to see this info).<br>
However, we can eliminate this check if we shrink the target: only one Guild (== current MRI).</p>
</blockquote>
<p>So one objspace will be shared by different guilds?</p>
<p>We may use atomics to check. I think sweep phase can be made<br>
lock-free in the future.</p>
<p>Originally I wanted to make sweep lock-free before making this<br>
patch, but it seems unnecessary at the moment.</p>
<hr>
<p>Updated by ko1 (Koichi Sasada) on 2018-05-01 03:33:13 UTC (journal 71752):</p>
<p>On 2018/05/01 12:18, Eric Wong wrote:</p>
<blockquote>
<blockquote>
<p><code>gc_rest()</code> does all of the remaining steps. Is that intentional?</p>
</blockquote>
<p>I thought about that myself. I haven't measured impact much and<br>
decided to have less code.</p>
</blockquote>
<p>In the worst case, it takes a few seconds. We have an "incremental" mechanism, so<br>
we should use the same incremental technique here, too.</p>
<blockquote>
<blockquote>
<p>Another tiny comments:</p>
<blockquote>
<ul>
<li>static const struct timespec zero;</li>
</ul>
</blockquote>
<p><code>zero</code> doesn't seem to be initialized. intentional?</p>
</blockquote>
<p>Yes, static and global variables are auto-initialized to zero.<br>
AFAIK this is true of all C compilers.</p>
</blockquote>
<p>Sorry, I missed <code>static const</code>. Thank you.</p>
<blockquote>
<blockquote>
<p>Note:</p>
<p>After introducing Guilds, getting the <code>contended</code> status will be high-cost (we need to use a lock to see this info).<br>
However, we can eliminate this check if we shrink the target: only one Guild (== current MRI).</p>
</blockquote>
<p>So one objspace will be shared by different guilds?</p>
</blockquote>
<p>Yes.</p>
<blockquote>
<p>I think sweep phase can be made<br>
lock-free in the future.</p>
</blockquote>
<p>Agreed.</p>
<p>--<br>
// SASADA Koichi at atdot dot net</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-05-01 03:52:40 UTC (journal 71753):</p>
<p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>On 2018/05/01 12:18, Eric Wong wrote:</p>
<blockquote>
<blockquote>
<p><code>gc_rest()</code> does all of the remaining steps. Is that intentional?</p>
</blockquote>
<p>I thought about that myself. I haven't measured impact much and<br>
decided to have less code.</p>
</blockquote>
<p>In the worst case, it takes a few seconds. We have an "incremental" mechanism, so we<br>
should use the same incremental technique here, too.</p>
</blockquote>
<p>Oh sorry, I realize I was using the wrong gc.c functions :x<br>
Something like:</p>
<pre><code>--- a/gc.c
+++ b/gc.c
@@ -6533,8 +6533,12 @@ int
rb_gc_step(const rb_execution_context_t *ec)
{
rb_objspace_t *objspace = rb_ec_vm_ptr(ec)->objspace;
-
- gc_rest(objspace);
+ if (is_lazy_sweeping(&objspace->eden_heap)) {
+ gc_sweep_step(objspace, &objspace->eden_heap);
+ }
+ else if (is_incremental_marking(objspace)) {
+ /* FIXME TODO */
+ }
return rb_gc_inprogress(ec);
}
</code></pre>
<p>I haven't looked at incremental mark yet :x</p>
<hr>
<p>Updated by ko1 (Koichi Sasada) on 2018-05-01 03:52:40 UTC (journal 71754):</p>
<p>On 2018/05/01 12:47, Eric Wong wrote:</p>
<blockquote>
<p>Oh sorry, I realize I was using the wrong gc.c functions :x<br>
Something like:</p>
</blockquote>
<p>Thank you. No problem.</p>
<p>More performance checks would be great (to write a NEWS entry :))</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-05-01 08:52:46 UTC (journal 71758):</p>
<p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>On 2018/05/01 12:47, Eric Wong wrote:</p>
<blockquote>
<p>Oh sorry, I realize I was using the wrong gc.c functions :x<br>
Something like:</p>
</blockquote>
<p>Thank you. No problem.</p>
<p>More performance checks would be great (to write a NEWS entry :))</p>
</blockquote>
<p>I have some folks interested in backport for 2.4 and 2.5.<br>
Much of the code I write uses String#clear and other techniques<br>
to reduce memory too aggressively to benefit.<br>
I can make some patches to benchmark/ from existing examples<br>
in commit messages.</p>
<p>Anyway, v2 of the series is available:</p>
<pre><code>The following changes since commit 41f4ac6aa21588722a6323dbbc34274b7e9aec49:

  ast.c: use enum in switch for warnings (2018-05-01 06:55:43 +0000)

are available in the Git repository at:

  git://80x24.org/ruby.git sleepy-gc-v2

for you to fetch changes up to 9d1609d318821b11614da6f952acadf7d3a3e083:

  thread.c: native_sleep callers may perform GC (2018-05-01 07:57:21 +0000)
</code></pre>
<p>v2 updates:</p>
<ul>
<li>[PATCH 2/4] uses correct functions for incremental work</li>
<li>[PATCH 3/4] accounts for select(2) clobbering its timeval arg</li>
<li>[PATCH 4/4] totally redone; native_sleep callers are all rather<br>
complex and it can be improved in future patches</li>
</ul>
<hr>
<pre><code>Eric Wong (4):
  thread.c (timeout_prepare): common function
  gc: rb_wait_for_single_fd performs GC if idle (Linux)
  thread.c (do_select): perform GC if idle
  thread.c: native_sleep callers may perform GC
</code></pre>
<p>Individual patches available at:<br>
<a href="https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
<p>Also have a Tor .onion mirror if <a href="https://80x24.org/" class="external">https://80x24.org/</a> breaks again:<br>
<a href="http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-2-e@80x24.org/raw" class="external">http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-2-e@80x24.org/raw</a><br>
<a href="http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-3-e@80x24.org/raw</a><br>
<a href="http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-4-e@80x24.org/raw</a><br>
<a href="http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">http://hjrcffqmbrq6wope.onion/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
<hr>
<p>Updated by ko1 (Koichi Sasada) on 2018-05-02 02:22:55 UTC (journal 71777):</p>
<p>On 2018/05/01 17:46, Eric Wong wrote:</p>
<blockquote>
<p>Individual patches available at:<br>
<a href="https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
</blockquote>
<p>I'm not sure how to see all of the diffs in one patch. Do you have a way?</p>
<p>Anyway, small comments:</p>
<blockquote>
<p><a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a></p>
</blockquote>
<blockquote>
<ul>
<li>/* TODO: should this check is_incremental_marking() ? */</li>
</ul>
</blockquote>
<p>Is there any problem with checking it?</p>
<blockquote>
<p>+rb_gc_step(const rb_execution_context_t *ec)</p>
</blockquote>
<p>How about adding an assertion that rb_gc_inprogress() returns true?</p>
<pre><code>--- a/internal.h
+++ b/internal.h
@@ -1290,6 +1290,10 @@ void rb_gc_writebarrier_remember(VALUE obj);
 void ruby_gc_set_params(int safe_level);
 void rb_copy_wb_protected_attribute(VALUE dest, VALUE obj);

+struct rb_execution_context_struct;
+int rb_gc_inprogress(const struct rb_execution_context_struct *);
+int rb_gc_step(const struct rb_execution_context_struct *);
+
</code></pre>
<p>How about adding them to gc.h?</p>
<p><a href="https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw</a></p>
<p>I don't have enough knowledge to review it.<br>
Nobu?</p>
<p><a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
<blockquote>
<p>@@ -288,8 +294,17 @@ rb_mutex_lock(VALUE self)</p>
</blockquote>
<p>I can't understand why GC is needed at acquiring (and restarting) timing.<br>
Why?</p>
<p>For other functions, I have the same question.</p>
<hr>
<p>Updated by normalperson (Eric Wong) on 2018-05-02 02:52:48 UTC (journal 71778):</p>
<p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>On 2018/05/01 17:46, Eric Wong wrote:</p>
<blockquote>
<p>Individual patches available at:<br>
<a href="https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw</a><br>
<a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
</blockquote>
<p>I'm not sure how to see all of the diffs in one patch. Do you have a way?</p>
</blockquote>
<p>I fetch and run "git diff" locally, which gives me many options:</p>
<pre><code>REMOTE=80x24
git remote add $REMOTE git://80x24.org/ruby.git
git fetch $REMOTE
git diff $OLD $NEW
</code></pre>
<p>$OLD and $NEW are commits which "git request-pull" outputs in my previous<br>
emails:</p>
<pre><code>> The following changes since commit $OLD
>
> $OLD_SUBJECT
>
> are available in the Git repository at:
>
> git://80x24.org/ruby.git BRANCH
>
> for you to fetch changes up to $NEW
</code></pre>
<p>You can also:</p>
<pre><code>curl https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw \
     https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw \
     https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw \
     https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw \
     | git am
</code></pre>
<p>(I run scripts from my $EDITOR and mail client, of course :)</p>
<blockquote>
<p>Anyway, small comments:</p>
<blockquote>
<p><a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a></p>
</blockquote>
<blockquote>
<ul>
<li>/* TODO: should this check is_incremental_marking() ? */</li>
</ul>
</blockquote>
<p>Is there any problem with checking it?</p>
</blockquote>
<p>Probably no problem; it's an old comment. I originally only intended to<br>
do lazy sweep, since I had not studied incremental marking<br>
much.</p>
<blockquote>
<blockquote>
<p>+rb_gc_step(const rb_execution_context_t *ec)</p>
</blockquote>
<p>How about adding an assertion that rb_gc_inprogress() returns true?</p>
</blockquote>
<p>I don't think that's safe. For native_sleep callers, we release the<br>
GVL after calling rb_gc_step, so sometimes rb_gc_step becomes<br>
a no-op (because another thread took the GVL and did GC).</p>
<blockquote>
<pre><code>--- a/internal.h
+++ b/internal.h
@@ -1290,6 +1290,10 @@ void rb_gc_writebarrier_remember(VALUE obj);
 void ruby_gc_set_params(int safe_level);
 void rb_copy_wb_protected_attribute(VALUE dest, VALUE obj);

+struct rb_execution_context_struct;
+int rb_gc_inprogress(const struct rb_execution_context_struct *);
+int rb_gc_step(const struct rb_execution_context_struct *);
+
</code></pre>
<p>How about adding them to gc.h?</p>
</blockquote>
<p>Sure.</p>
<blockquote>
<p><a href="https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw</a></p>
<p>I don't have enough knowledge to review it.<br>
Nobu?</p>
<p><a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
<blockquote>
<p>@@ -288,8 +294,17 @@ rb_mutex_lock(VALUE self)</p>
</blockquote>
<p>I can't understand why GC is needed at acquiring (and restarting) timing.<br>
Why?</p>
<p>For other functions, I have the same question.</p>
</blockquote>
<p>For mutex_lock, it only does GC if it can't acquire the lock immediately.<br>
Since mutex_lock cannot proceed, it can probably do GC.</p>
<p>I release the GVL at mutex_lock before GC, since it needs to give<br>
the other thread a chance to release the mutex.</p>
<p>One problem I have now is threads in THREAD_STOPPED_FOREVER<br>
state cannot continuously perform GC if some other thread<br>
is constantly making garbage and never sleeping.</p>
<pre><code>nr = 100_000
th = Thread.new do
  File.open('/dev/urandom') do |rd|
    nr.times { rd.read(16384) }
  end
end

# no improvement, since it enters sleep and stays there
th.join

# instead, this works (but wastes battery if there's no garbage)
true until th.join(0.01)
</code></pre>
<p>So maybe we add heuristics for entering sleep for methods in<br>
thread.c and thread_sync.c and possibly continuing to schedule<br>
threads in THREAD_STOPPED_FOREVER state to enable them to<br>
perform cleanup. I don't think this is urgent, and we can<br>
ignore this case for now.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717792018-05-02T03:52:53Zko1 (Koichi Sasada)
<ul></ul><p>On 2018/05/02 11:49, Eric Wong wrote:</p>
<blockquote>
<p>I fetch and run "git diff" locally which gives me many options</p>
<pre><code>REMOTE=80x24
git remote add $REMOTE git://80x24.org/ruby.git
git fetch $REMOTE
git diff $OLD $NEW
</code></pre>
<p>$OLD and $NEW are commits which "git request-pull" outputs in my previous<br>
emails:</p>
<blockquote>
<p>The following changes since commit $OLD</p>
<p>$OLD_SUBJECT</p>
<p>are available in the Git repository at:</p>
<p>git://80x24.org/ruby.git BRANCH</p>
<p>for you to fetch changes up to $NEW</p>
</blockquote>
<p>You can also:</p>
<pre><code>curl https://80x24.org/spew/20180501080844.22751-2-e@80x24.org/raw \
  https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw \
  https://80x24.org/spew/20180501080844.22751-4-e@80x24.org/raw \
  https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw \
  | git am
</code></pre>
<p>(I run scripts from my $EDITOR and mail client, of course :)</p>
</blockquote>
<p>Great. Thank you!</p>
<blockquote>
<blockquote>
<blockquote>
<p>+rb_gc_step(const rb_execution_context_t *ec)</p>
</blockquote>
<p>How about adding an assertion that rb_gc_inprogress() returns true?</p>
</blockquote>
<p>I don't think that's safe. For native_sleep callers, we release the<br>
GVL after calling rb_gc_step, so sometimes rb_gc_step becomes<br>
a no-op (because another thread took the GVL and did GC).</p>
</blockquote>
<p>OK. I assumed that this "step" API was meant to be used together with<br>
"rb_gc_inprogress()", but that is not the case.</p>
<blockquote>
<blockquote>
<p><a href="https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-5-e@80x24.org/raw</a></p>
<blockquote>
<p>@@ -288,8 +294,17 @@ rb_mutex_lock(VALUE self)</p>
</blockquote>
<p>I can't understand why GC is needed at acquisition (and restart) time.<br>
Why?</p>
<p>For the other functions, I have the same question.</p>
</blockquote>
<p>For mutex_lock, it only does GC if it can't acquire immediately.<br>
Since mutex_lock cannot proceed, it can probably do GC.</p>
</blockquote>
<pre><code>+ if (mutex->th == th) {
+ mutex_locked(th, self);
+ }
+ if (do_gc) {
+ /*
+ * Likely no point in checking for GVL contention here
+ * this Mutex is already contended and we just yielded
+ * above.
+ */
+ do_gc = rb_gc_step(th->ec);
+ }
</code></pre>
<p>it should be <code>else if (do_gc)</code>, shouldn't it?</p>
<blockquote>
<p>One problem I have now is threads in THREAD_STOPPED_FOREVER<br>
state cannot continuously perform GC if some other thread<br>
is constantly making garbage and never sleeping.</p>
</blockquote>
<blockquote>
<pre><code> nr = 100_000
th = Thread.new do
File.open('/dev/urandom') do |rd|
nr.times { rd.read(16384) }
end
end
# no improvement, since it enters sleep and stays there
th.join
# instead, this works (but wastes battery if there's no garbage)
true until th.join(0.01)
</code></pre>
</blockquote>
<p>I'm not sure why this is a problem. The created thread does <code>read</code> and it can GC<br>
incrementally, or if <code>read</code> returns immediately, there is no need to<br>
step the GC further (the usual GC should be enough), especially for throughput.</p>
<blockquote>
<p>So maybe we add heuristics for entering sleep for methods in<br>
thread.c and thread_sync.c and possibly continuing to schedule<br>
threads in THREAD_STOPPED_FOREVER state to enable them to<br>
perform cleanup. I don't think this is urgent, and we can<br>
ignore this case for now.</p>
</blockquote>
<p>"Cleanup", meaning GC steps? I agree on both points (the requirement and the lack of urgency).</p>
<p>--<br>
// SASADA Koichi at atdot dot net</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717802018-05-02T04:12:28Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>OK. I assumed that this "step" API was meant to be used together with<br>
rb_gc_inprogress(), but that is not the case.</p>
</blockquote>
<p>Right, inprogress is only a hint. I will add a comment to that effect.<br>
As with ppoll + read, there is always a chance of work being "stolen"<br>
by other threads :)</p>
<blockquote>
<pre><code>+ if (mutex->th == th) {
+ mutex_locked(th, self);
+ }
+ if (do_gc) {
+ /*
+ * Likely no point in checking for GVL contention here
+ * this Mutex is already contended and we just yielded
+ * above.
+ */
+ do_gc = rb_gc_step(th->ec);
+ }
</code></pre>
<p>it should be <code>else if (do_gc)</code>, shouldn't it?</p>
</blockquote>
<p>Yes, I will fix.</p>
<blockquote>
<blockquote>
<p>One problem I have now is threads in THREAD_STOPPED_FOREVER<br>
state cannot continuously perform GC if some other thread<br>
is constantly making garbage and never sleeping.</p>
</blockquote>
<blockquote>
<pre><code> nr = 100_000
th = Thread.new do
File.open('/dev/urandom') do |rd|
nr.times { rd.read(16384) }
end
end
# no improvement, since it enters sleep and stays there
th.join
# instead, this works (but wastes battery if there's no garbage)
true until th.join(0.01)
</code></pre>
</blockquote>
<p>I'm not sure why this is a problem. The created thread does <code>read</code> and it can GC<br>
incrementally, or if <code>read</code> returns immediately, there is no need to step<br>
the GC further (the usual GC should be enough), especially for throughput.</p>
</blockquote>
<p>I suppose so. Note: a read on urandom won't hit rb_wait_for_single_fd<br>
to trigger GC(*); it will only trigger GC via string allocation.</p>
<p>(*) /dev/urandom can't return EAGAIN; only /dev/random can</p>
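This property of /dev/urandom is easy to check from Ruby; the snippet below is purely illustrative and not part of the patch:

```ruby
# A read from /dev/urandom never blocks (it cannot return EAGAIN; only
# /dev/random can), so even a nonblocking read succeeds immediately and
# the only allocation-side GC trigger is the string buffer for the data.
data = File.open('/dev/urandom') { |rd| rd.read_nonblock(16) }
puts data.bytesize # => 16
```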
<blockquote>
<blockquote>
<p>So maybe we add heuristics for entering sleep for methods in<br>
thread.c and thread_sync.c and possibly continuing to schedule<br>
threads in THREAD_STOPPED_FOREVER state to enable them to<br>
perform cleanup. I don't think this is urgent, and we can<br>
ignore this case for now.</p>
</blockquote>
<p>"Cleanup", meaning GC steps? I agree on both points (the requirement and the lack of urgency).</p>
</blockquote>
<p>Sure. Should I commit after adding "else" to mutex_lock?</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717812018-05-02T04:22:24Zko1 (Koichi Sasada)
<ul></ul><p>On 2018/05/02 13:08, Eric Wong wrote:</p>
<blockquote>
<p>Sure. Should I commit after adding "else" to mutex_lock?</p>
</blockquote>
<p>I want to ask you to introduce a "disable" macro (like USE_RGENGC) so we can measure<br>
the impact of this technique (and disable it to isolate issues). Please name<br>
it whatever you like.</p>
<p>Thanks,<br>
Koichi</p>
<p>--<br>
// SASADA Koichi at atdot dot net</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717822018-05-02T05:04:12Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>On 2018/05/02 13:08, Eric Wong wrote:</p>
<blockquote>
<p>Sure. Should I commit after adding "else" to mutex_lock?</p>
</blockquote>
<p>I want to ask you to introduce a "disable" macro (like USE_RGENGC) so we can measure<br>
the impact of this technique (and disable it to isolate issues). Please name it<br>
whatever you like.</p>
</blockquote>
<p>OK, I added RUBY_GC_SLEEPY_SWEEP and RUBY_GC_SLEEPY_MARK macros:</p>
<p>[PATCH 6/4] gc.c: allow disabling sleepy GC<br>
<a href="https://80x24.org/spew/20180502045248.GA3949@80x24.org/raw" class="external">https://80x24.org/spew/20180502045248.GA3949@80x24.org/raw</a></p>
<p>And missing "else":</p>
<p>[PATCH 5/4] thread_sync.c (mutex_lock): add missing else<br>
<a href="https://80x24.org/spew/20180502044255.GA30679@80x24.org/raw" class="external">https://80x24.org/spew/20180502044255.GA30679@80x24.org/raw</a></p>
<p>I also added some benchmarks, but I'm not sure depending on<br>
/dev/urandom is good, because performance across machines and<br>
kernel configurations can be very different.</p>
<p><a href="https://80x24.org/spew/20180502045714.GA5427@whir/raw" class="external">https://80x24.org/spew/20180502045714.GA5427@whir/raw</a></p>
<p>I need something which:<br>
a) doesn't compete for GVL<br>
b) takes a while</p>
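One way to satisfy both requirements is a forked child process; a minimal sketch of the idea (the 50 ms sleep is an arbitrary stand-in for "takes a while"):

```ruby
# Child process: burns wall-clock time without ever competing for the
# parent's GVL.
pid = fork { sleep 0.05 }

# Parent: blocks in waitpid -- exactly the kind of idle sleep that sleepy
# GC could use for marking/sweeping steps.
_pid, status = Process.waitpid2(pid)
puts status.success? # => true
```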
<p>Perhaps depending on fork() is fine, since it's just as<br>
unportable as /dev/urandom is.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717832018-05-02T05:19:34Zsam.saffron (Sam Saffron)sam.saffron@gmail.com
<ul></ul><p>I can confirm this has a MAJOR benefit for particular workloads with the pg gem, in particular if you are using async_exec (which most of us should be using).</p>
<pre><code>require 'pg'
require 'benchmark/ips'

$conn = PG.connect(dbname: 'postgres')

Benchmark.ips do |b|
  b.config(time: 10, warmup: 3)
  b.report("exec") do
    $conn.exec("SELECT generate_series(1,10000)").to_a
  end
  b.report("async exec") do
    $conn.async_exec("SELECT generate_series(1,10000)").to_a
  end
end
</code></pre>
<p>Before:</p>
<pre><code>sam@ubuntu pg_perf % ruby test.rb
Warming up --------------------------------------
exec 20.000 i/100ms
async exec 21.000 i/100ms
Calculating -------------------------------------
exec 212.760 (± 1.4%) i/s - 2.140k in 10.060122s
async exec 214.570 (± 1.9%) i/s - 2.163k in 10.084992s
sam@ubuntu pg_perf % ruby test.rb
Warming up --------------------------------------
exec 19.000 i/100ms
async exec 20.000 i/100ms
Calculating -------------------------------------
exec 202.603 (± 5.9%) i/s - 2.033k in 10.072578s
async exec 201.516 (± 6.0%) i/s - 2.020k in 10.062116s
</code></pre>
<p>After:</p>
<pre><code>sam@ubuntu pg_perf % ruby test.rb
Warming up --------------------------------------
exec 21.000 i/100ms
async exec 23.000 i/100ms
Calculating -------------------------------------
exec 211.320 (± 2.8%) i/s - 2.121k in 10.044445s
async exec 240.188 (± 1.7%) i/s - 2.415k in 10.057509s
sam@ubuntu pg_perf % ruby test.rb
Warming up --------------------------------------
exec 20.000 i/100ms
async exec 23.000 i/100ms
Calculating -------------------------------------
exec 209.644 (± 1.4%) i/s - 2.100k in 10.018850s
async exec 237.100 (± 2.1%) i/s - 2.392k in 10.092435s
</code></pre>
<p>So this moves us from 200-210 ops/s to 240 ops/s. This is a major perf boost; it remains to be seen whether it holds on the full Discourse bench, but I expect major improvements because waiting for SQL is very common in web apps.</p>
<p>I do not expect too much benefit in concurrent puma workloads, but for us in unicorn we should have a pretty nice boost.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717852018-05-02T06:33:23Zko1 (Koichi Sasada)
<ul></ul><p>On 2018/05/02 14:00, Eric Wong wrote:</p>
<blockquote>
<p>I also added some benchmarks, but I'm not sure depending on<br>
/dev/urandom is good, because performance across machines and<br>
kernel configurations can be very different.</p>
</blockquote>
<p>What about Sam's report?<br>
Sam, could you try the Discourse benchmark?</p>
<p>I'm not sure whether the pg test on <a href="/issues/14723">[ruby-core:86820]</a> is suitable.</p>
<p>--<br>
// SASADA Koichi at atdot dot net</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=717912018-05-02T08:22:48Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:sam.saffron@gmail.com" class="email">sam.saffron@gmail.com</a> wrote:</p>
<blockquote>
<p>require 'benchmark/ips'</p>
</blockquote>
<blockquote>
<p>So this moves us from 200-210 ops/s to 240 ops/s. This is a<br>
major perf boost; it remains to be seen whether it holds on the full<br>
Discourse bench, but I expect major improvements because waiting<br>
for SQL is very common in web apps.</p>
</blockquote>
<p>Thanks! I wasn't even aiming for a speed improvement.<br>
Any memory measurements?<br>
I guess benchmark/ips won't show that.</p>
<blockquote>
<p>I do not expect too much benefit in concurrent puma workloads,<br>
but for us in unicorn we should have a pretty nice boost.</p>
</blockquote>
<p>It really depends on CPU usage; I don't think it's common for<br>
any server to be using all available CPU at all times, so<br>
Ruby should be able to get background work done during<br>
wait states.</p>
<p>One difference in MT is that cross-thread malloc/free (malloc returns<br>
a pointer that is freed in another thread) doesn't perform well in most<br>
malloc implementations I've studied.</p>
<p>Though Ruby sometimes hits cross-thread malloc with or without<br>
sleepy GC, it may be more common with sleepy GC. Before sleepy<br>
GC, free() happens most in threads which malloc() most,<br>
so it gets returned to the correct arena/cache most often.</p>
<p>Haven't checked jemalloc in a while, but I remember cross-thread<br>
was weak there in the 4.x days; maybe it's improved. glibc<br>
wasn't terrible there and (I think it was) DJ Delorie was taking<br>
it into account in his updates; but I haven't kept up with that<br>
work. Forcing fewer arenas via MALLOC_ARENA_MAX also mitigates<br>
this problem.</p>
<p>I seem to recall Lockless Inc. malloc being REALLY good at<br>
cross-thread malloc/free, but it used too much memory overall in my<br>
experience. Cross-thread malloc/free can be a common pattern<br>
for message-passing systems.</p>
<p>Anyways, maybe this will encourage me to try getting wfcqueue<br>
into glibc malloc as I threatened to do in <a href="/issues/14718">[ruby-core:86731]</a><br>
(/me shrivels in fear of GNU indentation style)</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718212018-05-03T03:12:41Znoahgibbs (Noah Gibbs)
<ul></ul><p>I checked today's head-of-master with Rails Ruby Bench. The first run suggests a noticeable drop in performance between 2.6 preview 1 and head-of-master. It's not guaranteed that the drop is because of this change. I'll try to repro with more runs first, then see if this change seems responsible -- could be something else in 2.6. But I'm seeing a drop in RRB throughput from around 179 req/sec to around 170 req/sec, and a significant increase in variance between runs (from about 2.6 to about 11.9). This is with only 20 runs, though. I'll definitely get a lot more datapoints before I'm sure. But it's a large enough drop that it's probably <em>not</em> random noise.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718222018-05-03T03:17:28Znoahgibbs (Noah Gibbs)
<ul></ul><p>Ah, never mind. It looks like the Ruby I tested doesn't have the sleepy GC changes! So it's slower, but that's not the fault of this patch. Great. I'll check this patch against that as a baseline.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718232018-05-03T05:43:45Zsam.saffron (Sam Saffron)sam.saffron@gmail.com
<ul></ul><p>From my testing on the Discourse bench, the difference is pretty much not measurable.</p>
<p>Before patch</p>
<pre><code>Unicorn: (workers: 3)
Include env: false
Iterations: 200, Best of: 1
Concurrency: 1
---
categories:
50: 58
75: 65
90: 73
99: 123
home:
50: 62
75: 70
90: 86
99: 139
topic:
50: 60
75: 65
90: 72
99: 117
categories_admin:
50: 101
75: 106
90: 115
99: 210
home_admin:
50: 107
75: 114
90: 132
99: 211
topic_admin:
50: 115
75: 123
90: 134
99: 201
timings:
load_rails: 5444
ruby-version: 2.6.0-p-1
rss_kb: 196444
pss_kb: 139514
memorysize: 7.79 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 2
kernelversion: 4.15.0
rss_kb_23779: 309984
pss_kb_23779: 249785
rss_kb_23817: 307056
pss_kb_23817: 246738
rss_kb_23948: 304732
pss_kb_23948: 244364
</code></pre>
<p>After patch:</p>
<pre><code>Iterations: 200, Best of: 1
Concurrency: 1
---
categories:
50: 56
75: 61
90: 70
99: 116
home:
50: 63
75: 70
90: 77
99: 170
topic:
50: 61
75: 68
90: 77
99: 96
categories_admin:
50: 102
75: 111
90: 121
99: 182
home_admin:
50: 96
75: 102
90: 108
99: 205
topic_admin:
50: 109
75: 118
90: 130
99: 192
timings:
load_rails: 4987
ruby-version: 2.6.0-p-1
rss_kb: 196004
pss_kb: 137541
memorysize: 7.79 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 2
kernelversion: 4.15.0
rss_kb_16393: 306312
pss_kb_16393: 244353
rss_kb_16438: 307052
pss_kb_16438: 244942
rss_kb_16555: 305092
pss_kb_16555: 242997
</code></pre>
<p>Nothing really sticks out as an across-the-board improvement, though some of the benches are a bit faster; memory is almost unaffected. It is no worse than head, but it is also not easy to measure how much better it is; we may need to repeat with significantly more iterations to remove noise.</p>
<p>I do want to review Discourse carefully to ensure we are using async_exec everywhere... will do so later today.</p>
<p>Eric if you feel like trying out the bench, clone: <a href="https://github.com/discourse/discourse.git" class="external">https://github.com/discourse/discourse.git</a> and run ruby script/bench.rb</p>
<p>I also have some allocator benches you can play with at: <a href="https://github.com/SamSaffron/allocator_bench.git" class="external">https://github.com/SamSaffron/allocator_bench.git</a></p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718252018-05-03T06:07:06Zsam.saffron (Sam Saffron)sam.saffron@gmail.com
<ul></ul><p>I found one place where we were not using async_exec, so I changed it to use async_exec... these are the revised numbers:</p>
<p>Pre patch:</p>
<pre><code>categories:
50: 53
75: 59
90: 63
99: 76
home:
50: 57
75: 64
90: 68
99: 136
topic:
50: 58
75: 61
90: 68
99: 110
categories_admin:
50: 96
75: 102
90: 108
99: 184
home_admin:
50: 104
75: 112
90: 122
99: 213
topic_admin:
50: 115
75: 121
90: 139
99: 184
timings:
load_rails: 4936
ruby-version: 2.6.0-p-1
rss_kb: 193500
pss_kb: 134214
memorysize: 7.79 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 2
kernelversion: 4.15.0
rss_kb_21961: 305616
pss_kb_21961: 243099
rss_kb_22009: 304644
pss_kb_22009: 241972
rss_kb_22133: 304108
pss_kb_22133: 241388
</code></pre>
<p>Post patch:</p>
<pre><code>Your Results: (note for timings- percentile is first, duration is second in millisecs)
Unicorn: (workers: 3)
Include env: false
Iterations: 200, Best of: 1
Concurrency: 1
---
categories:
50: 54
75: 59
90: 66
99: 84
home:
50: 57
75: 62
90: 65
99: 139
topic:
50: 56
75: 61
90: 67
99: 104
categories_admin:
50: 95
75: 99
90: 106
99: 179
home_admin:
50: 99
75: 103
90: 106
99: 195
topic_admin:
50: 109
75: 114
90: 118
99: 163
timings:
load_rails: 4851
ruby-version: 2.6.0-p-1
rss_kb: 195164
pss_kb: 136384
memorysize: 7.79 GB
virtual: vmware
architecture: amd64
operatingsystem: Ubuntu
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
physicalprocessorcount: 2
kernelversion: 4.15.0
rss_kb_19222: 305328
pss_kb_19222: 243213
rss_kb_19267: 303188
pss_kb_19267: 240952
rss_kb_19384: 307992
pss_kb_19384: 245778
</code></pre>
<p>perf change seems hard to pin down properly.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718262018-05-03T07:13:05Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:sam.saffron@gmail.com" class="email">sam.saffron@gmail.com</a> wrote:</p>
<blockquote>
<p>perf change seems a tiny bit more noticeable.</p>
</blockquote>
<p>Thanks for benchmarking! Disappointing results, though.</p>
<p>Is this with my latest updates upthread, with the do_select<br>
and gc_*_continue functions?</p>
<p>Can you try #define RUBY_GC_SLEEPY_MARK 0 in gc.h to disable<br>
incremental marking on sleep?</p>
<p>I wonder if incremental marking is causing too many objects<br>
to be marked when it is triggered deep in the stack.</p>
<p>Marking is best done when the stack is shallow (where unicorn<br>
calls IO.select), but could be harmful when the stack is deep<br>
(where Pg calls rb_thread_fd_select).</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718402018-05-04T21:32:56Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>Marking is best done when the stack is shallow (where unicorn<br>
calls IO.select), but could be harmful when the stack is deep<br>
(where Pg calls rb_thread_fd_select).</p>
</blockquote>
<p>Also, I think we need to start GC if no sweeping/marking is<br>
in progress.</p>
<ul></ul><p>I updated pg to <a href="https://github.com/ged/ruby-pg/commit/3dfd36bf08ba49cf87410ae73edb2dabbf715a2b" class="external">use rb_wait_for_single_fd()</a> instead of rb_thread_fd_select(). The change is already on the master branch: <a href="https://github.com/ged/ruby-pg" class="external">https://github.com/ged/ruby-pg</a> . However, although the speedup is measurable in micro benchmarks, it is not in a Rails context.</p>
<p>In pg, all IO-bound methods release the GVL, but not all methods use rb_wait_for_single_fd() or rb_thread_fd_select() to wait for server answers; only methods of the async API do this. This is why I proposed a change to Rails to make use of the async API only: <a href="https://github.com/rails/rails/pull/32820" class="external">https://github.com/rails/rails/pull/32820</a></p>
<p>Hope that helps...</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718732018-05-05T14:53:29Znoahgibbs (Noah Gibbs)
<ul></ul><p>For Rails Ruby Bench (large concurrent Rails benchmark based on Discourse), measuring sleepy-gc-v3 branch versus the previous commit, the difference isn't measurable. No detectable speedup. The sleepy-gc batch of runs has a higher variance in runtime, but that may just be an outlier or two - I'd need a lot more samples to see if it consistently gives higher variance. The variance is often randomly a bit different batch-to-batch.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718782018-05-06T03:33:05Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:the.codefolio.guy@gmail.com" class="email">the.codefolio.guy@gmail.com</a> wrote:</p>
<blockquote>
<p>For Rails Ruby Bench (large concurrent Rails benchmark based<br>
on Discourse),</p>
</blockquote>
<p>So multithreaded? Do you have any info on how much CPU<br>
time was being used without these changes?</p>
<p>If the CPU usage was already 100% or close before the patch,<br>
then I'd expect no benefit.</p>
<p>So yeah, for benchmarking, I would mainly expect it to show up<br>
more in single-threaded benchmarks.</p>
<p>But for practical use outside of benchmarks, I think there'll be<br>
a benefit in all <100% CPU usage scenarios (which is typical<br>
of real-world traffic, but not benchmarks).</p>
<blockquote>
<p>measuring sleepy-gc-v3 branch versus the<br>
previous commit, the difference isn't measurable. No<br>
detectable speedup. The sleepy-gc batch of runs has a higher<br>
variance in runtime, but that may just be an outlier or two -<br>
I'd need a lot more samples to see if it consistently gives<br>
higher variance. The variance is often randomly a bit<br>
different batch-to-batch.</p>
</blockquote>
<p>The variance might have something to do with the malloc and<br>
settings used (arena count), especially when multithreaded.<br>
(see what I wrote previously about cross-thread malloc/free).</p>
<p>I experimented with some GC-start-on-sleep the other day,<br>
but didn't get very far toward having a small reproducible<br>
benchmark case.</p>
<p>If anybody wants to give me SSH access to a machine they run<br>
100% Free Software benchmarks on, my public key has always been<br>
here:</p>
<pre><code>https://yhbt.net/id_rsa.pub
I will only use a terminal for Ruby development, no GUIs.
</code></pre>
<p>Thanks (also won't be around computers much for another day or two)</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718902018-05-07T08:12:45Zko1 (Koichi Sasada)
<ul></ul><p>On 2018/05/05 6:32, Eric Wong wrote:</p>
<blockquote>
<p>Also, I think we need to start GC if no sweeping/marking is<br>
in progress.</p>
</blockquote>
<p>This is a problem we need to discuss.</p>
<p>Good: It can increase GC cleaning without additional overhead.</p>
<p>Bad1: However, if we kick off unnecessary GCs, it could be a huge penalty.<br>
Bad2: Also, if we run multiple Ruby processes, it can add system-wide<br>
overhead, consuming CPU resources that other processes could use.</p>
<p>--<br>
// SASADA Koichi at atdot dot net</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718932018-05-07T09:52:53Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Koichi Sasada <a href="mailto:ko1@atdot.net" class="email">ko1@atdot.net</a> wrote:</p>
<blockquote>
<p>On 2018/05/05 6:32, Eric Wong wrote:</p>
<blockquote>
<p>Also, I think we need to start GC if no sweeping/marking is<br>
in progress.</p>
</blockquote>
<p>This is a problem we need to discuss.</p>
<p>Good: It can increase GC cleaning without additional overhead.</p>
<p>Bad1: However, if we kick off unnecessary GCs, it could be a huge penalty.</p>
</blockquote>
<p>Right. Minor GC is still expensive; I wonder if we can make it<br>
cheaper or semi-incremental. It can be incremental until the next<br>
newobj_of happens, at which point newobj_of must finish the<br>
minor GC immediately. This may help some IO cases if object<br>
creation can be avoided.</p>
<p>For tracking GC statistics, we should probably keep them in<br>
rb_execution_context_t instead of current globals using atomics.<br>
To recover the most memory from GC, we want to do gc_mark_roots</p>
<ol>
<li>from the ec with the most allocations</li>
<li>when it is at the shallowest stack point</li>
</ol>
<p>This is tricky in MT situations :<</p>
<blockquote>
<p>Bad2: Also, if we run multiple Ruby processes, it can add system-wide overhead,<br>
consuming CPU resources that other processes could use.</p>
</blockquote>
<p>I hope this feature can even reduce the need for extra processes.<br>
In other words, instead of having an N:1 process:core ratio, it<br>
could become (N/2):1 or something.</p>
<p>Now I need sleep myself :<</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=718962018-05-07T15:15:28Znoahgibbs (Noah Gibbs)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>So multithreaded? Do you have any info on the amount of CPU<br>
time was being used without these changes?</p>
</blockquote>
<p>Highly multithreaded. Normally the CPU usage stays at nearly 100%. So I agree, this is not a great benchmark to show the benefit. The main result is that it didn't slow it down :-)</p>
<blockquote>
<p>The variance might have something to do with the malloc and<br>
settings used (arena count), especially when multithreaded.<br>
(see what I wrote previously about cross-thread malloc/free).</p>
</blockquote>
<p>Yeah. I'll need to run the benchmark a lot of times to be sure. It's not a large effect, if it's real.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=719962018-05-14T20:42:40Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>I wrote:</p>
<blockquote>
<p>For tracking GC statistics, we should probably keep them in<br>
rb_execution_context_t instead of current globals using atomics.<br>
To recover the most memory from GC, we want to do gc_mark_roots</p>
</blockquote>
<p>That may be too complex for now; instead, this patch (on top of the existing<br>
sleepy GC):</p>
<p><a href="https://80x24.org/spew/20180514201509.28069-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20180514201509.28069-1-e@80x24.org/raw</a></p>
<p>While the effect on big Rails apps seems minimal, I think the<br>
significant improvements for small scripts are still helpful, and<br>
we can build on top of them. I am already satisfied with the<br>
improvement in a Net::HTTP example from the first patch:</p>
<p><a href="https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw" class="external">https://80x24.org/spew/20180501080844.22751-3-e@80x24.org/raw</a></p>
<p>Since all new behavior changes can be easily disabled via gc.h,<br>
I propose we commit the current changes to trunk for now to<br>
gain more testing and feedback.</p>
<p>Current series is up to 8 patches, but I will squash<br>
"thread_sync.c (mutex_lock): add missing else" into<br>
"thread.c: native_sleep callers may perform GC".</p>
<p>The following changes since commit 6f0de6ed98e669e915455569fb4dae9022cb47b8:</p>
<p>error.c: check redefined backtrace result (2018-05-14 08:33:14 +0000)</p>
<p>are available in the Git repository at:</p>
<p>git://80x24.org/ruby.git sleepy-gc-v6</p>
<p>for you to fetch changes up to 6944014696bea793603d47db6dba0a1e83f1e430:</p>
<p>gc.c: enter sleepy GC start (2018-05-14 20:25:29 +0000)</p>
<hr>
<p>Eric Wong (8):<br>
thread.c (timeout_prepare): common function<br>
gc: rb_wait_for_single_fd performs GC if idle (Linux)<br>
thread.c (do_select): perform GC if idle<br>
thread.c: native_sleep callers may perform GC<br>
thread_sync.c (mutex_lock): add missing else<br>
benchmark: add benchmarks for sleepy GC<br>
gc.c: allow disabling sleepy GC<br>
gc.c: enter sleepy GC start</p>
<p>benchmark/bm_vm3_gc_io_select.rb | 30 +++++<br>
benchmark/bm_vm3_gc_io_wait.rb | 21 ++++<br>
benchmark/bm_vm3_gc_join_timeout.rb | 11 ++<br>
benchmark/bm_vm3_gc_remote_free_spmc.rb | 15 +++<br>
benchmark/bm_vm3_gc_szqueue.rb | 14 +++<br>
gc.c | 55 +++++++++<br>
gc.h | 28 +++++<br>
thread.c | 197 +++++++++++++++++++++-----------<br>
thread_pthread.c | 6 +<br>
thread_sync.c | 21 +++-<br>
thread_win32.c | 6 +<br>
11 files changed, 337 insertions(+), 67 deletions(-)<br>
create mode 100644 benchmark/bm_vm3_gc_io_select.rb<br>
create mode 100644 benchmark/bm_vm3_gc_io_wait.rb<br>
create mode 100644 benchmark/bm_vm3_gc_join_timeout.rb<br>
create mode 100644 benchmark/bm_vm3_gc_remote_free_spmc.rb<br>
create mode 100644 benchmark/bm_vm3_gc_szqueue.rb</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=721382018-05-17T21:02:30Znoahgibbs (Noah Gibbs)
<ul></ul><p>I've now run a lot more batches of Rails Ruby Bench - 100 batches of 10,000 HTTP requests/batch. I am <em>definitely</em> seeing lower performance and more variance with Sleepy GC. Overall, Sleepy GC gets 169.4 req/sec mean throughput with variance of 6.4, while the previous commit gets 177.0 req/sec throughput with a variance of 3.8. So Sleepy GC v3 costs about 4% performance for Rails Ruby Bench running flat-out and completely parallel.</p> Ruby master - Feature #14723: [WIP] sleepy GChttps://bugs.ruby-lang.org/issues/14723?journal_id=721542018-05-18T07:33:16Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:the.codefolio.guy@gmail.com" class="email">the.codefolio.guy@gmail.com</a> wrote:</p>
<blockquote>
<p>Overall, Sleepy GC gets 169.4 req/sec mean throughput with<br>
variance of 6.4, while the previous commit gets 177.0 req/sec<br>
throughput with a variance of 3.8.</p>
</blockquote>
<p>Thanks for testing! I think we will need to work on increasing<br>
the granularity of the steps. The variance actually bothers me a<br>
bit more.</p>
<p>I'll have to work on increasing the granularity of the marking<br>
and sweeping (which may hurt throughput in apps without any<br>
IO-wait at all...). And I won't be around much the next few days.</p>
<p>Also, our malloc accounting is quite expensive(*), and I think<br>
we can do some lazy sweeping before making big allocations.</p>
<p>(*) <a href="https://bugs.ruby-lang.org/issues/10238" class="external">https://bugs.ruby-lang.org/issues/10238</a></p>
<p>Ruby master - Feature #14723: [WIP] sleepy GC<br>
<a href="https://bugs.ruby-lang.org/issues/14723?journal_id=72159" class="external">https://bugs.ruby-lang.org/issues/14723?journal_id=72159</a><br>
2018-05-18T09:13:00Z, normalperson (Eric Wong), normalperson@yhbt.net</p>
<ul></ul><blockquote>
<p>I'll have to work on increasing granularity of the marking and<br>
sweeping (which may hurt throughput in apps without IO-wait at<br>
all...). And I won't be around much the next few days..</p>
</blockquote>
<p>Maybe the unlink_limit can be lowered if we are sweeping more<br>
frequently:</p>
<p><a href="https://80x24.org/spew/20180518085819.14892-9-e@80x24.org/raw" class="external">https://80x24.org/spew/20180518085819.14892-9-e@80x24.org/raw</a></p>
<p>We may also add more sweeping around more malloc() calls.</p>
<p>I also wonder if you can help narrow down which feature causes<br>
the most damage to performance:</p>
<p>RUBY_GC_SLEEPY_SWEEP || RUBY_GC_SLEEPY_MARK || RUBY_GC_SLEEPY_START</p>
<p>Perhaps try defining RUBY_GC_SLEEPY_MARK and<br>
RUBY_GC_SLEEPY_START to 0 in gc.h and see if that helps.<br>
(Originally, I only intended to try sleepy sweep.)</p>
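<p>(Editor's note: if these are plain 0/1 macro tunables in gc.h as the message suggests, the override could look like the fragment below. The exact guard style in the sleepy-gc patch series is an assumption; check the branch before editing.)</p>

```c
/* gc.h sketch: disable the sleepy mark and start phases while keeping
 * sleepy sweep, to isolate which phase costs throughput. Assumes the
 * patch defines these as 0/1 tunables. */
#undef  RUBY_GC_SLEEPY_MARK
#define RUBY_GC_SLEEPY_MARK  0

#undef  RUBY_GC_SLEEPY_START
#define RUBY_GC_SLEEPY_START 0

/* RUBY_GC_SLEEPY_SWEEP is left at its default */
```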
<p>Anyways, rebased against current-ish trunk, since I had some<br>
fixes and minor improvements which also conflicted with this:</p>
<p>The following changes since commit 74724107e96228c34f92a1f210342891bb29400e:</p>
<p>thread.c (rb_wait_for_single_fd): do not leak EINTR on timeout (2018-05-18 08:01:07 +0000)</p>
<p>are available in the Git repository at:</p>
<p>git://80x24.org/ruby.git sleepy-gc-v7</p>
<p>for you to fetch changes up to f6745fe9acd3453a38eb646006a5e2703732f973:</p>
<p>gc.c: lower sweep unlink limit and make tunable in gc.h (2018-05-18 08:51:45 +0000)</p>
<hr>
<p>Eric Wong (8):<br>
thread.c (timeout_prepare): common function<br>
gc: rb_wait_for_single_fd performs GC if idle (Linux)<br>
thread.c (do_select): perform GC if idle<br>
thread.c: native_sleep callers may perform GC<br>
benchmark: add benchmarks for sleepy GC<br>
gc.c: allow disabling sleepy GC<br>
gc.c: enter sleepy GC start<br>
gc.c: lower sweep unlink limit and make tunable in gc.h</p>
<p>benchmark/bm_vm3_gc_io_select.rb | 30 +++++<br>
benchmark/bm_vm3_gc_io_wait.rb | 21 ++++<br>
benchmark/bm_vm3_gc_join_timeout.rb | 11 ++<br>
benchmark/bm_vm3_gc_remote_free_spmc.rb | 15 +++<br>
benchmark/bm_vm3_gc_szqueue.rb | 14 +++<br>
gc.c | 57 +++++++++-<br>
gc.h | 31 ++++++<br>
thread.c | 191 +++++++++++++++++++++-----------<br>
thread_pthread.c | 6 +<br>
thread_sync.c | 21 +++-<br>
thread_win32.c | 6 +<br>
11 files changed, 337 insertions(+), 66 deletions(-)<br>
create mode 100644 benchmark/bm_vm3_gc_io_select.rb<br>
create mode 100644 benchmark/bm_vm3_gc_io_wait.rb<br>
create mode 100644 benchmark/bm_vm3_gc_join_timeout.rb<br>
create mode 100644 benchmark/bm_vm3_gc_remote_free_spmc.rb<br>
create mode 100644 benchmark/bm_vm3_gc_szqueue.rb</p>