Project

General

Profile

Feature #13517

[PATCH] reduce rb_mutex_t size from 160 to 80 bytes on 64-bit

Added by normalperson (Eric Wong) 9 months ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:80913]

Description

Instead of relying on a native condition variable and mutex for
every Ruby Mutex object, use a doubly linked-list to implement a
waiter queue in the Mutex.  The immediate benefit of this is
reducing the size of every Mutex object, as some projects have
many objects requiring synchronization.

In the future, this technique using a linked-list and on-stack
list node (struct mutex_waiter) should allow us to easily
transition to M:N threading model, as we can avoid the native
thread dependency to implement Mutex.

We already do something similar for autoload in variable.c,
and this was inspired by the Linux kernel wait queue (as
ccan/list is inspired by the Linux kernel linked-list).

Finaly, there are big performance improvements for Mutex
benchmarks, especially in contended cases:

measure target: real

name            |trunk  |built
----------------|------:|------:
loop_whileloop2 |  0.149|  0.148
vm2_mutex*      |  0.893|  0.651
vm_thread_mutex1|  0.809|  0.624
vm_thread_mutex2|  2.608|  0.628
vm_thread_mutex3| 28.227|  0.881

Speedup ratio: compare with the result of `trunk' (greater is better)

name            |built
----------------|------:
loop_whileloop2 |  1.002
vm2_mutex*      |  1.372
vm_thread_mutex1|  1.297
vm_thread_mutex2|  4.149
vm_thread_mutex3| 32.044

Tested on AMD FX-8320 8-core at 3.5GHz

Associated revisions

Revision 58604
Added by normal 9 months ago

reduce rb_mutex_t size from 160 to 80 bytes on 64-bit

Instead of relying on a native condition variable and mutex for
every Ruby Mutex object, use a doubly linked-list to implement a
waiter queue in the Mutex. The immediate benefit of this is
reducing the size of every Mutex object, as some projects have
many objects requiring synchronization.

In the future, this technique using a linked-list and on-stack
list node (struct mutex_waiter) should allow us to easily
transition to M:N threading model, as we can avoid the native
thread dependency to implement Mutex.

We already do something similar for autoload in variable.c,
and this was inspired by the Linux kernel wait queue (as
ccan/list is inspired by the Linux kernel linked-list).

Finaly, there are big performance improvements for Mutex
benchmarks, especially in contended cases:

measure target: real

name trunk built
loop_whileloop2 0.149 0.148
vm2_mutex* 0.893 0.651
vm_thread_mutex1 0.809 0.624
vm_thread_mutex2 2.608 0.628
vm_thread_mutex3 28.227 0.881

Speedup ratio: compare with the result of `trunk' (greater is better)

name built
loop_whileloop2 1.002
vm2_mutex* 1.372
vm_thread_mutex1 1.297
vm_thread_mutex2 4.149
vm_thread_mutex3 32.044

Tested on AMD FX-8320 8-core at 3.5GHz

  • thread_sync.c (struct mutex_waiter): new on-stack struct (struct rb_mutex_struct): remove native lock/cond, use ccan/list (rb_mutex_num_waiting): new function for debug_deadlock_check (mutex_free): remove native_destroy (mutex_alloc): initialize waitq, remove nativeinitialize (rb_mutex_trylock): remove native_mutex{lock,unlock} (lock_func): remove (lock_interrupt): remove (rb_mutex_lock): rewrite waiting path to use native_sleep + ccan/list (rb_mutex_unlock_th): rewrite to wake up from native_sleep using rb_threadptr_interrupt (rb_mutex_abandon_all): empty waitq
  • thread.c (debug_deadlock_check): update for new struct (rb_check_deadlock): ditto [Feature #13517]

Revision 59284
Added by normal 7 months ago

NEWS: note [Feature #13517] is Linux-only (no side-effects on _*nonblock)

Revision 59385
Added by normal 6 months ago

NEWS: add entries for thread_sync.c changes

I'm slightly worried about some external code subclassing
ConditionVariable, Queue, and SizedQueue and relying on them
being Structs. However, they only started being Structs with
Ruby 2.1, and were implemented in pure Ruby before that; so
hopefully nobody notices that implementation detail.

Also, note the Mutex change as it may affect program design
when space can be saved.

History

#1 [ruby-core:80918] Updated by normalperson (Eric Wong) 9 months ago

normalperson@yhbt.net wrote:

https://bugs.ruby-lang.org/issues/13517

For who care about 32-bit, single-core x86, here are my
Pentium M (Centrino) @ 1.6GHz numbers:

Size reduction of Mutex on 32-bit is 112 => 40 bytes

minimum results in each 3 measurements.
Execution time (sec)
name trunk built
loop_whileloop2 0.554 0.554
vm2_mutex* 3.136 2.217
vm_thread_mutex1 2.783 2.186
vm_thread_mutex2 2.907 2.174
vm_thread_mutex3 9.740 2.586

Speedup ratio: compare with the result of `trunk' (greater is better)
name built
loop_whileloop2 0.999
vm2_mutex* 1.414
vm_thread_mutex1 1.273
vm_thread_mutex2 1.337
vm_thread_mutex3 3.766

In the future, I think the cond_waiting flag can be moved into
a FL_USER flag, too.

But I also want to try similar changes to avoid Array usage in
Queue, SizedQueue, and ConditionVariable classes and rely on
ccan/list + stack for waiters. I will convert from T_STRUCT to
T_DATA.

#2 [ruby-core:80973] Updated by normalperson (Eric Wong) 9 months ago

normalperson@yhbt.net wrote:

Feature #13517: [PATCH] reduce rb_mutex_t size from 160 to 80 bytes on 64-bit
https://bugs.ruby-lang.org/issues/13517

Any comment? I would like to commit this, soon.

Thanks.

#3 [ruby-core:80974] Updated by ko1 (Koichi Sasada) 9 months ago

At a glance, it seems nice.
But I need to time to check deeply.
I'll check with 'Misc #13514'.

Please wait these days. In Japan, now we have holiday week. I'll check on these days.

Thanks,
Koichi

#4 [ruby-core:81024] Updated by ko1 (Koichi Sasada) 9 months ago

sorry for late response.
I have no objection about this patch. thank you.

one question.

    list_for_each_safe(&mutex->waitq, cur, next, node) {
        list_del_init(&cur->node);
        switch (cur->th->state) {
        case THREAD_KILLED:
        continue;
        case THREAD_STOPPED:
        case THREAD_RUNNABLE:
        case THREAD_STOPPED_FOREVER:
        rb_threadptr_interrupt(cur->th);
        goto found;
        }
    }

rb_mutex_lock() set th->status as THREAD_STOPPED_FOREVER before
native sleep, but the above code quoted from rb_mutex_unlock_th().

What kind of situation do you assume when the thread status is other
than THREAD_STOPPED_FOREVER?

Thanks,
Koichi

--
// SASADA Koichi at atdot dot net

#5 [ruby-core:81025] Updated by normalperson (Eric Wong) 9 months ago

SASADA Koichi ko1@atdot.net wrote:

sorry for late response.
I have no objection about this patch. thank you.

one question.

 list_for_each_safe(&mutex->waitq, cur, next, node) {
     list_del_init(&cur->node);
     switch (cur->th->state) {

Oops, that should be status, not state:

    switch (cur->th->status) {
 case THREAD_KILLED:
 continue;
 case THREAD_STOPPED:
 case THREAD_RUNNABLE:
 case THREAD_STOPPED_FOREVER:
 rb_threadptr_interrupt(cur->th);
 goto found;
 }

}
``
rb_mutex_lock()setth->statusasTHREAD_STOPPED_FOREVERbefore
native sleep, but the above code quoted from
rb_mutex_unlock_th()`.

What kind of situation do you assume when the thread status is other
than THREAD_STOPPED_FOREVER?

Back to your original question. THREAD_RUNNABLE is possible
if somebody uses Thread#run:

require 'thread'
m = Mutex.new
th = Thread.new do
sleep 0.1 # wait for main thread to get lock
m.synchronize do
sleep
end
end

m.synchronize do
sleep 0.2 # wait for th to block on m.synchronize
th.run
end

I am not sure about other statuses. Maybe exit/GC can trigger
THREAD_KILLED, the mutex_free->rb_mutex_unlock_th call chain
looks like it might due to GC ordering. Anyways, I will add
comments here when I commit.

Thanks,
Koichi

Thank you for the review!

#6 [ruby-core:81026] Updated by ko1 (Koichi Sasada) 9 months ago

On 2017/05/08 8:08, Eric Wong wrote:

Back to your original question. THREAD_RUNNABLE is possible
if somebody uses Thread#run:

require 'thread'
m = Mutex.new
th = Thread.new do
  sleep 0.1 # wait for main thread to get lock
  m.synchronize do

sleep
end
end

m.synchronize do
  sleep 0.2 # wait for th to block on m.synchronize
  th.run
end

I also confirm that this code set THREAD_RUNNABLE. However, th waits
locking forever, current Thread#run should be bug. mmmm. But not so
serious because it is only small period (maybe as you know). We should
modify later.

I am not sure about other statuses. Maybe exit/GC can trigger
THREAD_KILLED, the mutex_free->rb_mutex_unlock_th call chain
looks like it might due to GC ordering. Anyways, I will add
comments here when I commit.

I think adding rb_bug() guard is good to know the flow of such situation.

--
// SASADA Koichi at atdot dot net

#7 Updated by Anonymous 9 months ago

  • Status changed from Open to Closed

Applied in changeset trunk|r58604.


reduce rb_mutex_t size from 160 to 80 bytes on 64-bit

Instead of relying on a native condition variable and mutex for
every Ruby Mutex object, use a doubly linked-list to implement a
waiter queue in the Mutex. The immediate benefit of this is
reducing the size of every Mutex object, as some projects have
many objects requiring synchronization.

In the future, this technique using a linked-list and on-stack
list node (struct mutex_waiter) should allow us to easily
transition to M:N threading model, as we can avoid the native
thread dependency to implement Mutex.

We already do something similar for autoload in variable.c,
and this was inspired by the Linux kernel wait queue (as
ccan/list is inspired by the Linux kernel linked-list).

Finaly, there are big performance improvements for Mutex
benchmarks, especially in contended cases:

measure target: real

name trunk built
loop_whileloop2 0.149 0.148
vm2_mutex* 0.893 0.651
vm_thread_mutex1 0.809 0.624
vm_thread_mutex2 2.608 0.628
vm_thread_mutex3 28.227 0.881

Speedup ratio: compare with the result of `trunk' (greater is better)

name built
loop_whileloop2 1.002
vm2_mutex* 1.372
vm_thread_mutex1 1.297
vm_thread_mutex2 4.149
vm_thread_mutex3 32.044

Tested on AMD FX-8320 8-core at 3.5GHz

  • thread_sync.c (struct mutex_waiter): new on-stack struct (struct rb_mutex_struct): remove native lock/cond, use ccan/list (rb_mutex_num_waiting): new function for debug_deadlock_check (mutex_free): remove native_destroy (mutex_alloc): initialize waitq, remove nativeinitialize (rb_mutex_trylock): remove native_mutex{lock,unlock} (lock_func): remove (lock_interrupt): remove (rb_mutex_lock): rewrite waiting path to use native_sleep + ccan/list (rb_mutex_unlock_th): rewrite to wake up from native_sleep using rb_threadptr_interrupt (rb_mutex_abandon_all): empty waitq
  • thread.c (debug_deadlock_check): update for new struct (rb_check_deadlock): ditto [Feature #13517]

Also available in: Atom PDF