Bug #10009
IO operation is 10x slower in a multi-threaded environment
Description
I previously created issue #9832, but it did not involve an I/O operation.
In the attached script I simulate an I/O operation in a multi-threaded environment.
For Ruby 1.9.2, apply taskset -c -p 2 #{Process.pid}
to regulate the threads' behavior.
The second thread performs an I/O operation.
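The attached teste_thread_schedule.rb is not reproduced in the report. As a rough illustration of the setup described above (two CPU-bound counter threads plus one thread doing blocking Postgres queries, optionally pinned to one CPU), a minimal sketch could look like the following; the thread layout, counter names, and the SELECT 1 query are assumptions for illustration, not the attachment's actual code.

    require "pg"   # assumed here because the reported numbers count Postgres queries

    # Pin the whole process to a single CPU, as the description suggests for Ruby 1.9.2.
    system("taskset -c -p 2 #{Process.pid}")

    counters = { first: 0, second: 0, third: 0 }
    conn = PG.connect(dbname: "postgres")

    threads = []
    threads << Thread.new { loop { counters[:first] += 1 } }        # CPU-bound counter
    threads << Thread.new do                                        # blocking I/O thread
      loop { conn.exec("SELECT 1"); counters[:second] += 1 }
    end
    threads << Thread.new { loop { counters[:third] += 1 } }        # CPU-bound counter

    sleep 20                                    # fixed measurement window
    threads.each(&:kill)
    counters.each { |name, count| puts "#{name} #{count}" }

With a fair scheduler the "second" counter should stay roughly proportional to how often the I/O thread is allowed to run, which is what the numbers below measure.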
My results:
ruby 2.1.2
first 43500194
second 95
third 42184385

ruby 2.0.0-p451
first 38418401
second 95
third 37444470

ruby 1.9.3-p545
first 121260313
second 50
third 44275164

ruby 1.9.2-p320
first 31189901
second 897 <============
third 31190598
Regards
Alexandre Riveira
Updated by ariveira (Alexandre Riveira) over 10 years ago
My environment is Debian 3.2.0-4-amd64
Updated by ariveira (Alexandre Riveira) over 10 years ago
Alexandre Riveira wrote:
I ran the tests using Rubinius. Because of the applied taskset, Rubinius uses only one processor. Results:
first 18164692
second 10007 <==========
third 18184825
Updated by normalperson (Eric Wong) over 10 years ago
I'll try resurrecting an old eventfd proposal and maybe also bare futexes
to see if that improves things.
Updated by ariveira (Alexandre Riveira) over 10 years ago
- File teste_thread_schedule.py teste_thread_schedule.py added
- File teste_thread_schedule.rb teste_thread_schedule.rb added
Eric Wong wrote:
I'll try resurrecting an old eventfd proposal and maybe also bare futexes
to see if that improves things.
Thanks, Eric.
If an application running Rainbows! has just one thread using 100% CPU, database queries from the whole worker are affected severely. The workaround is to fork a worker for the heavy tasks, but often this is not possible.
Python's GIL handles this better, but Python uses 160% CPU while Ruby uses 100%.
Updated by normalperson (Eric Wong) over 10 years ago
- File test_thread_sched_pipe.rb test_thread_sched_pipe.rb added
- Description updated (diff)
eventfd doesn't help performance (but it still reduces FD count); I never expected eventfd to improve speed, anyway.
Lowering TIME_QUANTUM_USEC (in thread_pthread.c) helps with the I/O case
(try it yourself if you have a 1000HZ kernel); but hurts overall
throughput.
Attached is an I/O benchmark using pipes, with no Postgres requirement.
Increasing GVL (or any lock) performance is tricky because we need to
balance fairness and avoid starvation cases. The GVL was rewritten to
avoid starvation in 1.9.3, so that's likely the cause of the major
difference starting with 1.9.3.
I doubt I can noticeably improve performance with futexes vs mutex/condvar.
How much does GVL performance between 1.9.2 and 2.1 affect real-world
performance on Rainbows!/yahns apps for you? (not "hello world"-type
apps).
I hope to make GVL optional in a few years, but that is tricky.
Ironically, part of the reason I don't like GVL is I don't want to pay
any threading/locking costs for tiny single-threaded apps, either :)
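The attached test_thread_sched_pipe.rb is likewise not inlined. A pipe-based equivalent of the sketch above, with no database dependency, might look roughly like this; again, this is a guess at the attachment's shape, not its actual contents.

    r, w = IO.pipe
    counters = { first: 0, second: 0, third: 0 }

    threads = []
    threads << Thread.new { loop { counters[:first] += 1 } }   # CPU-bound counter
    threads << Thread.new do                                   # blocking pipe I/O thread
      loop do
        w.write("x")
        r.read(1)                # blocking read; the GVL is released while waiting
        counters[:second] += 1
      end
    end
    threads << Thread.new { loop { counters[:third] += 1 } }   # CPU-bound counter

    sleep 20                     # fixed measurement window
    threads.each(&:kill)
    counters.each { |name, count| puts "#{name} #{count}" }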
Updated by ariveira (Alexandre Riveira) over 10 years ago
My application is not a web site; it is an ERP, so reports and other very heavy tasks are run. The system then effectively stalls: a single thread using 100% CPU hurts the whole worker, every request takes at least 2 seconds, and the requests start piling up.
The key point is that one thread using 100% CPU leaves all the other worker threads able to make only a few requests to Postgres.
Is Ruby without the GVL possible?
I believe Python behaves better because it uses part of another CPU (160% total) to manage all the threads.
Running the same test with PyPy, it uses 100% CPU like Ruby and shows the same problems.
Updated by ariveira (Alexandre Riveira) over 10 years ago
Some information that I consider important:
With the BFS kernel scheduler, Ruby 1.9.2 works fine, as if taskset had been applied.
Other kernels, such as FreeBSD and macOS, show similar behavior with Ruby 1.9.2.
http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
https://wiki.archlinux.org/index.php/linux-ck
Updated by ariveira (Alexandre Riveira) over 10 years ago
My results on a Linux kernel with BFS/CK + taskset:
first 103214331
second 2762 <======
third 24259986
Updated by ariveira (Alexandre Riveira) over 10 years ago
Eric Wong wrote:
Lowering TIME_QUANTUM_USEC (in thread_pthread.c) helps with the I/O case
(try it yourself if you have a 1000HZ kernel); but hurts overall
throughput.
Hello, Eric!
I am delighted with the result of changing TIME_QUANTUM_USEC. I changed its value to 1000; just look at the results:
ruby 2.
first 17434583
second 2754 <=============
third 16752441
If any problems show up, I will try 10 * 1000.
It seems incredible, because there was no need to apply taskset.
As this is a microbenchmark, I will run more tests, and if all goes well put the change into production. I will report back afterwards.
Updated by ariveira (Alexandre Riveira) over 10 years ago
Alexandre Riveira wrote:
I am delighted with the result of changing TIME_QUANTUM_USEC. I changed its value to 1000.
Tests are complete. Without the change, under stress tests my system took 30 seconds to load a page; after the change, pages load almost instantly, all of them in less than 1 second.
Updated by normalperson (Eric Wong) over 10 years ago
Good to know it works for you. Keep in mind TIME_QUANTUM_USEC=1000 is
very low and may cause problems on some systems, too.
My gut feeling is 100ms (default) is too high, but 10ms is too low
(based on kosaki's comment). Maybe 20ms - 50ms is acceptable. There is
a wide variety of configuration we must work with (even just on Linux).
Can you try 20-50ms?
About GVL:
Replacing GVL with fine-grained locks is possible (and ko1 tried it),
but performance suffered for single-thread cases.
It should be possible to do with lock-free techniques, but that is
difficult to get right.
Updated by ariveira (Alexandre Riveira) over 10 years ago
Hi Eric !
Eric Wong wrote:
Good to know it works for you. Keep in mind TIME_QUANTUM_USEC=1000 is very low and may cause problems on some systems, too.
What problems might I run into?
Can you try 20-50ms?
In the application I run a stress test where 5 threads overload the system.
I tested 50 ms, and the latency for the next request is over 15 seconds.
I tested 20 ms, and the latency is 10 seconds.
I tested 10 ms, and the latency is 4 or 5 seconds.
The magic number is TIME_QUANTUM_USEC=1000; there is no latency in that case.
Below are results from the teste_thread_schedule_2 microbenchmark with Postgres.
TIME_QUANTUM_USEC (1000)
first 22882400
second 2654 <===
third 22642172
in 21.08 seconds
2654 / 21.08 is about 125 database connections per second.
TIME_QUANTUM_USEC (20 * 1000)
first 33003617
second 258 <==
third 33851933
in 23.07 seconds
258 / 23.07 is about 11 database connections per second. I think this is a small number of connections per second, but I welcome comments.
TIME_QUANTUM_USEC (50 * 1000)
first 42811975
second 116
third 42005480
in 25.12 seconds
116 / 25.12 is about 5 database connections per second.
Updated by normalperson (Eric Wong) about 10 years ago
normalperson@yhbt.net wrote:
I doubt I can noticeably improve performance with futexes vs mutex/condvar.
Totally not-speed-optimized futex-based lock/condvar implementation at
git://bogomips.org/ruby.git (futex branch)
http://bogomips.org/ruby.git/patch?id=ae93c50c8de
I am not sure if my implementation is correct, but "make check" passes
with both 8 cores and 1 core active (8-core Vishera). I will probably
write an independent (C-only) test for more parallelism and maybe steal
some from glibc (I also plan on using this futex-based lock
implementation outside of Ruby).
Benchmarks don't seem to show much (if any) improvement, yet. Speed
improvement from reimplementing GVL around bare futex interface may be
possible (w/o using separate condvar/mutex layer).
On amd64 GNU/Linux, pthread_mutex_t is 40 bytes, but these futex-based
locks only need 4 bytes. Similarly, pthread_cond_t is 48 bytes, making
rb_nativethread_cond_t 56 bytes with pthreads; this futex implementation
currently requires only 16 bytes for a condvar.
Size improvement may be noticeable for some apps with many Mutexes:
the lock/cond reductions mean rb_mutex_struct is now 48 bytes instead
of 128 bytes.
Updated by ariveira (Alexandre Riveira) about 10 years ago
- File test_thread_sched.rb test_thread_sched.rb added
I rewrote the test and created the --taskset and --postgres arguments, so the same test file can be used for both cases.
Feel free to change whatever you want.
I will bring news about the test with the futex branch soon.
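The rewritten test_thread_sched.rb is attached rather than shown inline. One plausible way such --taskset/--postgres switches could be wired up is sketched below; the option names match the comment above, but everything else (OptionParser, the pipe fallback, the SELECT 1 query) is an illustrative assumption, not the attachment's actual code.

    require "optparse"

    options = { taskset: false, postgres: false }
    OptionParser.new do |opts|
      opts.on("--taskset")  { options[:taskset]  = true }
      opts.on("--postgres") { options[:postgres] = true }
    end.parse!(ARGV)

    # Pin the process to a single CPU only when requested.
    system("taskset -c -p 2 #{Process.pid}") if options[:taskset]

    # Pick the blocking I/O operation for the I/O thread: a Postgres query
    # when --postgres is given, a pipe round-trip otherwise.
    io_call =
      if options[:postgres]
        require "pg"
        conn = PG.connect(dbname: "postgres")
        -> { conn.exec("SELECT 1") }
      else
        r, w = IO.pipe
        -> { w.write("x"); r.read(1) }
      end

    io_call.call   # the benchmark's I/O thread would call this in a loop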
Updated by ariveira (Alexandre Riveira) about 10 years ago
- File test_thread_sched.rb test_thread_sched.rb added
- File tests.txt tests.txt added
I added uname to the test script to report kernel/platform details.
The accompanying test results follow.
Tests (test_thread_sched.rb --postgres) on debian-kfreebsd-amd64:
ruby 1.9.2
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processors)
taskset........: false
total..........: 101480933
postgres.......: 467
time...........: 20.232985931 (ideal value of 20 seconds)
ruby 2.1.2
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processors)
taskset........: false
total..........: 71870185
postgres.......: 58
time...........: 21.123303293 (ideal value of 20 seconds)
ruby 2.1.2 with TIME_QUANTUM_USEC = 1000
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processors)
taskset........: false
total..........: 63996063
postgres.......: 2510
time...........: 20.050760184 (ideal value of 20 seconds)
Updated by normalperson (Eric Wong) about 10 years ago
Some tests adapted from glibc:
git clone git://80x24.org/rb_futex_test
tst-cond18-f/p are microbenchmarks; -f (the futex version) is roughly twice as fast as -p (the pthreads version), but that doesn't seem to translate into noticeable real-world speed improvements in Ruby.
Updated by ariveira (Alexandre Riveira) about 10 years ago
The following Python script compares blocking I/O in Python versus Ruby.
Results:
ruby without changes TIME_QUANTUM_USEC (100 * 1000)
first..........: 32445253
second.........: 30660119
postgres.......: 61
time...........: 1.5022704 secs
ruby with TIME_QUANTUM_USEC (1 * 1000)
first..........: 17793384
second.........: 17438453
postgres.......: 4638
python
first 17498064
postgres 2027
third 18702539
Updated by hsbt (Hiroshi SHIBATA) over 9 years ago
- Assignee set to ko1 (Koichi Sasada)
- Priority changed from 6 to Normal
Updated by hsbt (Hiroshi SHIBATA) 7 months ago
- Status changed from Open to Assigned