Misc #16125

Updated by nobu (Nobuyoshi Nakada) about 2 months ago

References PR https://github.com/ruby/ruby/pull/2396

I noticed that since the introduction of the `GC.compact` API, `struct rb_data_type_struct` (`rb_data_type_t`) spans two cache lines because of the added `dcompact` function pointer (callback):

```C
struct rb_data_type_struct {
    const char * wrap_struct_name;          /* 0 8 */
    struct {
        void (*dmark)(void *);              /* 8 8 */
        void (*dfree)(void *);              /* 16 8 */
        size_t (*dsize)(const void *);      /* 24 8 */
        void (*dcompact)(void *);           /* 32 8 */ <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        void * reserved[1];                 /* 40 8 */
    } function;                             /* 8 40 */
    const rb_data_type_t * parent;          /* 48 8 */
    void * data;                            /* 56 8 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    VALUE flags;                            /* 64 8 */

    /* size: 72, cachelines: 2, members: 5 */
    /* last cacheline: 8 bytes */
};
```

I'm wondering what the `reserved` member was originally intended for. Adding `dcompact` while also preserving `reserved` already broke binary compatibility, because the struct grew from `64` to `72` bytes.

This struct is defined in `include/ruby/ruby.h` and is used extensively in MRI as well as in extensions, so it is "public API". If it turns out that the `reserved` member is no longer needed (maybe it was meant for compaction or a similar GC feature?), could we remove it and prefer aligning on cache line boundaries instead?

Packed with the `reserved` member removed, single cache line:

```C
struct rb_data_type_struct {
    const char * wrap_struct_name;          /* 0 8 */
    struct {
        void (*dmark)(void *);              /* 8 8 */
        void (*dfree)(void *);              /* 16 8 */
        size_t (*dsize)(const void *);      /* 24 8 */
        void (*dcompact)(void *);           /* 32 8 */
    } function;                             /* 8 32 */
    const rb_data_type_t * parent;          /* 40 8 */
    void * data;                            /* 48 8 */
    VALUE flags;                            /* 56 8 */

    /* size: 64, cachelines: 1, members: 5 */
};
```
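The two layouts above can be checked at compile time. The following is a minimal sketch, not MRI code: the struct names, the `VALUE` typedef, `CACHE_LINE_SIZE` and `cache_lines_spanned` are stand-ins introduced here for illustration, and the size checks assume a platform with 8-byte pointers such as x86_64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Ruby's VALUE; assumed to be pointer-sized. */
typedef uintptr_t VALUE;

/* Assumed cache line size, matching the 64-byte boundary in the pahole output. */
#define CACHE_LINE_SIZE 64

/* Replica of the layout that keeps both `dcompact` and `reserved`. */
struct type_with_reserved {
    const char *wrap_struct_name;
    struct {
        void (*dmark)(void *);
        void (*dfree)(void *);
        size_t (*dsize)(const void *);
        void (*dcompact)(void *);
        void *reserved[1];
    } function;
    const struct type_with_reserved *parent;
    void *data;
    VALUE flags;
};

/* Replica of the proposed layout with `reserved` removed. */
struct type_without_reserved {
    const char *wrap_struct_name;
    struct {
        void (*dmark)(void *);
        void (*dfree)(void *);
        size_t (*dsize)(const void *);
        void (*dcompact)(void *);
    } function;
    const struct type_without_reserved *parent;
    void *data;
    VALUE flags;
};

/* Cache lines touched by an object of `size` bytes that starts on a line boundary. */
static size_t cache_lines_spanned(size_t size)
{
    return (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

#if UINTPTR_MAX == UINT64_MAX
/* Only checked where pointers are 8 bytes wide. */
static_assert(sizeof(struct type_with_reserved) == 72, "two cache lines");
static_assert(sizeof(struct type_without_reserved) == 64, "one cache line");
static_assert(offsetof(struct type_with_reserved, flags) == 64,
              "flags starts the second cache line");
#endif
```

Calling `cache_lines_spanned(sizeof(struct type_with_reserved))` gives the line count for an object that starts on a cache line boundary; the static data type definitions in MRI are not guaranteed to be aligned that way, so this is a lower bound.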

### Usage in MRI

Examples of internal typed data declarations that use it. With the initialization style used in MRI, the declarations do not touch the tail of the `function` struct (I realize this may not be true for all extensions):

#### AST

```C
static const rb_data_type_t rb_node_type = {
    "AST/node",
    {node_gc_mark, RUBY_TYPED_DEFAULT_FREE, node_memsize,},
    0, 0,
    RUBY_TYPED_FREE_IMMEDIATELY,
};
```

#### Fiber

```C
static const rb_data_type_t fiber_data_type = {
    "fiber",
    {fiber_mark, fiber_free, fiber_memsize, fiber_compact,},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Enumerator

The same pattern applies to the related generator etc. types.

```C
static const rb_data_type_t enumerator_data_type = {
    "enumerator",
    {
        enumerator_mark,
        enumerator_free,
        enumerator_memsize,
        enumerator_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Encoding

```C
static const rb_data_type_t encoding_data_type = {
    "encoding",
    {0, 0, 0,},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Proc, Binding and methods

```C
static const rb_data_type_t proc_data_type = {
    "proc",
    {
        proc_mark,
        RUBY_TYPED_DEFAULT_FREE,
        proc_memsize,
        proc_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
};
```

```C
const rb_data_type_t ruby_binding_data_type = {
    "binding",
    {
        binding_mark,
        binding_free,
        binding_memsize,
        binding_compact,
    },
    0, 0, RUBY_TYPED_WB_PROTECTED | RUBY_TYPED_FREE_IMMEDIATELY
};
```

```C
static const rb_data_type_t method_data_type = {
    "method",
    {
        bm_mark,
        RUBY_TYPED_DEFAULT_FREE,
        bm_memsize,
        bm_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Threads

```C
#define thread_data_type ruby_threadptr_data_type
const rb_data_type_t ruby_threadptr_data_type = {
    "VM/thread",
    {
        thread_mark,
        thread_free,
        thread_memsize,
        thread_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

And *many* others, both internal and in `ext/`. Looking at the definitions in MRI at least, I don't see:

* any typed data definition that explicitly initializes the `reserved` member
* how this would negatively affect extensions "in the wild", as the more popular ones I checked also follow the MRI init style.
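To illustrate why the MRI init style is unaffected, here is a minimal sketch with stand-in types. All names prefixed with `demo_` or `DEMO_` are hypothetical and not part of Ruby's API, and the flag value is made up. The same positional initializer compiles against both layouts because C zero-initializes any members that a brace-enclosed initializer leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t VALUE;            /* stand-in for Ruby's VALUE */
#define DEMO_FREE_IMMEDIATELY 1     /* hypothetical flag value */

/* Layout with the `reserved` member. */
struct demo_type_with_reserved {
    const char *wrap_struct_name;
    struct {
        void (*dmark)(void *);
        void (*dfree)(void *);
        size_t (*dsize)(const void *);
        void (*dcompact)(void *);
        void *reserved[1];
    } function;
    const struct demo_type_with_reserved *parent;
    void *data;
    VALUE flags;
};

/* Layout with the `reserved` member removed. */
struct demo_type_without_reserved {
    const char *wrap_struct_name;
    struct {
        void (*dmark)(void *);
        void (*dfree)(void *);
        size_t (*dsize)(const void *);
        void (*dcompact)(void *);
    } function;
    const struct demo_type_without_reserved *parent;
    void *data;
    VALUE flags;
};

static void demo_mark(void *ptr) { (void)ptr; }
static size_t demo_memsize(const void *ptr) { (void)ptr; return 0; }

/* MRI-style positional initializer: the inner braces stop before
 * `dcompact`, so the tail of `function` is zero-initialized. */
#define DEMO_INIT { \
    "demo", \
    {demo_mark, 0, demo_memsize,}, \
    0, 0, DEMO_FREE_IMMEDIATELY \
}

/* The identical initializer text is valid for both layouts. */
static const struct demo_type_with_reserved demo_with = DEMO_INIT;
static const struct demo_type_without_reserved demo_without = DEMO_INIT;
```

An extension that initializes `reserved` explicitly with a fifth entry inside the inner braces would no longer compile cleanly against the packed layout, which is the case the list above says I have not found in MRI.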

### Benchmarks

Results from the standard benchmark suite, narrowed to the benchmarks that exercise the typed data objects mentioned above.

Prelude:

```
lourens@CarbonX1:~/src/ruby/ruby$ make benchmark COMPARE_RUBY=~/src/ruby/trunk/ruby OPTS="-v --repeat-count 10"
./revision.h unchanged
/usr/local/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
--executables="compare-ruby::/home/lourens/src/ruby/trunk/ruby -I.ext/common --disable-gem" \
--executables="built-ruby::./miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -- --disable-gems --disable-gem" \
$(find ./benchmark -maxdepth 1 -name '' -o -name '**.yml' -o -name '**.rb' | sort) -v --repeat-count 10
compare-ruby: ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_64-linux]
built-ruby: ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]
```

Left column is `compare-ruby` (master), right column is `built-ruby` (this branch):

```
require_thread 0.035 0.049 i/s - 1.000 times in 28.932403s 20.426896s
vm1_blockparam_call 18.885M 18.907M i/s - 30.000M times in 1.588571s 1.586713s
vm1_blockparam_pass 15.159M 15.434M i/s - 30.000M times in 1.978964s 1.943805s
vm1_blockparam_yield 20.560M 20.673M i/s - 30.000M times in 1.459127s 1.451188s
vm1_blockparam 32.733M 33.358M i/s - 30.000M times in 0.916513s 0.899344s
vm1_block 33.796M 34.215M i/s - 30.000M times in 0.887692s 0.876808s
vm2_fiber_reuse_gc 98.480 104.688 i/s - 100.000 times in 1.015439s 0.955219s
vm2_fiber_reuse 364.082 397.878 i/s - 200.000 times in 0.549327s 0.502667s
vm2_fiber_switch 11.548M 11.730M i/s - 20.000M times in 1.731852s 1.704978s
vm2_proc 36.025M 36.278M i/s - 6.000M times in 0.166552s 0.165389s
vm_thread_alive_check 108.273k 109.290k i/s - 50.000k times in 0.461794s 0.457499s
vm_thread_close 1.415 1.432 i/s - 1.000 times in 0.706720s 0.698509s
vm_thread_condvar1 1.287 1.287 i/s - 1.000 times in 0.776782s 0.777074s
vm_thread_condvar2 1.653 1.615 i/s - 1.000 times in 0.604922s 0.619380s
vm_thread_create_join 0.913 0.921 i/s - 1.000 times in 1.094693s 1.085227s
vm_thread_mutex1 2.537 2.581 i/s - 1.000 times in 0.394181s 0.387481s
vm_thread_mutex2 2.571 2.577 i/s - 1.000 times in 0.388932s 0.388020s
vm_thread_mutex3 1.110 1.660 i/s - 1.000 times in 0.900852s 0.602422s
vm_thread_pass_flood 5.867 9.997 i/s - 1.000 times in 0.170431s 0.100032s
vm_thread_pass 0.349 0.350 i/s - 1.000 times in 2.865303s 2.854191s
vm_thread_pipe 6.923 7.093 i/s - 1.000 times in 0.144447s 0.140993s
vm_thread_queue 1.297 1.287 i/s - 1.000 times in 0.771302s 0.777274s
vm_thread_sized_queue2 1.538 1.479 i/s - 1.000 times in 0.650188s 0.676074s
vm_thread_sized_queue3 1.421 1.456 i/s - 1.000 times in 0.703753s 0.686595s
vm_thread_sized_queue4 1.347 1.342 i/s - 1.000 times in 0.742653s 0.745130s
vm_thread_sized_queue 5.473 5.377 i/s - 1.000 times in 0.182710s 0.185966s
```

### Further cache utilization info

I ran `perf stat` on a Rails console, using the integration session helper to load the Redmine homepage 100 times. This removes network round trips and other variance, and it is easier for reviewers to reproduce because it needs fewer tools.

Master:

```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Processing by WelcomeController#index as HTML
Current user: anonymous
Rendering welcome/index.html.erb within layouts/base
Rendered welcome/index.html.erb within layouts/base (0.5ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.3ms)
=> 100
irb(main):002:0> RUBY_DESCRIPTION
=> "ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_64-linux]"
irb(main):003:0> exit

Performance counter stats for 'bin/rails c -e production':

4373,155316 task-clock (msec) # 0,093 CPUs utilized
819 context-switches # 0,187 K/sec
30 cpu-migrations # 0,007 K/sec
82376 page-faults # 0,019 M/sec
13340422873 cycles # 3,051 GHz (50,18%)
17274934973 instructions # 1,29 insn per cycle (62,74%)
3558147880 branches # 813,634 M/sec (62,42%)
77703222 branch-misses # 2,18% of all branches (62,39%)
4625597415 L1-dcache-loads # 1057,725 M/sec (62,22%)
216886763 L1-dcache-load-misses # 4,69% of all L1-dcache hits (62,54%)
66242477 LLC-loads # 15,148 M/sec (50,19%)
13766303 LLC-load-misses # 20,78% of all LL-cache hits (50,05%)

47,171186591 seconds time elapsed
```

This branch:

```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Started GET "/" for 127.0.0.1 at 2019-08-20 23:40:43 +0100
Processing by WelcomeController#index as HTML
Current user: anonymous
Rendering welcome/index.html.erb within layouts/base
Rendered welcome/index.html.erb within layouts/base (0.6ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.4ms)
=> 100
irb(main):002:0> p RUBY_DESCRIPTION
"ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]"
=> "ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]"
irb(main):003:0> exit

Performance counter stats for 'bin/rails c -e production':

4318,441633 task-clock (msec) # 0,112 CPUs utilized
599 context-switches # 0,139 K/sec
14 cpu-migrations # 0,003 K/sec
81011 page-faults # 0,019 M/sec
13241070220 cycles # 3,066 GHz (49,56%)
17323594358 instructions # 1,31 insn per cycle (62,27%)
3553794043 branches # 822,934 M/sec (62,89%)
76390145 branch-misses # 2,15% of all branches (63,12%)
4595415722 L1-dcache-loads # 1064,138 M/sec (62,83%)
202269349 L1-dcache-load-misses # 4,40% of all L1-dcache hits (62,66%)
66193702 LLC-loads # 15,328 M/sec (49,44%)
12548399 LLC-load-misses # 18,96% of all LL-cache hits (49,49%)

38,464764876 seconds time elapsed
```

Conclusions:

* Minor improvement in instructions per cycle: `1,29` -> `1,31`
* `L1-dcache-loads`: `1057,725 M/sec` -> `1064,138 M/sec` (higher rate of L1 cache loads)
* `L1-dcache-load-misses`: `4,69%` -> `4,40%` (reduced L1 cache miss rate)
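The miss rates quoted above can be recomputed from the raw counters. The sketch below copies the `L1-dcache-loads` and `L1-dcache-load-misses` values from the two `perf stat` runs; the helper names are made up for illustration, and since each configuration was measured once, the result is an indication rather than a statistically solid number.

```c
#include <assert.h>

static_assert(sizeof(unsigned long long) >= 8, "counters need 64 bits");

/* Raw counters copied from the two perf stat runs above. */
static const unsigned long long master_l1_loads  = 4625597415ULL;
static const unsigned long long master_l1_misses = 216886763ULL;
static const unsigned long long branch_l1_loads  = 4595415722ULL;
static const unsigned long long branch_l1_misses = 202269349ULL;

/* Misses as a percentage of loads. */
static double miss_rate_percent(unsigned long long misses,
                                unsigned long long loads)
{
    return 100.0 * (double)misses / (double)loads;
}

/* Relative reduction, in percent, going from `before` to `after`. */
static double relative_reduction_percent(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Passing the two results of `miss_rate_percent` to `relative_reduction_percent` expresses the change relative to master's miss rate, which is easier to compare across runs than the absolute difference in percentage points.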

Thoughts?
