Project

General

Profile

Actions

Bug #14858

closed

Introduce 2nd GC heap named Transient heap

Added by ko1 (Koichi Sasada) almost 6 years ago. Updated over 5 years ago.

Status:
Closed
Target version:
[ruby-core:87538]

Description

Abstract

We propose to introduce "2nd" GC managed heap named "Transient heap" into MRI, instead of malloc management heap.
Employied GC algorithm is similar to "generational" "copying" GC algorithm.
This technique can reduce problems of malloc'ed heap.

Background

MRI and malloc

MRI manages memory by GC managed heap and malloc'ed heap.

Objects are allocated from GC managed heap. However, each object has 3 extra words to store data because each object slot only has 5 words (2 words for internal use). Extra data are allocated from malloc'ed heap and free the malloc'ed memory object when the object died.

For example, an array object consists of equal or less than 3 elements, it doesn't need malloc because we can use 3 slots to store them into an object slot. However, if an array object consists of greater than 3 elements, it require malloc'ed heap.

Using malloc'ed heap is easy way to implement, however there are several disadvantages.

Introducing custom malloc library is one option, but it should be tough work to tune with system specific knowlege and to maintain it.

Generational Hypothesis

Most of objects die in young. This is common obserbation used by generational GC.
And we can see same behavior at least rdoc measurement https://bugs.ruby-lang.org/issues/14857.

Can we use this nature?

Copy GC algorithm

Copy GC algorithm is known as one of fastest GC because:

  • Sweep time is proportional to living objects number.
  • Copying memory objects can increase space locality.
  • Copying becomes compaction so no fragmentation issue.

However, MRI can't employ copying algorithm because of C implementation limitation (implementation/compatibility issue).
Mostly copying (compaction) GC is suggested several times.

Copying GC can be combine with generational technique (collect only young objects).

Bump allocation

Fastest memory allocation algorithm is "bump allocation".

  • (Init) allocate memory area and initialize index as 0
  • (Alloc) increment index with required size and return original index.

Of course, this algorithm has many problem, especially fragmentation.

Design

Provide another generational copying GC heap for MRI and use it if we can use.
This proposal shows how to make a generational copying GC and use it from MRI.

The ideas are:

  • (1) Provide copy GC area beside normal GC managed area.
  • (2) Marking memory objects at normal GC marking.
  • (3) Promote long-lived memory objects and copy them into malloc'ed heap.
  • (4) Copy marked memory objects at interrupt point.

(3) and (4) are different from normal (generational) copying GCs.

Our proposal has several advantages:

  • Fast allocation: We can employ bump allocation for memory allocation.
  • Fast free: We don't need to touch free'ed memory objects.
  • Early free: We don't need to wait lazy sweep to free spaces.
  • Generational: Long lived memory objects are copied to malloc'ed heap.
  • Controllable heap: We can manage memory buffers allocated by system (for short-lived memory objects).
  • Easy maintenance: We can use system's malloc library for long living objects.
  • Compatible: Our proposal is compatible with current MRI and existing C extensions.

We named this 2nd managed heap as "Transient heap" because this heap only manage short-living objects.

Algorithm

The following is outline of our proposed algorithm:

(1) Prepare memory area.
(2) rb_transient_heap_alloc() returns memory objects in memory area using bump allocation and mark object as transient object.
Objects using transient heap are marked as "transient flag".
(2') If there are no space to return, then return NULL.
(3) Start GC marking. If marked object uses transient heap (marked by transient flag), tell object and allocated pointer to transient heap by rb_transient_heap_mark().
(4) After marking, wait until interrupt point. De-transient (allocate memory by malloc and copy from transient heap to malloc'ed memory) for all marked memory objects.
After de-transient all marked objects, then the memory area is free and reuse it.
(5) goto (2)

We need to wait interrupt point to de-transient because we can't move memory objects just after GC marking.

For example,

const VALUE *ptr = RARRAY_CONST_PTR(ary);
VALUE *tmp = ALLOC_N(size); // can cause GC
use(ptr);                   // ptr should point valid area

we can't move memory area between GC (marking).

However, we can move memory objects across interrupt point because any Ruby program can run at interrupt point and such Ruby program can invalidate raw pointers.

For example,

const VALUE *ptr = RARRAY_CONST_PTR(ary);
VALUE v = rb_funcall(...); // call Ruby method and can pass over interrupt point.
  // at interrupt point, if ary.replace(...) or something destructive operations can run here.
use(ptr);                  // ptr can be invalidated

This case, we need to care that ptr can be invalidated even if we are using current MRI.

Scenario

To show the algorithm, let's discuss with Array objects.

(1) Allocate an array

Array.new(4) allocates 4 words by malloc with current MRI.
Instead of using malloc', call rb_transient_heap_alloc(size)` to allocate a memory.
Returning pointer is allocated pointer from transient heap.
Set "Transient flag" to a created array object and set pointer.

(2-1) Free array object

If an array object is a short-lived object, then we don't do any more. We don't need to call free.

(2-2) Mark and deserialize

If an array object is long-lived object (survive one GC), we need to make this object survive.
So that GC marking phase, when an array is marked and the array is transient flag,
then call rb_transient_heap_mark() to tell transient heap that the object is living.

(3) Cleanup memory area for transient heap

Wait for an interrupt point (RUBY_VM_CHECK_INTS()) and de-transient for all marked (living) objects.
Transient heap has no living memory objects so that we can reuse this heap area.

Details

The above explanation is simplified description. See source code for details.

Points are:

  • Using blocks instead of using one big memory area for transient heap.
  • Separate "using blocks" and "marked blocks".
  • Prepare promoted objects table to take care promoted objects.

Copying not just after marking was difficult than I expected...

Compatibility

Some code can rely on that raw pointer acquired from a frozen object can not be invalidate.
We need to survey more.

Result

Now only Array uses transient heap because of limitation of human resource.

micro-benchmark

require 'benchmark'
N = 10_000_000

def n_times str, args = ''
  eval <<-EOS
    proc{|max, #{args}|
      i = 0
      while i < max
        #{str}
        i+=1
      end
    }
  EOS
end

m = n_times 'ary = Array.new(size)', 'size'

Benchmark.bm(10){|x|
  0.step(to: 16){|i|
    size = i
    x.report(size){
      m.call(N, size)
    }
  }
}

This microbenchmark measure the time Array.new(n) (10M times) where 0 <= n <= 10.

Current MRI:

                 user     system      total        real
0            0.973336   0.000000   0.973336 (  0.969181)
1            1.005509   0.000000   1.005509 (  1.004734)
2            1.021980   0.000000   1.021980 (  1.025868)
3            1.153219   0.003657   1.156876 (  1.157852)
4            1.814474   0.004027   1.818501 (  1.822703)
5            1.882669   0.000000   1.882669 (  1.882087)
6            1.979505   0.003985   1.983490 (  1.984733)
7            1.960319   0.000000   1.960319 (  1.959602)
8            2.032639   0.000000   2.032639 (  2.033494)
9            2.160479   0.003994   2.164473 (  2.163686)
10           2.045469   0.008000   2.053469 (  2.054952)
11           1.901561   0.000000   1.901561 (  1.901080)
12           1.868254   0.004001   1.872255 (  1.872298)
13           1.864242   0.004001   1.868243 (  1.867719)
14           1.867359   0.004000   1.871359 (  1.871252)
15           1.849659   0.004004   1.853663 (  1.853597)
16           1.940981   0.003995   1.944976 (  1.946946)

Using proposed transient heap:

                 user     system      total        real
0            0.948643   0.000000   0.948643 (  0.951356)
1            0.945786   0.000000   0.945786 (  0.945294)
2            0.967074   0.000000   0.967074 (  0.967439)
3            0.964790   0.003212   0.968002 (  0.968342)
4            1.177959   0.000013   1.177972 (  1.178215)
5            1.200862   0.000006   1.200868 (  1.200600)
6            1.190707   0.000000   1.190707 (  1.190252)
7            1.297028   0.000000   1.297028 (  1.297079)
8            1.256999   0.000006   1.257005 (  1.256636)
9            1.364489   0.000002   1.364491 (  1.368413)
10           1.294030   0.000002   1.294032 (  1.293557)
11           1.392180   0.000000   1.392180 (  1.392618)
12           1.347170   0.000000   1.347170 (  1.347101)
13           1.426268   0.000000   1.426268 (  1.425432)
14           1.330723   0.000004   1.330727 (  1.333744)
15           1.458826   0.000000   1.458826 (  1.458509)
16           1.349267   0.000000   1.349267 (  1.348561)

Notable change is here:

# current MRI:
3            1.153219   0.003657   1.156876 (  1.157852)
4            1.814474   0.004027   1.818501 (  1.822703)

# transient heap:
3            0.964790   0.003212   0.968002 (  0.968342)
4            1.177959   0.000013   1.177972 (  1.178215)

We don't need to allocate extra heap for 3 elements.
With 4 elements the first element size requiring extra memory, the impact of transient heap is easy to see.

app benchmark

Run make install gcbench-rdoc.

Current MRI:

../../clean-trunk/benchmark/gc/rdoc.rb
      user     system      total        real
 15.599328  14.789680  30.389008 ( 35.894391)
GC total time (sec): 3.7352798950005948

VmHWM: 501916 kB

Summary of rdoc on 2.6.0dev     35.894391419016756      3.7352798950005948      167
         (real time in sec, GC time in sec, GC count)

Transient heap:

../../trunk/benchmark/gc/rdoc.rb
      user     system      total        real
 22.403497   6.122249  28.525746 ( 33.900175)
GC total time (sec): 3.0769403599987992

VmHWM: 408460 kB

Summary of rdoc on 2.6.0dev     33.90017474500928       3.0769403599987992      184
         (real time in sec, GC time in sec, GC count)

Surprisingly, VmHWM is reduced well (not surveyed yet).

No notable speedup.

Futurework

We need to implement more and mroe.

  • Support String (most important)
  • Support Hash (st), but it seems difficult
  • Extend (shrink) policy for transient heap (now fixed size heap)

Conclusion

This ticket propose 2nd heap managed by generational copying GC named "Transient heap".
Transient heap only contains young memory objects and old memory objects (pointed from promoted objects) are escaped to malloc'ed memory.
Most of memory objects are died in young age (generatioal hypothesis) so that

I hope this proposal can solve malloc'ed heap problems.


Files

transient_heap.patch (45.5 KB) transient_heap.patch ko1 (Koichi Sasada), 06/21/2018 03:11 AM
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0