Feature #14767

[PATCH] gc.c: use monotonic counters for objspace_malloc_increase

Added by normalperson (Eric Wong) almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Target version:
-
[ruby-core:87091]

Description

gc.c: use monotonic counters for objspace_malloc_increase

atomic_sub_nounderflow is expensive and objspace_malloc_increase
was showing up near the top of some perf profiles. The new
implementation allows the compiler to inline and eliminate
some branches from objspace_malloc_increase.

This consistently improves bm_so_count_words benchmark by
around 10% on my hardware.

name built
so_count_words 1.107

We started this discussion back in [ruby-core:61508]
https://public-inbox.org/ruby-core/5323FE11.3000908@atdot.net/
I may revisit non-atomic counters another day, but I think
this patch is good for now since I haven't found a downside
(running full benchmarks, will take a while)
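
For readers skimming the patch, here is a minimal, self-contained sketch of the idea described above, using C11 stdatomic (the actual gc.c patch uses Ruby's ATOMIC_SIZE_* macros, and the names below only mirror identifiers that appear later in this thread):

    #include <stdatomic.h>
    #include <stddef.h>

    /* Two counters that only ever grow, so no underflow check
     * (atomic_sub_nounderflow) is needed on the free() path. */
    struct monoctr {
        atomic_size_t add;  /* total bytes malloc'ed */
        atomic_size_t sub;  /* total bytes freed */
    };

    static void
    monoctr_add(struct monoctr *mc, size_t n)
    {
        atomic_fetch_add_explicit(&mc->add, n, memory_order_relaxed);
    }

    static void
    monoctr_sub(struct monoctr *mc, size_t n)
    {
        /* a free() bumps .sub instead of decrementing .add */
        atomic_fetch_add_explicit(&mc->sub, n, memory_order_relaxed);
    }

The net increase is then just add - sub, computed once when GC needs it; the update paths no longer need an underflow branch, which is what lets the compiler inline and simplify objspace_malloc_increase.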


Updated by ko1 (Koichi Sasada) almost 6 years ago

Honestly speaking, I've forgotten the discussions :p

The only problem is that if we have a bug (and there could be one, since nobody tests the malloc/free counts), the values could grow into strange numbers. Currently we reset them at GC time.

What do you think about it?

Updated by normalperson (Eric Wong) almost 6 years ago

ko1 (Koichi Sasada) wrote:

The only problem is that if we have a bug (and there could be one, since nobody tests the malloc/free counts), the values could grow into strange numbers.

I think we may get strange values with malloc_increase currently, too. Not a big deal; GC may just start later or earlier.

Currently we reset them at GC time.

What do you think about it?

Not sure what you mean; we always ATOMIC_SIZE_EXCHANGE malloc_increase to zero in gc_reset_malloc_info. My patch also resets both counters (not 100% atomically, since it is two atomic operations, but I think that is OK as a hint).

Btw, I noticed I left a stray ssize_t usage in my patch to gc_reset_malloc_info; I will fix it up if I commit this.
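
To make the "resets both counters" point concrete, here is a rough continuation of the struct monoctr sketch above; the name monoctr_xchg0 matches the line quoted in the next comment, but this body is only a reconstruction, not the committed code:

    #include <sys/types.h>  /* ssize_t */

    /* Swap both counters back to zero and return the net increase since
     * the last reset.  The two exchanges are not one atomic operation, so
     * a concurrent malloc/free can slip in between them; the result is
     * only a hint for GC, as noted above. */
    static ssize_t
    monoctr_xchg0(struct monoctr *mc)
    {
        size_t add = atomic_exchange_explicit(&mc->add, 0, memory_order_relaxed);
        size_t sub = atomic_exchange_explicit(&mc->sub, 0, memory_order_relaxed);

        /* may be negative if frees outpaced mallocs since the last GC */
        return (ssize_t)(add - sub);
    }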

Updated by ko1 (Koichi Sasada) almost 6 years ago

Sorry, I missed the following line:

ssize_t inc = monoctr_xchg0(&objspace->malloc_params.m);

So no problem.

In my last comment I thought these counters were never reset (because I had only searched for "= 0" lines).

Updated by normalperson (Eric Wong) almost 6 years ago

ko1 (Koichi Sasada) wrote:

In my last comment I thought these counters were never reset (because I had only searched for "= 0" lines).

OK. Also, did we ever consider a signed type (ssize_t) for the malloc counters instead? That may be more useful if we switch to thread-local storage for the counters to avoid atomics.

Probably won't be online much the next few days.

Updated by normalperson (Eric Wong) almost 6 years ago

Btw, I noticed I left a stray ssize_t usage in my patch to gc_reset_malloc_info; I will fix it up if I commit this.

Updated patch with ssize_t fix, here:
https://80x24.org/spew/20180517024218.30730-1-e@80x24.org/raw

Also, I realized we don't need atomics or a monotonic counter for oldmalloc_increase. The only place we use oldmalloc_increase is gc_reset_malloc_info:

https://80x24.org/spew/20180517025137.GA1569@whir/raw

(GC.stat needs adjustment, though)

Not sure I notice a performance improvement; maybe 0.5%. Hard to measure...


Updated by normalperson (Eric Wong) almost 6 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r63463.


gc.c: use monotonic counters for objspace_malloc_increase

atomic_sub_nounderflow is expensive and objspace_malloc_increase
was showing up near the top of some perf profiles. The new
implementation allows the compiler to inline and eliminate
some branches from objspace_malloc_increase.

Furthermore, we do not need atomics for oldmalloc_increase.

This consistently improves bm_so_count_words benchmark by
around 10% on my hardware.

name built
so_count_words 1.107

[ruby-core:87096] [Feature #14767]

Updated by normalperson (Eric Wong) almost 6 years ago

Eric Wong wrote:

OK. Also, did we ever consider signed type (ssize_t) for malloc
counters instead? This may be more useful if we switch to
thread-local storage for counters to avoid atomics.

Reverted for now because of regressions in:

bm_array_sample_100k__6k
bm_array_sample_100k___10k
bm_array_sample_100k___50k

I think the original unsigned + underflow avoidance code
prevented us from accounting free() properly, and we were
triggering GC more as a result.

With this patch, frees are accounted properly, so malloc_increase grew too slowly...

Also, I think it would be beneficial to check malloc_increase
and do a lazy sweep step BEFORE calling malloc, since that should
improve cache locality in malloc (because they're usually LIFO).

Updated by nobu (Nobuyoshi Nakada) almost 6 years ago

    size_t add = ATOMIC_SIZE_EXCHANGE(mc->add, 0);
    size_t sub = ATOMIC_SIZE_EXCHANGE(mc->sub, 0);

Is this combination of two atomic operations atomic?

Updated by normalperson (Eric Wong) almost 6 years ago

nobu (Nobuyoshi Nakada) wrote:

    size_t add = ATOMIC_SIZE_EXCHANGE(mc->add, 0);
    size_t sub = ATOMIC_SIZE_EXCHANGE(mc->sub, 0);

Is this combination of two atomic operations atomic?

No, though I don't think it matters too much.
If we really care about atomicity, we can use pointers to
monoctr and swap those, instead.

struct monoctr *active;
struct monoctr a;
struct monoctr b;

ATOMIC_PTR_EXCHANGE would keep ->active pointing to &a or &b. But I think I'll redo the approach to this patch entirely next week...
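
A hedged sketch of that double-buffered idea, again with C11 atomics standing in for ATOMIC_PTR_EXCHANGE; the structure and names here are only an illustration of the comment above, not a proposed patch:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct monoctr { atomic_size_t add, sub; };

    struct monoctr_pair {
        _Atomic(struct monoctr *) active;  /* always &a or &b */
        struct monoctr a;
        struct monoctr b;
    };

    /* Writers update the buffer ->active points at; at GC time we flip
     * ->active to the other buffer and drain the retired one.  A writer
     * that loaded the old pointer just before the flip may still touch
     * the retired buffer, so the result is still only a hint. */
    static ssize_t
    monoctr_pair_xchg0(struct monoctr_pair *p)
    {
        struct monoctr *cur = atomic_load(&p->active);
        struct monoctr *next = (cur == &p->a) ? &p->b : &p->a;
        struct monoctr *old = atomic_exchange(&p->active, next);
        size_t add = atomic_exchange(&old->add, 0);
        size_t sub = atomic_exchange(&old->sub, 0);

        return (ssize_t)(add - sub);
    }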

Updated by normalperson (Eric Wong) almost 6 years ago

Юрий Соколов wrote:

Eric Wong wrote:

Reverted for now because of regressions in:

        bm_array_sample_100k__6k
        bm_array_sample_100k___10k
        bm_array_sample_100k___50k

I think the original unsigned + underflow avoidance code
prevented us from accounting free() properly, and we were
triggering GC more as a result.

Why is triggering GC less frequently considered a disadvantage?

Memory usage got way higher; I think those tests used around 8x more memory AND were 15% slower.

Of course too-frequent GC is bad, too, because it costs cycles and hurts locality; so we still need to find the right balance.

I think real applications will benefit from it. Why do these benchmarks suffer from it?

I think a big problem now is lazy sweeping isn't granular enough
and happens too late, increasing the chance of malloc
fragmentation.

We pay a huge cost for malloc accounting(*) but barely use that
data for making GC decisions.

(*) https://bugs.ruby-lang.org/issues/10238

Updated by normalperson (Eric Wong) almost 6 years ago

I wrote:

I think the original unsigned + underflow avoidance code
prevented us from accounting free() properly, and we were
triggering GC more as a result.

(Warning: thinking out loud while sleep-deprived)

What's dangerous about this patch correcting the free()-before-malloc() accounting is that it doesn't just break expectations around our current malloc limits; everybody who sets custom limits via environment variables also gets broken behavior.

So, I'm not sure what to do about it...

Updated by naruse (Yui NARUSE) almost 6 years ago

I once had a thought: instead of Ruby counting memory itself, bundle jemalloc and use jemalloc's memory profiler. jemalloc already counts allocated bytes internally. If Ruby accessed that, it could avoid executing extra atomic instructions.
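
For reference, jemalloc already exposes such counters through its mallctl() interface (or je_mallctl when jemalloc is built with a symbol prefix). A small sketch of reading the per-thread byte counts; the mallctl names are real jemalloc API, but how Ruby would feed this into GC accounting is exactly the open question:

    #include <jemalloc/jemalloc.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* jemalloc keeps monotonic per-thread counters of bytes allocated and
     * deallocated; reading them requires no extra atomics in the caller. */
    static void
    print_thread_malloc_stats(void)
    {
        uint64_t allocated = 0, deallocated = 0;
        size_t len = sizeof(uint64_t);

        if (mallctl("thread.allocated", &allocated, &len, NULL, 0) == 0 &&
            mallctl("thread.deallocated", &deallocated, &len, NULL, 0) == 0) {
            printf("net live bytes on this thread: %" PRId64 "\n",
                   (int64_t)(allocated - deallocated));
        }
    }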

Updated by normalperson (Eric Wong) almost 6 years ago

naruse (Yui NARUSE) wrote:

I once had a thought: instead of Ruby counting memory itself, bundle jemalloc and use jemalloc's memory profiler. jemalloc already counts allocated bytes internally. If Ruby accessed that, it could avoid executing extra atomic instructions.

Right. Ideally, we won't have to count malloc'ed bytes at all. What I would prefer is for malloc to allow executing hooks before it calls sbrk/mmap to get more memory from the kernel; that would be the ideal time for us to GC.

Updated by normalperson (Eric Wong) almost 6 years ago

I wrote:

Also, I think it would be beneficial to check malloc_increase
and do a lazy sweep step BEFORE calling malloc, since that should
improve cache locality in malloc (because they're usually LIFO).

Unfortunately, adding lazy sweep steps (gc_sweep_continue) doesn't
seem to work well under malloc pressure. The problem seems to
be the order of heap->pages is fixed when pages are added to the
heap, and lazy sweep always walks them in newest-to-oldest order;
not MRU order, even.

This means calling gc_sweep_continue inside
objspace_malloc_increase when malloc_increase > 4096 (or any
number BEFORE malloc_limit to reduce pause under gc_rest)
doesn't always free the heaps with the biggest malloc-ed areas.

Now, I wonder; what if we start tracking malloc usage at a
per-heap_page level? It would require internal API changes so
we can quickly figure out which heap_page the malloc-ed pointer
would go to:

void *rb_malloc(VALUE obj, size_t size);
void *rb_realloc(VALUE obj, void *ptr, size_t size, size_t old_size);
void *rb_calloc(VALUE obj, size_t nmemb, size_t size);
void rb_free(VALUE obj, void *ptr, size_t n);
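
Purely as a hypothetical illustration of what those prototypes could be charging against, the bookkeeping might look roughly like this; the struct and helpers below are invented for this sketch and are not Ruby internals:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical per-page bookkeeping: each heap page remembers how many
     * malloc'ed bytes its objects own, so a lazy sweep step could prefer
     * pages with the largest malloc footprint. */
    struct page_malloc_acct {
        size_t malloc_bytes;
    };

    /* The VALUE -> page lookup is elided; assume the caller has already
     * resolved the page owning obj, as rb_malloc(VALUE obj, size_t size)
     * implies. */
    static void *
    page_aware_malloc(struct page_malloc_acct *page, size_t size)
    {
        void *ptr = malloc(size);
        if (ptr) page->malloc_bytes += size;
        return ptr;
    }

    static void
    page_aware_free(struct page_malloc_acct *page, void *ptr, size_t size)
    {
        free(ptr);
        /* caller supplies the size, as in the proposed rb_free(obj, ptr, n) */
        page->malloc_bytes -= size;
    }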

Updated by ko1 (Koichi Sasada) almost 6 years ago

Sorry for late response.

On 2018/06/13 18:59, Eric Wong wrote:

Unfortunately, adding lazy sweep steps (gc_sweep_continue) doesn't
seem to work well under malloc pressure.

One simple solution is to disable lazy sweep when the GC reason is malloc (malloc_increase). That is not a common case, so it could be acceptable. How about it?

The problem seems to
be the order of heap->pages is fixed when pages are added to the
heap, and lazy sweep always walks them in newest-to-oldest order;
not MRU order, even.

Does MRU (or the order in general) affect performance?

https://en.wikipedia.org/wiki/Cache_replacement_policies#Most_recently_used_(MRU)

introduces some examples, but I'm not sure they fit MRI.

This means calling gc_sweep_continue inside
objspace_malloc_increase when malloc_increase > 4096 (or any
number BEFORE malloc_limit to reduce pause under gc_rest)
doesn't always free the heaps with the biggest malloc-ed areas.

Now, I wonder; what if we start tracking malloc usage at a
per-heap_page level? It would require internal API changes so
we can quickly figure out which heap_page the malloc-ed pointer
would go to:

void *rb_malloc(VALUE obj, size_t size);
void *rb_realloc(VALUE obj, void *ptr, size_t size, size_t old_size);
void *rb_calloc(VALUE obj, size_t nmemb, size_t size);
void rb_free(VALUE obj, void *ptr, size_t n);

I have doubts about this idea because we would need additional computation and storage:

  • Keep per-heap_page malloc'ed size (overhead on each malloc/free)
  • Order by per-heap_page malloc'ed size

IMO these costs don't pay off in overall performance.

The API can be replaced, but not for this idea, IMO.

--
// SASADA Koichi at atdot dot net

Updated by normalperson (Eric Wong) almost 6 years ago

Koichi Sasada wrote:

Sorry for late response.

No problem.

On 2018/06/13 18:59, Eric Wong wrote:

Unfortunately, adding lazy sweep steps (gc_sweep_continue) doesn't
seem to work well under malloc pressure.

One simple solution is to disable lazy sweep when the GC reason is malloc (malloc_increase). That is not a common case, so it could be acceptable. How about it?

So revert r48603? (commit 3a26241da3aec3d20dfc408a32de1c539455c89b) It should reduce fragmentation, but maybe we won't need to change things once the transient heap lands.

The problem seems to
be the order of heap->pages is fixed when pages are added to the
heap, and lazy sweep always walks them in newest-to-oldest order;
not MRU order, even.

Does MRU (or the order in general) affect performance?

https://en.wikipedia.org/wiki/Cache_replacement_policies#Most_recently_used_(MRU)

introduces some examples, but I'm not sure they fit MRI.

My theory was that short-lived objects can be freed first, so we would try to optimize the order of pages in the sweeping phase to favor pages with the newest (and theoretically shortest-lived) objects.

I didn't find any improvement from gcbench-rdoc, though
(but maybe my patch is bogus):
https://80x24.org/spew/20180622081017.20225-1-e@80x24.org/raw

Your transient heap idea seems much better than anything I've thought of, so let's focus on that :)
