Feature #9113

Ship Ruby for Linux with jemalloc out-of-the-box

Added by Sam Saffron 5 months ago. Updated about 1 month ago.

[ruby-core:58350]
Status: Feedback
Priority: Normal
Assignee: -
Category: build
Target version: -

Description

libc's malloc is a problem: it fragments badly, meaning forked processes share less memory, and it is slow compared to tcmalloc or jemalloc.

Both jemalloc and tcmalloc are heavily battle-tested and stable.

Two years ago Redis picked up the jemalloc dependency; see http://oldblog.antirez.com/post/everything-about-redis-24.html

To quote antirez:

But an allocator is a serious thing. Since we introduced the specially encoded data types Redis started suffering from fragmentation. We tried different things to fix the problem, but basically the Linux default allocator in glibc sucks really, really hard.


I recently benchmarked Discourse with tcmalloc, jemalloc, and the default allocator, and noticed two very important things:

  • median request time is reduced by up to 10% (under both)
  • PSS (proportional set size) is reduced by 10% under jemalloc and 8% under tcmalloc

We can always use LD_PRELOAD to pull these in, but my concern is that standard distributions are shipping a far-from-optimal memory allocator. It would be awesome if the build, out of the box, simply checked whether it was on Linux (e.g. https://github.com/antirez/redis/blob/unstable/src/Makefile#L30-L34) and then used jemalloc instead.

History

#1 Updated by Yui NARUSE 5 months ago

  • Status changed from Open to Assigned
  • Assignee set to Motohiro KOSAKI

Could you comment on this?

#2 Updated by Nobuyoshi Nakada 5 months ago

  • Status changed from Assigned to Third Party's Issue

If the system malloc is replaced with one of those newer libraries, ruby will use it.
Otherwise, configure with LIBS=-ljemalloc.

#3 Updated by Nobuyoshi Nakada 5 months ago

  • Category set to build
  • Assignee deleted (Motohiro KOSAKI)

#4 Updated by Martin Dürst 5 months ago

On one level, this feels like a no-brainer. But then the question is
why the standard memory allocator in libc hasn't been improved.

I can imagine all kinds of reasons, from "alternatives use too much
memory" to "not invented here". Any background info?

Regards, Martin.

On 2013/11/15 12:08, sam.saffron (Sam Saffron) wrote:

https://bugs.ruby-lang.org/issues/9113


I recently benchmarked Discourse with tcmalloc, jemalloc, and the default allocator, and noticed two very important things:

  • median request time is reduced by up to 10% (under both)
  • PSS (proportional set size) is reduced by 10% under jemalloc and 8% under tcmalloc

We can always use LD_PRELOAD to pull these in, but my concern is that standard distributions are shipping a far-from-optimal memory allocator. It would be awesome if the build, out of the box, simply checked whether it was on Linux (e.g. https://github.com/antirez/redis/blob/unstable/src/Makefile#L30-L34) and then used jemalloc instead.

#5 Updated by Sam Saffron 5 months ago

@martin here is a great (oldish) article by Facebook about this: http://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919

@nobu I guess my suggestion here is to include the jemalloc source in the repo and compile it on demand for Linux (by default, with an option to opt out). That way everyone will pick this change up and it becomes the "officially blessed" allocator.

At GitHub @tmm1 has been using tcmalloc for years now; its fragmentation behavior is not as good as jemalloc's (tmm1 said its performance is better, but I think it's time to re-test because I found jemalloc to perform better).

Regardless, the libc allocator is a problem and default builds should not use it.

Firefox has been using jemalloc for years and years; it is safe for production: http://glandium.org/blog/?p=2581

#6 Updated by Nobuyoshi Nakada 5 months ago

  • Status changed from Third Party's Issue to Rejected

Then it is a task for package maintainers.

#7 Updated by Motohiro KOSAKI 5 months ago

@duerst That is not correct. The glibc folks and I are working on several improvements to malloc. Moreover, each allocator has different pros and cons: jemalloc handles some workloads better, and the glibc allocator handles other workloads better. There is no single perfect allocator. That's our difficulty.

@sam.saffron The Facebook page you pointed to is out of date. It compares glibc 2.5 vs jemalloc 2.1.0, but the latest versions are glibc 2.18 and jemalloc 3.4.1. Also, glibc malloc and jemalloc share a lot of their basic design, so that document is completely useless here. If
you have workloads on which glibc doesn't work well, please make and share a benchmark instead of rumors. Then we can improve the bottlenecks.

#8 Updated by Motohiro KOSAKI 5 months ago

That does NOT mean jemalloc has no chance. But we can't discuss a performance issue if nobody has numbers.

#9 Updated by Yui NARUSE 5 months ago

  • Status changed from Rejected to Feedback

#10 Updated by Eric Wong 3 months ago

Btw, jemalloc 3.5 includes an updated non-standard experimental API.

It looks like it has the ability to specify different arenas for
allocation (via MALLOCX_ARENA(a)). Perhaps it could be used to
distinguish long-lived from short-lived allocations.
Probably worth experimenting with some day...
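
For illustration, a minimal sketch of what using that API could look like (a hedged example, not a proposal for ruby itself; the long/short-lived split and the sizes below are assumptions, and it builds against jemalloc 3.5 with -ljemalloc):

#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    unsigned arena;
    size_t len = sizeof(arena);

    /* "arenas.extend" appends a fresh arena and returns its index;
     * here it is reserved for long-lived allocations */
    if (mallctl("arenas.extend", &arena, &len, NULL, 0) != 0) {
        fputs("arenas.extend failed\n", stderr);
        return 1;
    }

    /* long-lived allocation pinned to the dedicated arena
     * (note: mallocx() requires size != 0) */
    void *long_lived = mallocx(4096, MALLOCX_ARENA(arena));

    /* short-lived allocation from the default per-thread arena */
    void *short_lived = mallocx(4096, 0);

    dallocx(short_lived, 0);
    dallocx(long_lived, 0);
    return 0;
}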

#11 Updated by Eric Wong 3 months ago

I tried jemalloc 3.5.0 vs eglibc 2.13-38 (Debian x86_64)

http://80x24.org/bmlog-20140126-003136.7320.gz

Mostly close results, but I think our "make benchmark" suite is
incomplete and we need more fork/concurrency-intensive benchmarks of
large apps.

iofileread and vm2_bigarray seem to be big losses because jemalloc
tends to release large allocations back to the kernel more aggressively
(and the kernel must zero that memory).

I have applied two patches for improved benchmark consistency:
https://bugs.ruby-lang.org/issues/5985#change-44442
https://bugs.ruby-lang.org/issues/9430
(Note: I still don't trust the vm_thread* benchmarks much; they seem
very inconsistent even with no modifications.)

#12 Updated by Sam Saffron 2 months ago

I can confirm two findings:

When heaps are small you barely notice a difference.
When heaps grow and general memory fragmentation grows, jemalloc is far better.

I see a 6% reduction in RSS running the Discourse bench on 2.1.0: https://github.com/discourse/discourse/blob/master/script/bench.rb

An artificial test is:

@retained = []

MAX_STRING_SIZE = 100

# Interleave `allocate_count` short-lived string allocations with
# `retain_count` long-lived ones, chosen at random; short-lived
# strings are dropped whenever `chunk` exceeds `chunk_size`.
def stress(allocate_count, retain_count, chunk_size)
  chunk = []
  while retain_count > 0 || allocate_count > 0
    if retain_count == 0 || (Random.rand < 0.5 && allocate_count > 0)
      chunk << " " * (Random.rand * MAX_STRING_SIZE).to_i
      allocate_count -= 1
      if chunk.length > chunk_size
        chunk = [] # release the short-lived strings
      end
    else
      @retained << " " * (Random.rand * MAX_STRING_SIZE).to_i
      retain_count -= 1
    end
  end
end

start = Time.now
stress(1_000_000, 600_000, 200_000)
puts "Duration: #{(Time.now - start).to_f}"

puts `ps aux | grep #{Process.pid} | grep -v grep`

For glibc

sam@ubuntu ~ % time ruby stress_mem.rb
Duration: 0.705922489
sam      17397 73.0  2.5 185888 156884 pts/10  Sl+  10:37   0:00 ruby stress_mem.rb
ruby stress_mem.rb  0.78s user 0.08s system 100% cpu 0.855 total

For jemalloc 3.5.0

Duration: 0.676871705
sam      17428 70.0  2.3 186248 144800 pts/10  Sl+  10:37   0:00 ruby stress_mem.rb
LD_PRELOAD=/home/sam/Source/jemalloc-3.5.0/lib/libjemalloc.so ruby   0.68s user 0.09s system 100% cpu 0.771 total

You can see roughly 8% better RSS with jemalloc.

Note: the more iterations you add, the better jemalloc does. Upping allocations to 10 million gives:

jemalloc 200 MB RSS vs glibc 230 MB RSS

glibc fragments at a far faster rate than jemalloc.

#13 Updated by Sam Saffron 2 months ago

Note, this pattern of

  1. retaining a large number of objects,
  2. allocating a big chunk of objects (and releasing them),
  3. repeating (2)

is very representative of web apps / Rails apps. For our application, requests range between 20k and 200k allocations.

It is very much a scenario we want to optimise for.

On another note, Rust just picked jemalloc, and golang uses a fork of tcmalloc: http://golang.org/src/pkg/runtime/malloc.h?h=tcmalloc

#14 Updated by Eric Wong 2 months ago

sam.saffron@gmail.com wrote:

An artificial test is:

@retained = []

MAX_STRING_SIZE = 100

def stress(allocate_count, retain_count, chunk_size)

Note: I think we should seed the RNG to a constant to have
consistent data between runs

  srand(123)

  chunk = []
  while retain_count > 0 || allocate_count > 0
    if retain_count == 0 || (Random.rand < 0.5 && allocate_count > 0)
      chunk << " " * (Random.rand * MAX_STRING_SIZE).to_i
      allocate_count -= 1
      if chunk.length > chunk_size
        chunk = []
      end
    else
      @retained << " " * (Random.rand * MAX_STRING_SIZE).to_i
      retain_count -= 1
    end
  end
end

Sam: Thank you!

I think we should integrate this test into the mainline benchmark suite.
Perhaps even provide an option to run all the existing tests with
the big @retained array.

ko1: what do you think?

#15 Updated by Sam Saffron 2 months ago

@Eric

Sure, the bench needs a bit more love to be totally representative of a Rails request. Also, this test will help ko1 a lot in improving the oldgen promotion algorithm; we are talking about changing oldgen promotion to either use additional flags (as a counter) or promote only on major GC.

Either change will slash RSS in this test.

sam@ubuntu ~ % rbenv shell 2.1.0 
sam@ubuntu ~ % ruby stress_mem.rb 
Duration: 5.459891703
sam      17870  109  3.8 267076 238732 pts/10  Sl+  11:03   0:05 ruby stress_mem.rb
sam@ubuntu ~ % rbenv shell 2.0.0-p353       
sam@ubuntu ~ % ruby stress_mem.rb    
Duration: 7.616282557
sam      17986 95.6  2.0 151120 125684 pts/10  Sl+  11:04   0:07 ruby stress_mem.rb
sam@ubuntu ~ % 

This is basically a repro of the memory growth under 2.1.0 that people are seeing:

238 MB in Ruby 2.1 vs 125 MB in 2.0

#16 Updated by Nobuyoshi Nakada 2 months ago

I'm absolutely against including external libraries in the ruby repository itself, e.g., libyaml.
It may not be the worst idea to bundle them with the tarballs, though.

#17 Updated by Sam Saffron 2 months ago

@nobusan I think that would be a reasonable approach.

@eric / @ko1 / everyone

Here are the results of running that script across every fifth build in the last year (with a seeded rand):

https://gist.github.com/SamSaffron/9162366

I think big jumps should be investigated.

#18 Updated by Koichi Sasada about 1 month ago

(2014/02/19 9:08), Eric Wong wrote:

Btw, I also hope to experiment with a slab allocator since many internal
objects are around the same size (like an OS kernel). This idea is
originally from the Solaris kernel, but is also in Linux and FreeBSD. One
benefit of slab allocators over a general-purpose malloc is that malloc
has too little context/information to make some decisions:

  • long-lived vs short-lived (good for CoW)
  • shared between threads or not
  • future allocations of the same class

Notes on slab: I don't think caching constructed objects like the
reference Solaris implementation does is necessary (or even good),
since it should be possible to transparently merge objects of different
classes (like SLUB in Linux, I think).

Anyways, I think jemalloc is a great general-purpose malloc for things
that don't fit well into slabs. And it should be easy to let a slab
implementation switch back to general-purpose malloc for
testing/benching.

Recently I've been working on this topic:

(1) Lifetime-oriented allocation, similar to a copying GC
(2) CoW-friendly (read-only) memory

More detail about (2):
The following figure shows the stacked memory usage (snapshot) collected
by valgrind/massif on the Discourse benchmark, with @sam's help.
http://www.atdot.net/fp_store/f.69bk1n/file.copipa-temp-image.png

Interestingly, 50 MB is consumed by iseq (iseq.c, compile.c). Most of
this data is read-only, so it could be made more CoW-friendly. Currently we
mix read-only data and r/w data such as inline caches.

There are several ideas, and I believe this is a good topic to consider
for Ruby 2.2.
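
To make the read-only / read-write split concrete, here is a minimal sketch (the struct and field names are hypothetical illustrations, not the actual CRuby iseq layout):

/* read-only after compilation: pages stay clean and CoW-friendly */
struct iseq_ro {
    const void *encoded;   /* bytecode body */
    int size;
};

/* mutated at runtime: dirtied per process */
struct iseq_rw {
    void *inline_cache;    /* e.g. method/constant caches */
};

struct iseq {
    const struct iseq_ro *ro;  /* shareable across forks */
    struct iseq_rw *rw;        /* grouped so dirty pages stay small */
};

Grouping the r/w parts together would mean a fork only dirties the pages holding inline caches, not the whole instruction sequence.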

--
// SASADA Koichi at atdot dot net

#19 Updated by Eric Wong about 1 month ago

ko1@atdot.net wrote:

(2014/02/19 9:08), Eric Wong wrote:

Btw, I also hope to experiment with a slab allocator since many internal
objects are around the same size (like an OS kernel). This idea is
originally from the Solaris kernel, but is also in Linux and FreeBSD. One
benefit of slab allocators over a general-purpose malloc is that malloc
has too little context/information to make some decisions:

  • long-lived vs short-lived (good for CoW)
  • shared between threads or not
  • future allocations of the same class

Notes on slab: I don't think caching constructed objects like the
reference Solaris implementation does is necessary (or even good),
since it should be possible to transparently merge objects of different
classes (like SLUB in Linux, I think).

Anyways, I think jemalloc is a great general-purpose malloc for things
that don't fit well into slabs. And it should be easy to let a slab
implementation switch back to general-purpose malloc for
testing/benching.

Recently I've been working on this topic:

(1) Lifetime-oriented allocation, similar to a copying GC
(2) CoW-friendly (read-only) memory

Yes. We should be able to do moving/defragmentation of long-lived
internal allocations, even.

Interestingly, 50 MB is consumed by iseq (iseq.c, compile.c). Most of
this data is read-only, so it could be made more CoW-friendly. Currently we
mix read-only data and r/w data such as inline caches.

Yes, also the iseq struct is huge (300+ bytes on 64-bit). I think we
can shrink it (like I did with struct vtm/time_object) and move r/w data
off to a different area.

There are several ideas, and I believe this is a good topic to consider
for Ruby 2.2.

OK; especially since this should have no public API breakage.
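
For reference, a minimal sketch of the slab idea discussed above (hypothetical names, not CRuby code): fixed-size slots carved out of page-sized slabs, with odd sizes falling back to a general-purpose malloc such as jemalloc.

#include <stdlib.h>

#define SLAB_SIZE      4096
#define SLOT_SIZE      64    /* one object size class per cache */
#define SLOTS_PER_SLAB (SLAB_SIZE / SLOT_SIZE)

struct slab_cache {
    void *free_list;         /* next free slot, linked in place */
};

/* carve a new slab into slots and thread them onto the free list */
static void slab_grow(struct slab_cache *c)
{
    char *slab = malloc(SLAB_SIZE);
    if (!slab) return;
    for (size_t i = 0; i < SLOTS_PER_SLAB; i++) {
        void **slot = (void **)(slab + i * SLOT_SIZE);
        *slot = c->free_list;
        c->free_list = slot;
    }
}

static void *slab_alloc(struct slab_cache *c, size_t size)
{
    if (size > SLOT_SIZE)
        return malloc(size); /* fall back to general-purpose malloc */
    if (!c->free_list)
        slab_grow(c);
    void **slot = c->free_list;
    if (!slot) return NULL;
    c->free_list = *slot;
    return slot;
}

/* sketch only: assumes ptr came from this cache's slabs; a real
 * implementation would track slab membership to free fallbacks */
static void slab_free(struct slab_cache *c, void *ptr)
{
    *(void **)ptr = c->free_list;
    c->free_list = ptr;
}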

#20 Updated by Eric Wong about 1 month ago

Sam: btw, if you have time, can you prepare a patch which integrates
jemalloc with the build/tarball dist?

We should also see whether using the non-standard jemalloc API is worth it
(with fallbacks to standard APIs on systems where jemalloc is unsupported).
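
One possible shape for such a fallback, as a sketch (HAVE_MALLOCX and the mem_* names are hypothetical, not existing CRuby identifiers):

#ifdef HAVE_MALLOCX
# include <jemalloc/jemalloc.h>
# define mem_alloc(size, flags) mallocx((size), (flags)) /* size must be != 0 */
# define mem_free(ptr, flags)   dallocx((ptr), (flags))
#else
# include <stdlib.h>
# define mem_alloc(size, flags) malloc(size) /* flags ignored */
# define mem_free(ptr, flags)   free(ptr)
#endif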
