Bug #11101
closedForking is killing my memory when running GC
Added by tkalmus (Thomas Kalmus) over 9 years ago. Updated over 5 years ago.
Description
Memory is duplicated when forking and calling GC in 2.2.1. The main issue here is that we are operating with huge data, going up to 3GB, and one fork is enough to kill our machine.
We have written a small program that reproduces the issue (see attached file).
The program instantiates an object and then forks into two processes.
The GC is called in the child process.
The memory allocation (as given by /proc/pid/smaps) changes from shared to private, thereby indicating a doubling of memory consumption.
Here is the output of the program (size is in mb):
ruby version 2.1.3
time pid message shared private
4.011s 4723 Parent pre GC 68 0
4.013s 4737 Child pre GC 68 0
8.019s 4723 Parent post GC 5 62
8.093s 4737 Child post GC 5 66
We have tested the program on Ubuntu 14.04 with ruby 1.9.3 and 2.2.1. The tests have been performed on a freshly installed Ubuntu machine.
We have also tried to fork 10 children and see a 10 doubling of the memory consumption, the issue only occurs after running the GC.
Files
Updated by normalperson (Eric Wong) over 9 years ago
Is this with or without jemalloc?
In either case, try disabling transparent hugepages in Linux, THP causes
some problems with Postgres and Redis forking, too (it is far easier to
dirty and trigger CoW on 2M hugepages than regular 4K pages).
Updated by tkalmus (Thomas Kalmus) over 9 years ago
This is without jemalloc. I made the changes just like requested, but the issue still persists. I get what I want, when I turn the GC off (GC.disabled), but this isn't an option since, anyway after a time my memory is filled with garbage.
Updated by tkalmus (Thomas Kalmus) over 9 years ago
I have looked at the thing closely and it seems that the GC is not CoW friendly for sparse arrays. If I create an array of undefined size and populate it with random numbers in string format, the GC will launch the CoW in the fork. However if the array has fixed size, or if I make a huge string of, for example, 30MB instead then I don't see the issue. Unfortunately, I am using a lot of dynamic arrays. Is there a way to bypass this ?
Updated by hsbt (Hiroshi SHIBATA) over 9 years ago
- Assignee changed from tkalmus (Thomas Kalmus) to ko1 (Koichi Sasada)
- Priority changed from 7 to Normal
Updated by ko1 (Koichi Sasada) over 9 years ago
Sorry for long absent.
Could you try https://rubygems.org/gems/nakayoshi_fork ?
With this gem, your script shows on my environment:
ruby version 2.3.0
time pid message shared private
4.021s 25883 Parent pre GC 69 0
4.021s 25885 Child pre GC 69 0
8.015s 25885 Child post GC 63 6
8.052s 25883 Parent post GC 63 6
Updated by ko1 (Koichi Sasada) over 9 years ago
Results on versions:
Without nakayoshi-fork:
ruby version 1.9.3
time pid message shared private
4.034s 25939 Parent pre GC 85 0
4.034s 25941 Child pre GC 85 0
8.052s 25941 Child post GC 5 84
8.068s 25939 Parent post GC 5 80
terminate child
ruby version 2.0.0
time pid message shared private
4.030s 25944 Parent pre GC 67 0
4.030s 25947 Child pre GC 67 0
8.045s 25947 Child post GC 5 65
8.066s 25944 Parent post GC 5 62
terminate child
ruby version 2.1.5
time pid message shared private
4.032s 25950 Parent pre GC 68 0
4.032s 25952 Child pre GC 68 0
8.048s 25952 Child post GC 5 66
8.064s 25950 Parent post GC 5 63
terminate child
ruby version 2.2.3
time pid message shared private
4.045s 25957 Child pre GC 71 0
4.045s 25955 Parent pre GC 71 0
8.072s 25957 Child post GC 5 67
8.081s 25955 Parent post GC 4 66
terminate child
ruby version 2.3.0
time pid message shared private
4.032s 25964 Child pre GC 70 0
4.032s 25962 Parent pre GC 70 0
8.053s 25964 Child post GC 5 66
8.064s 25962 Parent post GC 5 65
terminate child
With nakayoshi-fork:
ruby version 2.0.0
time pid message shared private
4.032s 25974 Parent pre GC 67 0
4.032s 25976 Child pre GC 67 0
8.065s 25974 Parent post GC 5 62
8.065s 25976 Child post GC 5 66
terminate child
ruby version 2.1.5
time pid message shared private
4.071s 25982 Child pre GC 70 1
4.071s 25980 Parent pre GC 70 1
8.062s 25982 Child post GC 64 7
8.101s 25980 Parent post GC 64 7
terminate child
ruby version 2.2.3
time pid message shared private
4.034s 25985 Parent pre GC 70 2
4.036s 25987 Child pre GC 70 2
8.034s 25987 Child post GC 63 8
8.073s 25985 Parent post GC 63 8
terminate child
ruby version 2.3.0
time pid message shared private
4.019s 26035 Child pre GC 70 1
4.020s 26032 Parent pre GC 70 1
8.019s 26035 Child post GC 63 8
8.053s 26032 Parent post GC 63 8
terminate child
I'm not sure why Ruby 2.0.0 is not CoW frinedly.
Updated by ko1 (Koichi Sasada) about 9 years ago
- Status changed from Open to Feedback
Updated by olivierlacan (Olivier Lacan) over 8 years ago
Has anyone followed-up on this issue ever? I ran into this yesterday after deploying 2.2 to an app server and noticing twice slower response times than on 2.1.
Updated by normalperson (Eric Wong) over 8 years ago
hi@olivierlacan.com wrote:
Has anyone followed-up on this issue ever? I ran into this yesterday
after deploying 2.2 to an app server and noticing twice slower
response times than on 2.1.
Did you try the nakayoshi_fork gem as ko1 suggested?
Performance is unlikely to be twice as slow unless you're using enough
memory to hit swap... How big are your processes on 2.1 and 2.2?
Is this on 64-bit?
What else can you tell us about your setup?
(malloc used, GC tunings, etc).
Thanks.
Updated by olivierlacan (Olivier Lacan) over 7 years ago
normalperson (Eric Wong) wrote:
Did you try the nakayoshi_fork gem as ko1 suggested?
Sorry for replying so late. I think I missed the notification when you originally replied.
I believe nakayoshi_fork did help a bit to bring response times near to their pre 2.2
averages.
Performance is unlikely to be twice as slow unless you're using enough
memory to hit swap...
To be fair, it wasn't. I think we had a 30 ms average that jumped to 60 at first and
leveled of at 50 ms with nakayoshi_fork. I don't believe we were close to hitting swap, and
I think I was seeing this on multiple different servers at the time.
How big are your processes on 2.1 and 2.2?
I'm not sure what you're asking there.
Is this on 64-bit?
Yes.
What else can you tell us about your setup?
(malloc used, GC tunings, etc).
We're not using any GC tunings and running with Passenger Enterprise on Debian 8. No jemalloc if that's what you're wondering.
passenger_max_pool_size: 16
max_instances: 12
Interestingly, upgrading to 2.3.4 (still with nakayoshi_fork) brought us back to pre-2.2 response time averages of about 35 ms, and the averages are much more stable as well.
Updated by normalperson (Eric Wong) over 7 years ago
hi@olivierlacan.com wrote:
normalperson (Eric Wong) wrote:
Did you try the nakayoshi_fork gem as ko1 suggested?
Sorry for replying so late. I think I missed the notification when you originally replied.
I believe nakayoshi_fork did help a bit to bring response times near to their pre 2.2
averages.
No worries about being late, good to know nakayoshi_fork helped.
Performance is unlikely to be twice as slow unless you're using enough
memory to hit swap...To be fair, it wasn't. I think we had a 30 ms average that
jumped to 60 at first and leveled of at 50 ms with
nakayoshi_fork. I don't believe we were close to hitting swap,
and I think I was seeing this on multiple different servers at
the time.
You don't sound too confident about whether or not you're
swapping :) Actual memory in swap is harmless if it's static,
what hurts is swapping in and out as that requires I/O.
If you're on Linux, try the "vmstat 1" command and watch
"si" and "so" columns for swap-in/swap-out.
vmstat should be commonly installed via "procps" on Debian-based
systems, at least.
How big are your processes on 2.1 and 2.2?
I'm not sure what you're asking there.
I mean how much memory are your Ruby processes using? (Again,
Linux-specific), that would be the RES column of "top", and you
can get more-fine grained output via "pmap -x #{pid}" for any
particular process. The "[anon]" memory is what's actually
allocated by Ruby.
Interestingly, upgrading to 2.3.4 brought us back to pre-2.2
response time averages of about 35 ms, and the averages are
much more stable as well.
Good to know...
However, I'm wondering if that was the result of general memory
reductions in 2.3.x and if your results are consistent to Ruby
2.4.x or to other applications.
I say that because (AFAIK) there were no significant changes to
the GC for 2.3 (but some bugfixes and minor improvements)
Updated by jeremyevans0 (Jeremy Evans) over 5 years ago
- Status changed from Feedback to Closed