We have a Ruby application running on our production server and noticed that it regularly crashes with out-of-memory errors.
After months of investigation, I narrowed the case down to the examples (1/2).
After digging through the Ruby sources and running test code, I found that GC stops working after recovering from a native stack overflow error.
The relevant code probably appeared in 2.2, in https://github.com/ruby/ruby/commit/0c391a55d3ed4637e17462d9b9b8aa21e64e2340
where ruby_disable_gc_stress became ruby_disable_gc.
I'm having a similar issue when running tests with Capybara, which starts an additional server in a new thread.
If a problem in my Rails app raises SystemStackError in the server thread, I am left without a working GC and memory just keeps growing. I have tried calling GC.start manually after that, but it doesn't help: GC.stat displays the same major/minor counts before and after.
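A minimal sketch of the check described above: rescue a SystemStackError in a thread, then compare GC.stat around an explicit GC.start. The `overflow` and `gc_counts_after_overflow` helpers are hypothetical names, and the plain recursion is only a stand-in for the original examples; depending on the build it may hit the VM stack limit rather than the native one, so it is not guaranteed to trigger the bug.

```ruby
# Hypothetical helper: recurse until Ruby raises SystemStackError.
def overflow
  overflow
end

# Hypothetical helper: overflow the stack inside a thread, rescue the error,
# then record the total GC run count before and after an explicit GC.start.
def gc_counts_after_overflow
  rescued = Thread.new do
    begin
      overflow
      false
    rescue SystemStackError
      true
    end
  end.value

  before = GC.stat[:count]
  GC.start
  after = GC.stat[:count]

  { rescued: rescued, before: before, after: after }
end

result = gc_counts_after_overflow
puts result.inspect
if result[:after] > result[:before]
  puts "GC.start still triggers a collection"
else
  puts "GC count did not change; GC looks disabled"
end
```

On an unaffected interpreter the count should go up after GC.start; an unchanged count matches the symptom reported here.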
In Ruby 2.2 it looks like if a stack overflow is raised in a thread, the thread just dies. (I was running 2.2.0, not 2.2.3.)
In Ruby trunk (and also 2.2.3) it looks like if a stack overflow is raised in a thread twice, the whole process hangs the second time and the only way to stop it is kill -9.
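A rough way to probe the double-overflow case without hanging the current process is to run it in a child interpreter and kill the child on timeout. This is only a sketch: `CHILD_SCRIPT`, `double_overflow_status` and the 10-second limit are assumptions for illustration, and a plain Ruby recursion may not reach the native stack limit on every build.

```ruby
require "rbconfig"
require "timeout"

# Script for the child process: overflow the stack twice in one thread,
# rescuing SystemStackError each time.
CHILD_SCRIPT = <<'RUBY'
def overflow
  overflow
end

Thread.new do
  2.times do
    begin
      overflow
    rescue SystemStackError
      # swallow the error and try again
    end
  end
end.join
RUBY

# Hypothetical helper: returns :completed, :crashed or :hung depending on
# how the child interpreter behaves within the given number of seconds.
def double_overflow_status(seconds = 10)
  pid = Process.spawn(RbConfig.ruby, "-e", CHILD_SCRIPT)
  begin
    Timeout.timeout(seconds) { Process.wait(pid) }
    $?.success? ? :completed : :crashed
  rescue Timeout::Error
    # Equivalent of kill -9 for a child that no longer responds.
    Process.kill("KILL", pid)
    Process.wait(pid)
    :hung
  end
end

status = double_overflow_status
puts "child status: #{status}"
```

A `:hung` result would correspond to the behaviour described for trunk and 2.2.3.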
Unfortunately, this is a known problem (we cannot correctly capture a second machine stack overflow).
1st machine stack overflow
  SEGV
  check machine stack overflow
  raise an error from the signal handler (*1) by longjmp.
2nd machine stack overflow
  SEGV
  the signal status is still "signaling", so the OS cannot deliver the signal correctly...
The correct fix is to restore the signal status using sigsetjmp/siglongjmp at (*1).
However, on Linux 2.x, siglongjmp is much slower than longjmp, so we continued to use longjmp, at least as of the last time we discussed this issue. We can't slow down the Ruby interpreter for such a corner case.
However, Linux 2.x is an older OS now, so we can change this.
(BTW, I'm still using Linux 2.6 on several machines.)