Bug #1682
closedLow probability error in gc
Description
=begin
I am trying to track a very hard to reproduce one in a million
segfault in ruby that seems to be a corrupted heap.
So I'm am using valgrind on the latest (24th Jun 2009) snapshot from
ftp://ftp.ruby-lang.org/pub/ruby/stable-snapshot.tar.gz
Yup, you've seen this one before and declared it innocuous, but
patience, follow my chain of logic and you will realise in certain
corner cases it may be fatal.
Valgrind reports the... and the standard answer is yes, yes, we know
it's Rubies conservative garbage collection it doesn't matter, but
keep reading and I will show you how it may on occasions matter.
==5709== Conditional jump or move depends on uninitialised value(s)
==5709== at 0x8075383: gc_mark (ruby.h:731)
==5709== by 0x8074F98: mark_locations_array (gc.c:708)
==5709== by 0x8075701: garbage_collect (gc.c:1497)
==5709== by 0x8075FFC: rb_newobj (gc.c:453)
==5709== Uninitialised value was created by a stack allocation
==5709== at 0x8095D84: rb_node_newnode (parse.y:4591)
If we have a look at that location in parse.y we have this code.¶
NODE*
rb_node_newnode(type, a0, a1, a2)
enum node_type type;
VALUE a0, a1, a2;
{
NODE n = (NODE)rb_newobj();
ie. The uninitialized value is the "n". Nothing special about this
line, we get to this state from many other locations.
So what is happening here...
..when a new object is allocated an nothing is available in on the free list,
The gc runs around marking everything that is still "live" and then
sweeps up the leavings.
But what is still "live"? Anything currently on the stack for example.
But what about things like "n" which are on the stack, but not initialized yet?
No problem, if it is a number that, if viewed as a pointer, points
into a chunk of heap. We regard it as "live" anyway. If we're wrong
and it's just a number... or even a random string of uninitialized
bits left on the stack (like "n" above), no problem.. we just don't
garbage collect it and waste a bit of ram.
FLUSH_REGISTER_WINDOWS;
/* This assumes that all registers are saved into the jmp_buf (and stack) */
rb_setjmp(save_regs_gc_mark);
mark_locations_array((VALUE*)save_regs_gc_mark, sizeof(save_regs_gc_mark) / sizeof(VALUE *));
#if STACK_GROW_DIRECTION < 0
rb_gc_mark_locations((VALUE*)STACK_END, rb_gc_stack_start);
Inicidentally rb_gc_mark_locations is implemented in terms of
mark_locations_array so lets have a closer look at
mark_locations_array
static inline int
is_pointer_to_heap(ptr)
void *ptr;
{
register RVALUE *p = RANY(ptr);
register RVALUE *heap_org;
register long i;
if (p < lomem || p > himem) return Qfalse;
if ((VALUE)p % sizeof(RVALUE) != 0) return Qfalse;
/* check if p looks like a pointer */
for (i=0; i < heaps_used; i++) {
heap_org = heaps[i].slot;
if (heap_org <= p && p < heap_org + heaps[i].limit)
return Qtrue;
}
return Qfalse;
}
static void
mark_locations_array(x, n)
register VALUE *x;
register long n;
{
VALUE v;
while (n--) {
v = *x;
if (is_pointer_to_heap((void *)v)) {
gc_mark(v, 0);
}
x++;
}
}
So mark_locations_array just scans along an arbitary chunk of
ram.. eg. the stack. Treating every four bytes as a pointer (whether
it is or not).
The is_pointer_to_heap predicate checks to see if that number points
to a location in the heap...
if (p < lomem || p > himem) return Qfalse;
and that is typically the line that valgrind whinges about.
Occasionally an uninitialized value just happens to pass that first
crude check and then valgrind whinges on this line.
if (heap_org <= p && p < heap_org + heaps[i].limit)
return Qtrue;
But then we can be fairly confident that p is pointing somewhere
within a valid ruby object that we should mark as live. Perhaps it
shouldn't be, perhaps it was just an accident that some number just
happen to be in that range, so we on rare occassions waste some ram,
so what.
So far, perfect.
Alas, instead of returning the start of that ruby object, we just
return true.
Actually the thing living between heaps[i].slot and
heaps[i].slot+heaps[i].limit is typically an array of roughly
10000 RValue's.
And then instead of marking the start of that heap block as live, or
the start of the RValue within which v is pointing to , we mark what
thing v is pointing to as live.
v = *x;
if (is_pointer_to_heap((void *)v)) {
gc_mark(v, 0);
That would be OK... except gc_mark (sometimes) modifies what v is
pointing to by flipping the FL_MARK bit. And then hunting down the
children of that object and marking them.
static void
gc_mark(ptr, lev)
VALUE ptr;
int lev;
{
register RVALUE *obj;
obj = RANY(ptr);
if (rb_special_const_p(ptr)) return; /* special const not marked */
if (obj->as.basic.flags == 0) return; /* free cell */
if (obj->as.basic.flags & FL_MARK) return; /* already marked */
obj->as.basic.flags |= FL_MARK;
All Good. EXCEPT if v is pointing into the middle of a 20 byte RValue
union instead of the start!
Ok, so 99.999% of random bitstrings like "n" picked up this way got
there because they were at some stage valid pointers to RValues BUT it
is NOT gauranteed.
ie. We have a very low probability bug that gc will treat the inner
part of a RValue as an RValue with possible bad results. (segfault,
corruption...)
Solution? Change the line...
gc_mark(v, 0);
to convert v to the nearest sizeof(RValue) aligned address at or below.
gc_mark(v - (v % sizeof(RVALUE)), 0);
Well, at least in this case of invoking gc_mark. I haven't reviewed
all the other invocations.
=end