Bug #7135

GC bug in Ruby 1.9.3-p194?

Added by Alex Dowad over 1 year ago. Updated over 1 year ago.

[ruby-core:47881]
Status:Closed
Priority:Normal
Assignee:-
Category:-
Target version:-
ruby -v:ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] Backport:

Description

I'm just doing some refactoring/performance work on a popular Ruby gem called Prawn (it's used for PDF generation). I'm fighting with a strange, intermittent failure on the spec tests, and from my experimentation so far, it seems very, very likely to be a bug in Ruby's garbage collector.

I'll try to keep this as brief as possible, but please be patient...

The code where the intermittent failure comes from measures the width of a string when rendered using a TTF font. This is part of the method body (including my debug print statements):

      # GC.disable
      # string.freeze
      p string.bytes.to_a if $my_debug
      p string.codepoints.to_a if $my_debug
      p scale if $my_debug
      result = string.codepoints.inject(0) do |s,r|
        print r if $my_debug
        print "," if $my_debug
        s + character_width_by_code(r)
      end * scale
      puts if $my_debug
      result

When the tests pass normally (which is about 7/8 of the time), the debug print statements show:

[104, 101, 108, 108, 111, 194, 173]
[104, 101, 108, 108, 111, 173]
0.012
104,101,108,108,111,173,

...You can see that the "print" calls in the "string.codepoints.inject" loop print the same series of codepoints as "p string.codepoints.to_a". This is what you would expect, because nothing is modifying the string. But about 1/8 of the time I get:

[104, 101, 108, 108, 111, 194, 173]
[104, 101, 108, 108, 111, 173]
0.012
104,42,0,0,0,0,0,

I have also seen "104,0,0,0,0,0" on occasion. In all cases, "p string.codepoints.to_a" prints the correct sequence of codepoints for the string.

You might think that something in the "string.codepoints.inject" loop is modifying the string, but it's not. I could show the contents of "characterwidthby_code", but it would just be wasting your time, because it basically contains nothing but a couple of hash lookups.

If I uncomment "string.freeze", I can run the tests 100 times or more with no failure. (This proves that the string is not being modified by my code, because it would throw an exception otherwise!) Or, if I change the code to "string.codepoints.to_a.inject", again, the failure never happens. Most revealingly, if I uncomment "GC.disable", I can run the test 100 or more times with no failure. As soon as I comment out "GC.disable", the random failure comes back, for about 1/7 - 1/10 of runs.

Sometimes I also get another random failure from the same place: an "invalid codepoint in UTF-8" exception from "string.codepoints.inject". Again, this proves something inside Ruby is corrupting the string, because the call to "string.codepoints" just 2 lines before prints the correct sequence, with no exception raised.

I'd like to boil this down to a smaller example which demonstrates the failure, but it's a hopeless task. When I take pieces of the code and run them in irb, the failure never happens. Even rebooting the computer may make it go away... but then, when I am working on Prawn again, sooner or later it happens again. (I know because similar intermittent failures have happened before in the past.) Once it starts happening, though, it's pretty consistent at about 1/7 - 1/10 of test runs. Another clue is that the corrupted codepoints are always zero.

I can try to track down the problem, perhaps by adding some logging code to the Ruby interpreter source and recompiling, but I need some guidance on where to look. Can anyone who is familiar with Ruby internals (especially Strings and the GC) give me some ideas how to start?

Associated revisions

Revision 37143
Added by Nobuyoshi Nakada over 1 year ago

string.c: GC guard

  • string.c (rbstrsub{seq,pos,str}, rbstreach_{line,codepoint}): prevent String copies from GC. [Bug #7135]

History

#1 Updated by Nobuyoshi Nakada over 1 year ago

  • Status changed from Open to Feedback

I can't reproduce it.

#2 Updated by Alex Dowad over 1 year ago

Nobuさん, I don't expect that you (or anyone else) would be able to reproduce this bug. As I said, it doesn't happen when I extract the part which is failing from Prawn, only when I run the tests against the whole thing (which I have modified -- I'm working on performance). This is not strange -- in general, memory corruption/pointer bugs are sensitive to the exact layout of data in memory, and changing small things in a program may randomly turn the bug on or off.

As I said, I can try to dig deeper and diagnose the bug myself, but I need some advice on where to add "debug" code to the Ruby source (so I can recompile, run the code which is failing, and try to get more information on what is actually happening).

To sum up the problem again, I have Ruby Strings which are randomly being overwritten (although nothing at the Ruby level is modifying them), and it only happens when the GC runs. Actually, I just discovered that if I put a call to "GC.start" in the "string.codepoints.inject" loop, the error happens every time. UNLESS I freeze the string -- then the error never happens, even with "GC.start" in the loop.

A few questions for someone who knows Ruby internals well:
- When Ruby GCs an unused object, does it zero out the memory used?
- How about when a new object is allocated?
- I've heard that Ruby stores the contents of small strings directly in an RObject (or RValue or whatever it is...) union. The String which is being corrupted has 7 bytes. Will a String like that always be embedded, or is it possible that it could still use malloc'd memory for the contents?
- In the tests which I am doing right now, it always seems that byte 0 is untouched, byte 1 is changed to 1, and bytes 2-6 are changed to 0. Do those values seem familiar? Is there a different type of object which can go in the same union, which would set those particular bytes?

#3 Updated by Eric Wong over 1 year ago

"alexdowad (Alex Dowad)" alexinbeijing@gmail.com wrote:

Nobuさん, I don't expect that you (or anyone else) would be able to reproduce this bug. As I said, it doesn't happen when I extract the part which is failing from Prawn, only when I run the tests against the whole thing (which I have modified -- I'm working on performance). This is not strange -- in general, memory corruption/pointer bugs are sensitive to the exact layout of data in memory, and changing small things in a program may randomly turn the bug on or off.

Does this happen with unmodified Prawn at all?

I'm not familiar with Prawn, but does any of its dependencies pull in
extra C extension which may have memory corruption bugs?

Can you share your work-in-progress changes to Prawn?

  • When Ruby GCs an unused object, does it zero out the memory used?

No

  • How about when a new object is allocated?

Yes.

  • I've heard that Ruby stores the contents of small strings directly in an RObject (or RValue or whatever it is...) union. The String which is being corrupted has 7 bytes. Will a String like that always be embedded, or is it possible that it could still use malloc'd memory for the contents?

It's possible to use malloc'ed memory for short string contents
(string capacity can be larger than length)

#4 Updated by Alex Dowad over 1 year ago

Does this happen with unmodified Prawn at all?

Good question. I haven't spent a lot of time repeatedly running the spec tests for "unmodified" Prawn. Generally when I run some tests, it's because I'm contributing a patch to the gem, and I want to make sure I haven't broken anything.

I can tell you, though, that a few weeks ago, when I was working on a completely unrelated patch to Prawn, I also started getting an intermittent "invalid byte sequence in UTF-8" error when I was testing against Ruby 1.9.2. When I was tracing the error, I found the same thing with String#codepoints returning inconsistent results. I discovered that freezing the String made the problem "go away", and did a little reading of the Ruby source (which revealed that String#codepoints seems to treat frozen strings specially). It never occurred to me at the time that the problem might have anything to do with the GC, and I didn't pursue it further until now.

I'm not familiar with Prawn, but does any of its dependencies pull in
extra C extension which may have memory corruption bugs?

No. The core team has committed to never using binary gems, only pure Ruby.

Can you share your work-in-progress changes to Prawn?

Do you really want to spend a few hours or days of your life helping to track down an obscure memory bug? If so, I'll push the code I am working on to GitHub.

At this point I think I have already got as much information as I can from Ruby-land -- I need to crack Ruby open and drop down into C-land. Right now I'm mainly fishing for information and ideas which will help when I do that...

#5 Updated by Alex Dowad over 1 year ago

OK, I found a couple more significant things:

  1. I can reproduce the problem on Ruby 1.9.2 and 1.9.3, but never 1.8.7.
  2. I tried putting calls to "print string.bytes.toa" and "print string.codepoints.toa" inside the "string.codepoints.inject" loop. They always print the correct sequence of values, even when the codepoints being passed in to the "inject" block are incorrect!

Could the GC be moving the String, leaving the Enumerator returned by #codepoints pointing to an unused block of memory?

#6 Updated by Alex Dowad over 1 year ago

More information:

When I restructure the code to avoid using an Enumerator, like this:

      s = 0
      string.codepoints do |r|
        GC.start if $my_debug
        if $my_debug
          print r, "(", string.codepoints.to_a.inspect, "),"
        end
        s += character_width_by_code(r)
      end
      result = s * scale

...the problem still occurs under Ruby 1.9.2 and 1.9.3.

#7 Updated by Eric Wong over 1 year ago

"alexdowad (Alex Dowad)" alexinbeijing@gmail.com wrote:

Eric Wong normalperson@yhbt.net wrote:

I'm not familiar with Prawn, but does any of its dependencies pull in
extra C extension which may have memory corruption bugs?

No. The core team has committed to never using binary gems, only
pure Ruby.

That's good to hear.

Can you share your work-in-progress changes to Prawn?

Do you really want to spend a few hours or days of your life helping
to track down an obscure memory bug? If so, I'll push the code I am
working on to GitHub.

Me specifically? No, at least not yet :)

However, there's always a chance somebody can help you if given the
right information.

Since this doesn't seem to be caused by a buggy C extension, perhaps it
is a bug which can affect non-Prawn users, too. In that case, more
people will be willing to help you track this problem down.

#8 Updated by Alex Dowad over 1 year ago

I patched the Ruby interpreter source and recompiled, but I'm having trouble using the resulting binary. The problem is that I don't want to run "make install", because I don't want to mess up my system's setup. But without "make install" a lot of stuff doesn't seem to work. Any suggestions on how to compile and use a patched version of Ruby without installing it as the system-wide default Ruby?

#9 Updated by Alex Dowad over 1 year ago

OK, I overcame the problem with compiling and testing a patched Ruby binary. When I comment out line 6229 of string.c, the problem goes away. Then when I uncomment the line and recompile Ruby, the problem comes back again. This is the code (it's for String#codepoints):

static VALUE
rbstreachcodepoint(VALUE str)
{
VALUE orig = str;
int n;
unsigned int c;
const char *ptr, *end;
rb
encoding *enc;

if (single_byte_optimizable(str)) return rb_str_each_byte(str);
RETURN_ENUMERATOR(str, 0, 0);
str = rb_str_new4(str); /* I think problem is here */
ptr = RSTRING_PTR(str);
end = RSTRING_END(str);
enc = STR_ENC_GET(str);
while (ptr < end) {
c = rb_enc_codepoint_len(ptr, end, &n, enc);
rb_yield(UINT2NUM(c));
ptr += n;
}
return orig;

}

Line 6229 copies the String, so that #codepoints won't get messed up if someone modifies it while iterating.

#10 Updated by Alex Dowad over 1 year ago

OK, I have established beyond all doubt that the contents of the String are being overwritten -- not the original String, but the frozen copy which #codepoints makes internally. Additionally, the overwriting definitely happens when the GC runs.

Question: does the Ruby GC look for Object references by scanning the C-level stack?

#11 Updated by Nobuyoshi Nakada over 1 year ago

Hi,

At Thu, 11 Oct 2012 08:32:23 +0900,
alexdowad (Alex Dowad) wrote in :

OK, I overcame the problem with compiling and testing a
patched Ruby binary. When I comment out line 6229 of
string.c, the problem goes away. Then when I uncomment the
line and recompile Ruby, the problem comes back again. This
is the code (it's for String#codepoints):

Thank you. Could you try this patch?

diff --git i/string.c w/string.c
index 9281e4c..6707c4b 100644
--- i/string.c
+++ w/string.c
@@ -6332,6 +6332,7 @@ rbstreachcodepoint(VALUE str)
rb
yield(UINT2NUM(c));
ptr += n;
}
+ RBGCGUARD(str);
return orig;
}

--
Nobu Nakada

#12 Updated by Nobuyoshi Nakada over 1 year ago

Hi,

At Thu, 11 Oct 2012 10:08:51 +0900,
alexdowad (Alex Dowad) wrote in :

Question: does the Ruby GC look for Object references by
scanning the stack?

Of course yes, but recent compilers often optimize out those
variables.

--
Nobu Nakada

#13 Updated by Alex Dowad over 1 year ago

YEE-HA!!! I think I may have nailed it!!!

I believe that my compiler was storing the pointer to the frozen string copy in a register, rather than on the stack, so the garbage collector couldn't find any references to it. But even after the frozen copy was GC'd, #codepoints still had C pointers into the middle of its data. After creating a new "volatile" local, and storing the pointer in there, the problem hasn't occurred again.

I'm going to do more testing to try to confirm that this is true. If it is, I'll submit a patch for the interpreter. Question: can I submit patches through GitHub pull requests? Or is it necessary to use svn?

#14 Updated by Alex Dowad over 1 year ago

Hi Nobuさん,

I just saw your messages after posting. Has the patch you showed already been applied to edge Ruby?

#15 Updated by Nobuyoshi Nakada over 1 year ago

Hi,

At Thu, 11 Oct 2012 10:33:16 +0900,
alexdowad (Alex Dowad) wrote in :

I just saw your messages after posting. Has the patch you
showed already been applied to edge Ruby?

Not yet. I'll apply it if it fixes the bug.

--
Nobu Nakada

#16 Updated by Alex Dowad over 1 year ago

  • RBGCGUARD(str);

This also fixes the problem. I looked on GitHub, and it looks like this patch hasn't been applied to the newest version of the Ruby source... I'll submit a pull request.

Thanks to @nobu and @normalperson for your help!!!

#17 Updated by Alex Dowad over 1 year ago

Not yet. I'll apply it if it fixes the bug.

I'd prefer to submit my own PR, if it's OK with you. It would somehow make staying up until 4AM to debug this problem seem worthwhile...

#18 Updated by Nobuyoshi Nakada over 1 year ago

Hi,

At Thu, 11 Oct 2012 11:03:31 +0900,
alexdowad (Alex Dowad) wrote in :

I'd prefer to submit my own PR, if it's OK with you. It would
somehow make staying up until 4AM to debug this problem seem
worthwhile...

I see. In which timezone 4AM?

--
Nobu Nakada

#19 Updated by Alex Dowad over 1 year ago

I see. In which timezone 4AM?

That's 4AM Zambian time... right now I'm serving as a volunteer in Zambia (and doing Rails-related consulting on the side to support myself and my wife).

#21 Updated by Nobuyoshi Nakada over 1 year ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r37143.
Alex, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


string.c: GC guard

  • string.c (rbstrsub{seq,pos,str}, rbstreach_{line,codepoint}): prevent String copies from GC. [Bug #7135]

Also available in: Atom PDF