Project

General

Profile

Actions

Bug #20021

closed

TestGCCompact#test_moving_hashes_down_size_pools is flaky

Added by kjtsanaktsidis (KJ Tsanaktsidis) 6 months ago. Updated 5 months ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:115483]

Description

The test TestGCCompact#test_moving_hashes_down_size_pools is flaky is, I believe, flaky, and sometimes fails with this message:

  1) Failure:
TestGCCompact#test_moving_hashes_down_size_pools [/home/chkbuild/chkbuild/tmp/build/20231112T183003Z/ruby/test/ruby/test_gc_compact.rb:442]:
Expected 499 to be >= 500.

I started looking at this because when https://github.com/ruby/ruby/pull/8858 was merged, rubyci began failing on Ubuntu 22.04 with this message (https://rubyci.s3.amazonaws.com/ubuntu2204/ruby-master/log/20231112T183003Z.fail.html.gz). I was able to reproduce this failure after very carefully reproducing the ruibyci testing environment (same EC2 instance type, and I actually had to install exactly the same kernel version as well!).

After debugging the test, what I discovered is that the last hash that was modified in the test is actually on the machine stack when GC.verify_compaction_references is called, which means it gets pinned and cannot be moved; thus, the test fails because only 499 out of 500 objects got moved! The hash's VALUE is "on the stack" in the sense that it's in a memory location below %rsp, but it is NOT actually live in any C local variable. The stack address I found the VALUE in is part of the frame of gc_start, but gc_start obviously doesn't hold a reference to the hash from the test. Rather, it's left-behind in uninitialized memory.

I think this is made more likely because of how GCC wound up compiling the gc_start function. It seems to have allocated quite a lot of stack space - 440 bytes of it - and very little of it seems to be used for local variables. Thus, a lot of this memory just contains assorted tidbits from previous VM execution state.

This patch works around the issue, at least on my testbench, by running the code which touches the test hashes in a separate fiber. Each fiber gets its own machine stack, so references to the to-be-moved hashes shouldn't leak into the main stack, and this fiber's stack should be out of scope when the compaction is run, so it should get freed. https://github.com/ruby/ruby/pull/9040

The other approach I considered is counting the number of pinned objects, and asserting that (moved + pinned) >= 500 instead of just moved >= 500. However, it's very likely that there will be a pinned hash (or string, etc, for the other, similar test cases), so this would make the test pass even when it should not, I think.

Also, since this these tests were skipped on Solaris because of this bug, I have unskipped them (although it seems we no longer have solaris on ruby ci, so I guess it doesn't matter).

Updated by mame (Yusuke Endoh) 5 months ago

  • Status changed from Open to Closed
Actions

Also available in: Atom PDF

Like0
Like0