Project

General

Profile

Actions

Bug #13624

closed

MinGW - TestIO#test_copy_stream_no_busy_wait - new failure

Added by MSP-Greg (Greg L) almost 5 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.5.0dev (2017-06-02 trunk 58998) [x64-mingw32]
[ruby-core:81534]

Description

Today's MinGW build had a new failure, one that I don't recall seeing before.

Code is (located here):

def test_copy_stream_no_busy_wait
  msg = 'r58534 [ruby-core:80969] [Backport #13533]'
  IO.pipe do |r,w|
    r.nonblock = true
    assert_cpu_usage_low(msg, pct: 0.11) do
      th = Thread.new { IO.copy_stream(r, IO::NULL) }
      sleep 0.1
      w.close
      th.join
    end
  end
end

Failure (I show skips, this is the 4th of 7 failures):

 88) Failure:
TestIO#test_copy_stream_no_busy_wait [E:/GitHub/ruby/test/ruby/test_io.rb:539]:
r58534 [ruby-core:80969] [Backport #13533].
Expected 0.01599999999996271 to be <= 0.013757898662006482.

From the assert assert_cpu_usage_low, it seems like a test that might be kind of arbitrary, especially when a user (or the OS) might be doing another task at the exact same time. Also, I'm not quite sure how 0.11 relates to the two numbers shown in the failure...

Again, I've never seen it before, and I'm ignoring it.

Updated by normalperson (Eric Wong) almost 5 years ago

wrote:

Bug #13624: MinGW - TestIO#test_copy_stream_no_busy_wait - new failure
https://bugs.ruby-lang.org/issues/13624

Today's MinGW build had a new failure, one that I don't recall seeing before.

Yes, it is a new test which already had some problems with
lower clock resolution systems [ruby-core:81427]

I admit it is a tricky test; and made trickier by the desire
to have shorter tests in our test suite.

  1. Failure:
    TestIO#test_copy_stream_no_busy_wait [E:/GitHub/ruby/test/ruby/test_io.rb:539]:
    r58534 [ruby-core:80969] [Backport #13533].
    Expected 0.01599999999996271 to be <= 0.013757898662006482.

Curious, do you know what the kernel timer resolution is on your
system?

For example; old Linux was always 100 Hz, but modern defaults
are 250 Hz (via CONFIG_HZ in the kernel build), and 300 Hz and
1000 Hz are allowed for people who need higher resolution (for
audio or other low-latency purpose).

Also, r58935 changed Process.times to use the getrusage syscall
for better resolution on Linux and *BSD; but maybe that is a
regression on Win32...

However; I notice gc.c::getrusage_time has a _WIN32
implementation which does not use getrusage; so I also wonder if
you can try getting Process.times to use that.

From the assert assert_cpu_usage_low, it seems like a test that might be kind of arbitrary, especially when a user (or the OS) might be doing another task at the exact same time. Also, I'm not quite sure how 0.11 relates to the two numbers shown in the failure...

I tried to increase the number to work with 100 Hz kernels and
100ms sleep. But maybe that is not enough for other kernels.

I'll probably change it to the actual assert method to have
a lower bound for time; since writing a short test to test
CPU usage is challenging (and long tests are annoying...)

Again, I've never seen it before, and I'm ignoring it.

I will add a minimum bound to assert_cpu_usage_low; but
I also suggest you take a look at the Process.times
implementation on Win32 to see if that can be improved.

Updated by MSP-Greg (Greg L) almost 5 years ago

Eric,

Thanks for the response. Reminder (which I try to mention from time to time) - I'm not a c type, and I essentially just use Windows.

Curious, do you know what the kernel timer resolution is on your system?

No idea. But this might be implying about 15 ms per, so about 67 Hz.

Sorry, I try to check for commits related to new test failures, I missed 58936 from 6 days ago. Conversely, I've been running daily builds, and today was the first error.

I don't use the system when building, but since I've got console test output piped to a file, sometimes I might open Windows (file) explorer to check the log size to see how far along it is. I suspect I timed it just right today...

Summing up, I haven't tried to see if I can replicate the failure. If it's helpful, if I patch pct: 0.11 up to 0.20 or something, then run tests as normal for several days, I can report back as to whether the failure occurs again.

Or, I could just ignore the failure.

Updated by normalperson (Eric Wong) almost 5 years ago

wrote:

Thanks for the response. Reminder (which I try to mention from time to time) - I'm not a c type, and I essentially just use Windows.

Maybe you'll become a "c type" anyways :)
I never set out to be, either, or even any sort of developer;
just kept editing+retrying until stuff stopped breaking.

Curious, do you know what the kernel timer resolution is on your system?

No idea. But this might be implying about 15 ms per, so about 67 Hz.

Thanks, I made r59003 based on that info.

Your original message was around 15ms, so that is definitely
inline with what's documented, there.

Please let us know if it still fails again. Thanks.

And yeah, maybe you can figure out a way to improve
Process.times for other Windows users.

Updated by MSP-Greg (Greg L) almost 5 years ago

Eric,

Thanks for the patch. I'll build tomorrow, isolate the test, then run it a few hundred times...

And yeah, maybe you can figure out a way to improve Process.times for other Windows users.

After I learn c, I'll get right on that. Right now, I've got some old code (I wrote, haven't looked at for a few years) that now has the following error (I have to use Ruby 2.3, embedded in another app, calling an MSIE html control):

Error: #<FiberError: fiber called across stack rewinding barrier>

Google doesn't return a whole lot of hits...

Updated by MSP-Greg (Greg L) almost 5 years ago

Sorry for the delay. Ran the test with and without -j, several hundred runs, no failures.

Please close.

Thank you.

Actions #6

Updated by vo.x (Vit Ondruch) almost 5 years ago

  • Status changed from Open to Closed
Actions

Also available in: Atom PDF