Project

General

Profile

Bug #6358

arm-linux : sleep() time dependent threading bug

Added by stevegoobermanhill (stephen gooberman-hill) almost 8 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 1.9.2p136 (2010-12-25 revision 30365) [arm-linux]
Backport:
[ruby-core:44621]

Description

Hi,
I have a multi-threading bug on a low-speed (180MHz) arm-linux platform. The bug can be consistantly produced or removed by varying the length of a sleep() interval at the end of a piece of code invoked in a Thread (the GServer#serve(io) method)

I am unable to completely isolate the bug, and the trace I get off it is intermittent, but when it occurs it terminates the program instantly. The debugging info that I get when the program fails is dependent on the sleep timer.

I have once seen a ruby fault, with a control frame backtrace (3 levels) showing a cfp consistancy fault (sorry - not sure which one, but send I think - my fault for losing the trace). However, I can consistently get a trace claiming a ThreadError

The lines referenced are marked in the code.

OK - the following piece of code (log statements removed) is taken from a class derived from GServer, and implements the #serve method. The line marked with the arrow (==>) at the bottom is critical in controlling the appearance of the bug.

The bug can be made to appear by running a sequence of unit tests to stress this piece of code (there are 5 test methods in the sequence)

  • If the sleep length is 0.01, the program (a test suite) will crash on the first test, with no backtrace or debugging info

  • If the sleep length is 0.1, or if the sleep length is 0.01 AND ruby is invoked with --debug then the program will crash at a variable point

    in the test sequence, giving the following trace
    Exception ThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88 - must be called with a block
    must be called with a block
    Exception
    ThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:118 - must be called with a block
    /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in initialize'
    /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in
    eof'
    /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in `serve'
    This sleep length has also (once) given some control frame information.

  • If the sleep length is 0.5, the test sequence will always run correctly.

I suspect that the issue is somewhere deep in the Thread handling code - possibly the ruby interpreter is not giving the underlying linux system enough time to clear up and reallocate threads; this results in the crash.

Hope this is of some help. I do have a workaround at this point (lengthen the timer to 0.5s), but this may be an issue on slower platforms.

Kind regards

Steve

====================== simplified code (/mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88) ===================
====================== lines referred to in error are commented ===================
====================== sleep statement which can trigger or remove the bug marked with ==> ===================

class TCPMultiServer < GServer
def serve(io)
begin
io.sync=true
io.fcntl(Fcntl::F_SETFL, io.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)

    while not stopped? 
      req=nil
      unless io.eof                                                           #line 88 (referenced in ThreadError)
        req = io.read(7) 

        destination=req.getbyte(6)
        multicast= ( destination == 0 ) #true if multicast message

        active_devices= multicast ? @devices.values : [@devices[destination]]
        active_devices.compact!

        tr = req[0,2]
        len = req[4,2].unpack('n')[0]
        req = io.read(len - 1) unless io.eof

        active_devices.each do |device|    
          pdu = device.serve(req)

          resp = tr + "\0\0" + (pdu.size + 1).to_word + device.uid.chr + pdu
          io.write resp
        end
      end

==> sleep(0.01)
end
rescue Exception => ex
STDERR.puts ex
raise ex #line 118 (referenced in ThreadError)
end
end
end


Related issues

Related to Ruby master - Bug #5957: pthreads not working on ulibc (linuxthreads)Rejected02/01/2012Actions

History

Updated by mame (Yusuke Endoh) almost 8 years ago

  • Status changed from Open to Feedback

At first, please try Ruby trunk.

If the problem still occurs, could you please create a patch yourself?
There is no arm maintainer unfortunately.

If your patch looks benign for other platforms, we may import it.

Thanks,

--
Yusuke Endoh mame@tsg.ne.jp

Updated by stevegoobermanhill (stephen gooberman-hill) almost 8 years ago

Hi Yusuke-san,
I'm not sure that I will be able to create a patch - you can blame Matz :-) If he hadn't written such a brilliant language I would still be competent in C. :-) As it is, while I can still competently read and understand C code, I don't think I have written any since about 2003.

Thinking about the problem a bit, my suspicion is that somewhere (either in ruby, or the pthreads library, or in the interaction between them), a thread is being reused before it is properly torn down, and this is causing the problem. A gross simplification of the code (which I have yet to try and use to reproduce the bug) is

timeout=0.01 #this is the critical value
while true
Thread.new do
Thread.new { sleep(timeout) }
end
end

My reasoning is that the bug may be quite deep as it generally exits without giving any backtrace, and is very dependent on the critical value. I don't know about the deep design of the ruby code or the pthreads library, but I am wondering whether it is possible that the there is a timer used to give the system time to tidy up before a thread is reused, and that this timer simply times out too quickly for the platform.

Are you aware of any similar problem on a slower (embedded) platform? It may be a problem that is specific to arm-linux, or to the ulibc pthreads library, or one that only turns up on slow processors.

In the mean time, I will see if I can produce a simple test case which demonstrates the problem. That would be a start.

FYI, I am putting a significant number of embedded arm-linux devices into the field this year (on the Techbase NPE series GPRS modem / industrial computer). I would be happy to continue to contribute bug reports - I have ruby 1.8.7 working happily on this device, but 1.9 is a little more challenging at the moment. Anything that I can contribute, I will do so.

Kind regards,

Steve

Updated by naruse (Yui NARUSE) almost 8 years ago

Could you write a reproducible code with C?
If you can, it is an issue of ulibc's pthread implementation.

The root of sleep(timeout) is pthread_cond_timedwait(3).
So I suspect it has a bug.

Updated by stevegoobermanhill (stephen gooberman-hill) almost 8 years ago

Hi Yusuke-san
Resolved - duplicated #5343

further investigation leads me to believe that this is actually a duplicate of Bug #5343 as I can consistently make that bug happen on the arm-linux platform. Therefore please close this bug.

However, the actual conditions that give rise to this bug are made more likely by the design of the GServer#serve method in the standard library. I will therefore open a separate bug report for the GServer issue - I have a patch which I believe improves the design of GServer

Kind regards

Steve

Updated by mame (Yusuke Endoh) about 7 years ago

  • Status changed from Feedback to Closed

Also available in: Atom PDF