arm-linux : sleep() time dependent threading bug
I have a multi-threading bug on a low-speed (180MHz) arm-linux platform. The bug can be consistantly produced or removed by varying the length of a sleep() interval at the end of a piece of code invoked in a Thread (the GServer#serve(io) method)
I am unable to completely isolate the bug, and the trace I get off it is intermittent, but when it occurs it terminates the program instantly. The debugging info that I get when the program fails is dependent on the sleep timer.
I have once seen a ruby fault, with a control frame backtrace (3 levels) showing a cfp consistancy fault (sorry - not sure which one, but send I think - my fault for losing the trace). However, I can consistently get a trace claiming a ThreadError
The lines referenced are marked in the code.
OK - the following piece of code (log statements removed) is taken from a class derived from GServer, and implements the #serve method. The line marked with the arrow (==>) at the bottom is critical in controlling the appearance of the bug.
The bug can be made to appear by running a sequence of unit tests to stress this piece of code (there are 5 test methods in the sequence)
If the sleep length is 0.01, the program (a test suite) will crash on the first test, with no backtrace or debugging info
If the sleep length is 0.1, or if the sleep length is 0.01 AND ruby is invoked with --debug then the program will crash at a variable point
in the test sequence, giving the following trace
ThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88 - must be called with a blockThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:118 - must be called with a block
must be called with a block
This sleep length has also (once) given some control frame information.
If the sleep length is 0.5, the test sequence will always run correctly.
I suspect that the issue is somewhere deep in the Thread handling code - possibly the ruby interpreter is not giving the underlying linux system enough time to clear up and reallocate threads; this results in the crash.
Hope this is of some help. I do have a workaround at this point (lengthen the timer to 0.5s), but this may be an issue on slower platforms.
====================== simplified code (/mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88) ===================
====================== lines referred to in error are commented ===================
====================== sleep statement which can trigger or remove the bug marked with ==> ===================
class TCPMultiServer < GServer
io.fcntl(Fcntl::F_SETFL, io.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)
while not stopped? req=nil unless io.eof #line 88 (referenced in ThreadError) req = io.read(7) destination=req.getbyte(6) multicast= ( destination == 0 ) #true if multicast message active_devices= multicast ? @devices.values : [@devices[destination]] active_devices.compact! tr = req[0,2] len = req[4,2].unpack('n') req = io.read(len - 1) unless io.eof active_devices.each do |device| pdu = device.serve(req) resp = tr + "\0\0" + (pdu.size + 1).to_word + device.uid.chr + pdu io.write resp end end
rescue Exception => ex
raise ex #line 118 (referenced in ThreadError)