Bug #14997

Updated by maciej.mensfeld (Maciej Mensfeld) over 5 years ago

Consider a case where a domain resolves to multiple IPs (4 in the following example):

 ``` 
 dig debug-xyz.elb.us-east-1.amazonaws.com a 

 ; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a 
 ;; global options: +cmd 
 ;; Got answer: 
 ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375 
 ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0 

 ;; QUESTION SECTION: 
 ;debug-xyz.elb.us-east-1.amazonaws.com. IN A 

 ;; ANSWER SECTION: 
 debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79 
 debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24 
 debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55 
 debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167 

 ;; Query time: 4 msec 
 ;; SERVER: 172.31.0.2#53(172.31.0.2) 
 ;; WHEN: Tue Aug 14 13:46:18 UTC 2018 
 ;; MSG SIZE    rcvd: 132 
 ``` 

 With such a domain, when `connect_timeout` is set to some value N, the timeout applies to each resolved address separately. Against non-responsive endpoints that don't raise an exception immediately, the overall connect time can therefore reach `N * 4`.

 This can disrupt some time-sensitive systems. 

 We've experienced it with the following setup: 

 - TCP server (EventMachine) behind an AWS NLB 
 - The TCP server process goes down behind the NLB, but the NLB itself is still responsive 
 - Socket `connect_timeout` is set to 100 ms 
 - The AWS NLB keeps the connection in a waiting state, hoping that the service behind it will recover (but it doesn't) 
 - Ruby times out after 100 ms 
 - Ruby tries to connect to the next IP from the pool (the AWS NLB again) 
 - Because the name resolves to 4 hosts, the overall timeout is 400 ms. 

 I'm not sure whether this should be classified as a bug or a feature, but I believe it should at least be documented, or there should be an option to enforce a "hard" overall limit. 

 Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664
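
 For illustration, a "hard" overall limit could be approximated in user code along these lines. This is only a minimal sketch, not how `Socket.tcp` works: `connect_within_deadline` and its `deadline:` parameter are hypothetical names, and the sketch assumes each address object responds to `connect(timeout:)` the way `Addrinfo#connect` does. Every attempt gets only the time still left in the shared budget, so the total cannot grow with the number of resolved IPs.

```ruby
require "socket"

# Hypothetical helper (not part of Ruby's socket library): try each address
# in turn, but never spend more than `deadline` seconds in total.
def connect_within_deadline(addrinfos, deadline:)
  expires_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + deadline
  last_error = nil

  addrinfos.each do |ai|
    # Each attempt may only use what is left of the overall budget.
    remaining = expires_at - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0

    begin
      return ai.connect(timeout: remaining)
    rescue SystemCallError => e
      # Remember the failure and move on to the next address.
      last_error = e
    end
  end

  raise(last_error || Errno::ETIMEDOUT.new("overall connect deadline exceeded"))
end
```

 Assuming the host from the `dig` output above and a placeholder port, it might be called as `connect_within_deadline(Addrinfo.getaddrinfo("debug-xyz.elb.us-east-1.amazonaws.com", 443, nil, :STREAM), deadline: 0.1)`. The trade-off is that addresses later in the list may never be tried if the earlier ones use up the whole budget.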
