Bug #14997
Updated by maciej.mensfeld (Maciej Mensfeld) over 6 years ago
Given a case, where a domain is being resolved to multiple IPs (4 in the following example):
```
dig debug-xyz.elb.us-east-1.amazonaws.com a
; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A
;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167
;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE rcvd: 132
```
and when `connect_timeout` is set to a certain value (N), the overall timeout upon non-responsive endpoints that don't immediately throw an exception can reach `N * 4`.
This can disrupt some time-sensitive systems.
We've experienced it with the following setup:
- TCP server (event machine) behind an AWS NLB
- TCP server process goes down behind NLB but NLB is still responsive
- Socket connect_timeout is set to 100ms
- AWS NLB keeps the connection in the waiting state hoping that the service behind it will get back to normal (but it doesn't)
- Ruby timeouts after 100ms
- Ruby tries to connect to the next IP from the pool (AWS NLB again)
- Due to 4 hosts resolving, the overall timeout is 400ms.
Not sure whether this should be qualified as a bug or a feature, but I believe it should be definitely documented or there should be an option to "hard" block this limit.
Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664