Given a case where a domain resolves to multiple IPs (4 in the following example):

dig a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

; IN A

;; ANSWER SECTION:
60 IN A
60 IN A
60 IN A
60 IN A

;; Query time: 4 msec
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132

and when connect_timeout is set to a certain value (N), the total time spent connecting can reach N * 4 if the endpoints are unresponsive and don't immediately raise an exception, because the timeout is applied to each address separately.

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

  • TCP server (EventMachine) behind an AWS NLB
  • The TCP server process behind the NLB goes down, but the NLB itself is still responsive
  • Socket connect_timeout is set to 100ms
  • The AWS NLB keeps the connection in a waiting state, in case the service behind it recovers (it doesn't)
  • Ruby times out after 100ms
  • Ruby tries to connect to the next IP from the pool (the AWS NLB again)
  • Because the domain resolves to 4 addresses, the overall timeout is 400ms.
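
To illustrate the sequence above, here is a minimal simulation of a connect loop that applies the timeout to each address separately. It is only a model, not the actual socket code: `connect_per_address`, the `unresponsive` lambda and the `ip-N` placeholders are hypothetical names, and the unresponsive endpoint is simulated with `sleep` instead of a real connection.

```ruby
# Hypothetical model of a connect loop that applies the timeout per address.
# `attempt` stands in for a real connect call; it is not Ruby's socket API.
def connect_per_address(addresses, connect_timeout, &attempt)
  last_error = nil
  addresses.each do |addr|
    begin
      return attempt.call(addr, connect_timeout)
    rescue Errno::ETIMEDOUT => e
      last_error = e # remember the failure, move on to the next address
    end
  end
  raise(last_error || Errno::ETIMEDOUT)
end

tried = []
# Simulated unresponsive endpoint: blocks for the whole timeout, then fails.
unresponsive = lambda do |addr, timeout|
  tried << addr
  sleep(timeout)
  raise Errno::ETIMEDOUT
end

addresses = %w[ip-1 ip-2 ip-3 ip-4] # placeholders for the 4 resolved IPs
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  connect_per_address(addresses, 0.1, &unresponsive)
rescue Errno::ETIMEDOUT
  # every address timed out
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts format("tried %d addresses in %.2fs", tried.size, elapsed)
```

With a 100ms timeout and 4 addresses, the caller waits roughly 4 times the configured value before seeing the error.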

I'm not sure whether this qualifies as a bug or a feature, but I believe it should definitely be documented, or there should be an option to enforce a hard limit on the total connect time.

Here's the code actually responsible for this behavior:



If someone can confirm that this is indeed unwanted or unexpected behavior, I'm happy to fix it.

It could be fixed by tracking how much of the time budget has already been used and lowering the timeout for each subsequent attempt accordingly. That would guarantee that the total connect time never exceeds the configured timeout.

I think this is the most elegant solution.
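
A rough sketch of the budget-tracking idea, under the same assumptions as a simulation rather than a patch: `connect_with_budget` and the `unresponsive` lambda are hypothetical names, and the endpoint is again simulated with `sleep`. A single deadline is computed up front, and each attempt only gets whatever time is left.

```ruby
# Hypothetical sketch: one time budget shared by all addresses.
# `attempt` stands in for a real connect call; it is not Ruby's socket API.
def connect_with_budget(addresses, total_timeout, &attempt)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + total_timeout
  last_error = nil
  addresses.each do |addr|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0 # budget used up, do not try further addresses
    begin
      return attempt.call(addr, remaining)
    rescue Errno::ETIMEDOUT => e
      last_error = e
    end
  end
  raise(last_error || Errno::ETIMEDOUT)
end

budget_tried = []
# Simulated unresponsive endpoint: blocks for the time it is given, then fails.
unresponsive = lambda do |addr, timeout|
  budget_tried << addr
  sleep(timeout)
  raise Errno::ETIMEDOUT
end

addresses = %w[ip-1 ip-2 ip-3 ip-4] # placeholders for the 4 resolved IPs
budget_started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  connect_with_budget(addresses, 0.1, &unresponsive)
rescue Errno::ETIMEDOUT
  # the shared budget ran out
end
budget_elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - budget_started
puts format("gave up after %.2fs", budget_elapsed)
```

The trade-off is that a slow first address can consume the whole budget, so later addresses may never be tried. Splitting the budget evenly across addresses would be an alternative, at the cost of a shorter timeout per attempt.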

