Bug #14997

Socket connect timeout exceeds the timeout value for

Added by maciej.mensfeld (Maciej Mensfeld) over 1 year ago. Updated 5 months ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:88500]

Description

Consider a case where a domain resolves to multiple IPs (4 in the following example):

dig debug-xyz.elb.us-east-1.amazonaws.com a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A

;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167

;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132

When connect_timeout is set to some value N, the overall wait on non-responsive endpoints that don't immediately raise an exception can reach N * 4, because the timeout is applied to each resolved address separately.

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

  • TCP server (event machine) behind an AWS NLB
  • TCP server process goes down behind NLB but NLB is still responsive
  • Socket connect_timeout is set to 100ms
  • AWS NLB keeps the connection in the waiting state hoping that the service behind it will get back to normal (but it doesn't)
  • Ruby times out after 100ms
  • Ruby tries to connect to the next IP from the pool (AWS NLB again)
  • Due to 4 hosts resolving, the overall timeout is 400ms.
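
To make the arithmetic above concrete, here is a minimal model of a connect loop that tries each resolved address in turn with the same timeout. It is a sketch for illustration only, not the code from socket.rb: the method name `sequential_connect_ms` and the `attempt` callable are hypothetical, and no real network calls are made.

```ruby
# Simplified model of a sequential connect loop. This is NOT the socket.rb
# code; `attempt` is a hypothetical stand-in for a per-address connect call.
# It must return [ok, spent_ms] for the given address and timeout.
def sequential_connect_ms(addresses, connect_timeout_ms, attempt)
  elapsed_ms = 0
  addresses.each do |ip|
    ok, spent_ms = attempt.call(ip, connect_timeout_ms)
    elapsed_ms += spent_ms
    return [ip, elapsed_ms] if ok
  end
  [nil, elapsed_ms] # every address failed
end

# The four A records from the dig output above.
ADDRESSES = %w[172.31.86.79 172.31.109.24 172.31.119.55 172.31.71.167].freeze

# Models an endpoint that neither accepts nor refuses the connection:
# each attempt consumes the full timeout before giving up.
UNRESPONSIVE = ->(_ip, timeout_ms) { [false, timeout_ms] }

connected_ip, total_ms = sequential_connect_ms(ADDRESSES, 100, UNRESPONSIVE)
puts "connected: #{connected_ip.inspect}, total wait: #{total_ms}ms"
```

With a 100ms timeout and four unresponsive addresses, the model accumulates 4 * 100ms = 400ms of waiting, which matches the behavior described in the list.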

I'm not sure whether this qualifies as a bug or a feature, but I believe it should definitely be documented, or there should be an option to enforce the timeout as a "hard" limit across all attempts.

Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664


Related issues

Related to Ruby master - Feature #15553: Addrinfo.getaddrinfo supports timeout (Closed)

History

#1

Updated by maciej.mensfeld (Maciej Mensfeld) over 1 year ago

  • Description updated (diff)

Updated by maciej.mensfeld (Maciej Mensfeld) about 1 year ago

If anyone is willing to confirm that this is indeed unwanted / unexpected behavior, I offer to fix it.

It could be fixed by tracking how much of the time "pool" has already been used and lowering the timeout value accordingly for each subsequent attempt. That would guarantee that the total never exceeds the configured timeout.

I think this is the most elegant solution.
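
A rough sketch of that idea, assuming a single deadline shared by all attempts. This is not a patch against socket.rb: `connect_with_budget`, the `attempt` callable, and the injectable `clock` are hypothetical names introduced here so the logic can be exercised without a network.

```ruby
# Sketch of the "time pool" approach: one overall deadline for all attempts.
# `attempt` is a hypothetical callable taking (ip, timeout) and returning a
# socket-like object on success or nil on failure. `clock` is injectable for
# testing and defaults to the monotonic clock.
def connect_with_budget(addresses, total_timeout, attempt,
                        clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  deadline = clock.call + total_timeout
  addresses.each do |ip|
    remaining = deadline - clock.call
    break if remaining <= 0 # budget exhausted, do not start another attempt
    socket = attempt.call(ip, remaining) # each attempt only gets what is left
    return socket if socket
  end
  nil
end

# Demonstration with a fake clock (milliseconds as integers, no real waiting).
now = 0
fake_clock = -> { now }
timeouts_seen = []
always_times_out = lambda do |_ip, timeout|
  timeouts_seen << timeout
  now += timeout # the attempt consumes its whole timeout
  nil
end

connect_with_budget(%w[a b c d], 100, always_times_out, clock: fake_clock)
puts "timeouts handed out: #{timeouts_seen.inspect}, elapsed: #{now}"
```

In the demonstration the first attempt uses the whole 100ms budget, so the remaining three addresses are never tried and the total stays at the configured limit. With real sockets, `attempt` could plausibly wrap something like `Addrinfo.tcp(ip, port).connect(timeout: remaining)` and return nil when a `SystemCallError` is raised, though that wiring is an assumption and is not shown here.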

Updated by tenderlovemaking (Aaron Patterson) 5 months ago

This really sounds like a bug to me. Please make a patch and I will apply it.

#4

Updated by Glass_saga (Masaki Matsushita) 4 months ago

  • Related to Feature #15553: Addrinfo.getaddrinfo supports timeout added
