Bug #20172

closed

Socket.addrinfo failing randomly

Added by mwaldvogel (Michael Waldvogel) 4 months ago. Updated 3 months ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
[ruby-core:116129]

Description

I recently updated one of my Linux systems (Gentoo) to glibc 2.38 (that was the only change). Since the update, the error below occurs most of the time. Among other things, this breaks rubygems for me. I reinstalled ruby 3.2.2 with rvm and didn't encounter the issue there. However, the issue remained after reinstalling ruby 3.3.0, and it also occurs with ruby master. Since this goes back to getaddrinfo (which works without any issues outside of ruby) and there seems to be only one larger change to the stdlib socket library, I assume the problem was introduced with https://bugs.ruby-lang.org/issues/19965

3.3.0 :001 > require 'socket'
 => true
3.3.0 :002 > Socket.getaddrinfo('rubygems.org', 443)
(irb):2:in `getaddrinfo': getaddrinfo: Temporary failure in name resolution (Socket::ResolutionError)
        from (irb):2:in `<main>'
        from <internal:kernel>:187:in `loop'
        from /usr/local/rvm/rubies/ruby-3.3.0/lib/ruby/gems/3.3.0/gems/irb-1.11.0/exe/irb:9:in `<top (required)>'
        from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `load'
        from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `<main>'
3.3.0 :003 > Socket.getaddrinfo('rubygems.org', 443)
(irb):3:in `getaddrinfo': getaddrinfo: Temporary failure in name resolution (Socket::ResolutionError)
        from (irb):3:in `<main>'
        from <internal:kernel>:187:in `loop'
        from /usr/local/rvm/rubies/ruby-3.3.0/lib/ruby/gems/3.3.0/gems/irb-1.11.0/exe/irb:9:in `<top (required)>'
        from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `load'
        from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `<main>'
3.3.0 :004 > Socket.getaddrinfo('rubygems.org', 443)
 =>
[["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 1, 6],
 ["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 2, 17],
 ["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 3, 0],
...
 ["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 1, 6],
 ["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 2, 17],
 ["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 3, 0]]
3.3.0 :005 >
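To quantify how often the lookup fails rather than retrying by hand in irb, a small tally helper can be used. This is only a sketch: the method name `resolution_tally` and the `attempts` parameter are made up for illustration, and it rescues SocketError on the assumption that Socket::ResolutionError (shown in the trace above) is a subclass of it.

```ruby
require 'socket'

# Hypothetical helper: call the resolver `attempts` times and count
# how many calls succeed and how many raise a resolution error.
# The resolver defaults to Socket.getaddrinfo; a block can replace it.
def resolution_tally(host, port, attempts: 20, &resolver)
  resolver ||= ->(name, service) { Socket.getaddrinfo(name, service) }
  tally = { ok: 0, failed: 0 }
  attempts.times do
    begin
      resolver.call(host, port)
      tally[:ok] += 1
    rescue SocketError
      # Socket::ResolutionError is assumed to inherit from SocketError.
      tally[:failed] += 1
    end
  end
  tally
end

# Example (performs real DNS lookups):
#   p resolution_tally('rubygems.org', 443)
```

On an affected machine the `:failed` count would be expected to be well above zero; on a healthy one it should stay at zero.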

Related issues 1 (0 open, 1 closed)

Related to Ruby master - Feature #19965: Make the name resolution interruptible (Closed, mame (Yusuke Endoh))

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

  • Related to Feature #19965: Make the name resolution interruptible added

Updated by mame (Yusuke Endoh) 4 months ago

Yeah, it is probably due to the change in #19965. I cannot debug it soon because I don't have a Gentoo environment. I suspect pthread_create is somehow failing. Does 10000.times { Thread.new {}.join } work successfully on your machine?

Updated by mame (Yusuke Endoh) 4 months ago

Incidentally, our Arch Linux CI also uses glibc 2.38, and it is working fine. So I guess that either it is a Gentoo-specific problem or your machine is so heavily loaded that pthread_create fails.

Updated by mwaldvogel (Michael Waldvogel) 4 months ago

We can at least rule out heavy load as the cause. I will give you access to one of the VMs by tomorrow. That should make it easier to analyze.

Updated by mwaldvogel (Michael Waldvogel) 4 months ago

mame (Yusuke Endoh) wrote in #note-2:

Yeah, it is probably due to the change in #19965. I cannot debug it soon because I don't have a Gentoo environment. I suspect pthread_create is somehow failing. Does 10000.times { Thread.new {}.join } work successfully on your machine?

Yes, 10000.times { Thread.new {}.join } works without any problems.
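For anyone repeating this check, the one-liner can be wrapped so that failures are counted instead of aborting on the first error. This is a sketch, assuming that a failed thread creation surfaces in Ruby as ThreadError; the method name `failed_thread_creations` is made up for illustration.

```ruby
# Hypothetical helper: create and join `count` short-lived threads and
# return how many of them could not be created.
def failed_thread_creations(count)
  failures = 0
  count.times do
    begin
      Thread.new {}.join
    rescue ThreadError
      # Assumed to be what Ruby raises when the underlying
      # thread creation fails.
      failures += 1
    end
  end
  failures
end

puts failed_thread_creations(10_000)
```

As the reply above shows, this check passing does not rule out the getaddrinfo problem, since it only exercises ordinary Ruby thread creation.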

Updated by mame (Yusuke Endoh) 4 months ago

I investigated the issue using the VM access Michael gave me. (Thank you!) I now understand the issue.

It looks like sched_getcpu(3) returns an unexpected number in this environment. Since the VM has 2 CPUs, I expect it to return 0 or 1. However, it actually returns 0 or 123. This makes pthread_create fail with EINVAL because of an invalid affinity configuration.

TBH, I don't know why sched_getcpu(3) returns a strange value, but I guess it may depend on the configuration of the virtual environment.
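A rough way to look for the same symptom from Ruby is to sample sched_getcpu(3) and flag ids outside the expected range. This is a sketch under several assumptions: both method names are made up for illustration, the sampler relies on Fiddle being installed and on the C library exporting sched_getcpu (it returns an empty list otherwise), and Etc.nprocessors may reflect the process's affinity mask rather than the machine's CPU count, so a non-empty result is a hint and not proof.

```ruby
require 'etc'

# Hypothetical helper: return the sampled CPU ids that fall outside
# the range 0...cpu_count, without duplicates.
def out_of_range_cpu_ids(samples, cpu_count)
  samples.uniq.reject { |id| (0...cpu_count).cover?(id) }.sort
end

# Hypothetical sampler: call sched_getcpu(3) through Fiddle `times` times.
# Returns [] if Fiddle or the symbol is unavailable on this platform.
def sample_current_cpu(times = 1000)
  require 'fiddle'
  address = Fiddle.dlopen(nil)['sched_getcpu']
  sched_getcpu = Fiddle::Function.new(address, [], Fiddle::TYPE_INT)
  Array.new(times) { sched_getcpu.call }
rescue LoadError, StandardError
  []
end

p out_of_range_cpu_ids(sample_current_cpu, Etc.nprocessors)
```

With the values reported above (ids 0 and 123 on a 2-CPU VM), `out_of_range_cpu_ids([0, 123], 2)` would flag 123.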

I decided to remove the setaffinity mechanism and confirmed that it solves the issue: https://github.com/ruby/ruby/pull/9479
I introduced the mechanism to reduce the overhead of thread context switches, but a quick benchmark showed that removing it doesn't seem to degrade performance. So I'd like to simply delete the troublesome code.


Updated by mame (Yusuke Endoh) 4 months ago

  • Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED

Updated by mame (Yusuke Endoh) 4 months ago

  • Status changed from Open to Closed

Applied in changeset git|1bd98c820da46a05328d2d53b8f748f28e7ee8f7.


Remove setaffinity of pthread for getaddrinfo

It looks like sched_getcpu(3) returns a strange number on some
(virtual?) environments.

I decided to remove the setaffinity mechanism because the performance
does not appear to degrade on a quick benchmark even if removed.

[Bug #20172]

Updated by ioquatix (Samuel Williams) 3 months ago

For reference, I had a user report a similar issue due to Addrinfo#ip_address: https://github.com/socketry/falcon/issues/217

Updated by naruse (Yui NARUSE) 3 months ago

  • Backport changed from 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: DONE

ruby_3_3 53d4e9c4bbba077a569549a01a8263e5e8f59ee8 merged revision(s) 1bd98c820da46a05328d2d53b8f748f28e7ee8f7.
