Bug #8241
Closed
If the URI host part contains an underscore ('_'), 'URI.parse' raises 'URI::InvalidURIError'
Description
First of all, I apologize if filing this issue here is inappropriate.
I was not sure where to report such a simple but critical problem.
The problem was found a long time ago (1 or 2 years).
The cause is clear and the solution is simple,
but I have been living with a monkey patch all this time.
If the URI host part contains an underscore ('_'), 'URI.parse' raises 'URI::InvalidURIError'.
ex)
=begin
require 'uri'
URI.parse 'http://test_strin.helo.com'
URI::InvalidURIError: the scheme http does not accept registry part: test_strin.helo.com (or bad hostname?)
from ... /.rbenv/versions/1.9.3-p125/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'

e=URI.parse('http://test_string.hello.com') rescue $!
=> #<URI::InvalidURIError: the scheme http does not accept registry part: test_string.hello.com (or bad hostname?)>
puts e.backtrace
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/generic.rb:214:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/http.rb:84:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `new'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `parse'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:747:in `parse'
vs
URI.parse('http://teststring.hello.com')
#<URI::HTTP:0x007fbf31c1a078 URL:http://teststring.hello.com>
=end
This problem is caused by the hostname regex pattern used by 'URI.split' in uri/common.rb:
https://bugs.ruby-lang.org/projects/ruby-trunk/repository/entry/lib/uri/common.rb#L368
( https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L368 )
=begin
[26] pry(main)> URI.split('http://teststring.hello.com')
=> ["http", nil, "teststring.hello.com", nil, nil, "", nil, nil, nil] # normal: host is at index 2
[27] pry(main)> URI.split('http://test_string.hello.com')
=> ["http", nil, nil, nil, "test_string.hello.com", "", nil, nil, nil] # wrong: host lands in the registry slot (index 4)
=end
Source position:
https://bugs.ruby-lang.org/projects/ruby-trunk/repository/entry/lib/uri/common.rb#L368
( https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L368 )
=begin
# hostname = *( domainlabel "." ) toplabel [ "." ]
# reg-name = *( unreserved / pct-encoded / sub-delims ) # RFC3986
unless hostname
ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\-.]|%\h\h)+"
end
=end
As the source comment shows, 'reg-name' in RFC 3986 is defined as '*( unreserved / pct-encoded / sub-delims )'.
The definition of 'unreserved' in RFC 3986 ( http://tools.ietf.org/html/rfc3986#section-2.3 ) is:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
But the hostname regex pattern allows only '-' and '.', and omits '_' and '~'.
Please check RFC 3986 and extend the hostname pattern to cover reg-name, as below.
=begin
ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\-._~]|%\h\h)+"
=end
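The difference between the two patterns can be checked outside of uri.rb with a minimal sketch like the one below. The character classes are copied from this report; the constant names, the helper method, and the \A...\z anchors are assumptions added only for this illustration and are not part of uri/common.rb.

```ruby
# Character classes copied from the report. The \A...\z anchors are an
# addition for this demo, so that the whole string must match.
OLD_HOSTNAME = /\A(?:[a-zA-Z0-9\-.]|%\h\h)+\z/     # current pattern
NEW_HOSTNAME = /\A(?:[a-zA-Z0-9\-._~]|%\h\h)+\z/   # proposed pattern

# Hypothetical helper: true when the whole host string fits the pattern.
def hostname_match?(pattern, host)
  !!(pattern =~ host)
end

hosts = ['teststring.hello.com', 'test_string.hello.com', 'test~string.hello.com']
hosts.each do |host|
  old_ok = hostname_match?(OLD_HOSTNAME, host)
  new_ok = hostname_match?(NEW_HOSTNAME, host)
  puts "#{host}: old=#{old_ok} new=#{new_ok}"
end
```

Only the hosts containing '_' or '~' should change from rejected to accepted; hosts that already matched are unaffected.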
Updated by neocoin (Sangmin Ryu) over 11 years ago
Updated by naruse (Yui NARUSE) over 11 years ago
uri.rb is currently based on RFC 2373, and a fix based on the URL spec is planned.
http://url.spec.whatwg.org/
Updated by neocoin (Sangmin Ryu) over 11 years ago
naruse (Yui NARUSE) wrote:
uri.rb is currently based on RFC 2373, and a fix based on the URL spec is planned.
http://url.spec.whatwg.org/
Thanks for the feedback.
RFC 2373 covers only IPv6 addressing. It does not include the whole URI definition.
( http://tools.ietf.org/html/rfc2373 )
So the RFC 3986-based comment in uri/common.rb is right. Please check.
Updated by naruse (Yui NARUSE) over 11 years ago
neocoin (Sangmin Ryu) wrote:
naruse (Yui NARUSE) wrote:
uri.rb is currently based on RFC 2373, and a fix based on the URL spec is planned.
http://url.spec.whatwg.org/
Thanks for the feedback.
RFC 2373 covers only IPv6 addressing. It does not include the whole URI definition.
( http://tools.ietf.org/html/rfc2373 )
So the RFC 3986-based comment in uri/common.rb is right. Please check.
Oops, I meant RFC 2396. http://www.ietf.org/rfc/rfc2396.txt
In RFC 2396, the host of the http scheme is defined in section 3.2.2, Server-based Naming Authority.
It says:
server = [ [ userinfo "@" ] hostport ]
userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," )
hostport = host [ ":" port ]
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
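The hostname rules quoted above can be transcribed into a regular expression to see why an underscore is rejected under RFC 2396. This is only a sketch of the grammar as quoted. It is not the pattern uri.rb actually uses, and all constant and method names are assumptions made for the example.

```ruby
# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
# toplabel    = alpha    | alpha    *( alphanum | "-" ) alphanum
# hostname    = *( domainlabel "." ) toplabel [ "." ]
ALPHANUM    = '[a-zA-Z0-9]'
DOMAINLABEL = "#{ALPHANUM}(?:[a-zA-Z0-9-]*#{ALPHANUM})?"
TOPLABEL    = "[a-zA-Z](?:[a-zA-Z0-9-]*#{ALPHANUM})?"
RFC2396_HOSTNAME = /\A(?:#{DOMAINLABEL}\.)*#{TOPLABEL}\.?\z/

# Hypothetical helper: true when host is a hostname under the quoted grammar.
def rfc2396_hostname?(host)
  !!(RFC2396_HOSTNAME =~ host)
end

puts rfc2396_hostname?('teststring.hello.com')
puts rfc2396_hostname?('test-string.hello.com')
puts rfc2396_hostname?('test_string.hello.com')
```

Neither domainlabel nor toplabel has a production that yields '_', so a host containing one cannot be a hostname under this grammar.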
Updated by neocoin (Sangmin Ryu) over 11 years ago
naruse (Yui NARUSE) wrote:
neocoin (Sangmin Ryu) wrote:
naruse (Yui NARUSE) wrote:
uri.rb is currently based on RFC 2373, and a fix based on the URL spec is planned.
http://url.spec.whatwg.org/
Thanks for the feedback.
RFC 2373 covers only IPv6 addressing. It does not include the whole URI definition.
( http://tools.ietf.org/html/rfc2373 )
So the RFC 3986-based comment in uri/common.rb is right. Please check.
Oops, I meant RFC 2396. http://www.ietf.org/rfc/rfc2396.txt
In RFC 2396, the host of the http scheme is defined in section 3.2.2, Server-based Naming Authority.
It says:
server = [ [ userinfo "@" ] hostport ]
userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," )
hostport = host [ ":" port ]
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
Yes, you are right. I also checked RFC 2396 (published in August 1998), which is referenced in the comments of 'uri/common.rb'.
That document is the starting point for the generic URI syntax.
Then, in January 2005, RFC 3986 was published by the same authors as RFC 2396.
(See also http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Refinement_of_specifications )
As a result, RFC 3986 is the current standard.
I think many web service companies (e.g. DDNS providers, or blog companies that give each user a private address) treat RFC 3986 as the standard.
When I built a web crawler in Ruby, I saw that second-level domains (the 'google' part of google.com) generally do not contain
an underscore or a tilde. I know that DNS hosting services do not permit an underscore in a second-level domain.
But many third-level domains do contain an underscore (the 'hello_world' part of hello_world.google.com).
So I checked the URI spec in RFC 3986 several years ago, and that is why I posted this issue.
The following can be found in http://tools.ietf.org/html/rfc3986#appendix-A
Appendix A. Collected ABNF for URI
...
host = IP-literal / IPv4address / reg-name
...
reg-name = *( unreserved / pct-encoded / sub-delims )
...
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
See also:
Python's urlparse module also takes RFC 3986 into account.
http://docs.python.org/2/library/urlparse.html
Updated by naruse (Yui NARUSE) over 10 years ago
- Related to Bug #9974: Regression: URI.parse allows invalid URIs added
Updated by coldnebo (Larry Kyrala) almost 8 years ago
Here is an unobtrusive workaround using the documented capabilities of URI:
module URI
DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?")
end
I also shared this on stackoverflow (http://stackoverflow.com/a/41048816/555187) because of the number of obtrusive patches floating around there.
Updated by naruse (Yui NARUSE) almost 8 years ago
- Status changed from Open to Closed
URI was upgraded to RFC 3986 in Ruby 2.2.
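A quick way to confirm the new behavior is a check like the following. It is a minimal sketch that assumes Ruby 2.2 or later with the default parser left untouched; the URL is the one from the original report.

```ruby
require 'uri'

# With the RFC 3986-based default parser, an underscore in the host
# no longer raises URI::InvalidURIError.
uri = URI.parse('http://test_string.hello.com')
puts uri.class
puts uri.host
```

On older Ruby versions the same call still raises the error shown in the description.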