Bug #8241

If the URI host part contains an underscore ('_'), 'URI.parse' raises 'URI::InvalidURIError'

Added by neocoin (Sangmin Ryu) almost 11 years ago. Updated about 7 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin11.4.2]
[ruby-core:54138]

Description

First of all, I apologize if filing this issue here is inappropriate.
I wasn't sure where to report this simple but critical issue.
I found this problem a long time ago (1 or 2 years).
The problem is very clear and the solution is very simple,
so I have been waiting all this time with a monkey patch in place.

If the URI host part contains an underscore ('_'), 'URI.parse' raises 'URI::InvalidURIError'.

Example:
=begin

require 'uri'
URI.parse 'http://test_strin.helo.com'
URI::InvalidURIError: the scheme http does not accept registry part: test_strin.helo.com (or bad hostname?)
from ... /.rbenv/versions/1.9.3-p125/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'

e=URI.parse('http://test_string.hello.com') rescue $!
=> #<URI::InvalidURIError: the scheme http does not accept registry part: test_string.hello.com (or bad hostname?)>
puts e.backtrace
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/generic.rb:214:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/http.rb:84:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `new'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `parse'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:747:in `parse'

vs

URI.parse('http://teststring.hello.com')
#<URI::HTTP:0x007fbf31c1a078 URL:http://teststring.hello.com>
=end

The problem is caused by the hostname regex pattern that 'URI.split' uses, defined in uri/common.rb:

https://bugs.ruby-lang.org/projects/ruby-trunk/repository/entry/lib/uri/common.rb#L368
( https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L368 )

=begin
[26] pry(main)> URI.split('http://teststring.hello.com')
=> ["http", nil, "teststring.hello.com", nil, nil, "", nil, nil, nil] # normal: the host is the third element

[27] pry(main)> URI.split('http://test_string.hello.com')
=> ["http", nil, nil, nil, "test_string.hello.com", "", nil, nil, nil] # wrong: the host lands in the fifth (registry) element
=end

Source position (same location as linked above):

=begin
# hostname = *( domainlabel "." ) toplabel [ "." ]
# reg-name = *( unreserved / pct-encoded / sub-delims ) # RFC3986
unless hostname
  ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-.]|%\\h\\h)+"
end
=end
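
To make the URI.split output above easier to read, here is a minimal sketch that labels each of the nine slots. The field order follows the uri library documentation; the constant `SPLIT_FIELDS` and the helper `labeled_split` are names made up for this illustration. Which slot an underscore host ends up in depends on the Ruby version, so the sketch only prints the result for a plain hostname.

```ruby
require 'uri'

# Field order of the array returned by URI.split.
# SPLIT_FIELDS and labeled_split are hypothetical names for this sketch.
SPLIT_FIELDS = %i[scheme userinfo host port registry path opaque query fragment].freeze

# Pair each field name with the value URI.split produced for it.
def labeled_split(uri_string)
  SPLIT_FIELDS.zip(URI.split(uri_string)).to_h
end

parts = labeled_split('http://teststring.hello.com')
puts "host:     #{parts[:host].inspect}"
puts "registry: #{parts[:registry].inspect}"
```

In the failing case reported here, the hostname shows up under `registry` instead of `host`, which is what the "does not accept registry part" message refers to.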

As the source comment shows, 'reg-name' in RFC 3986 is '*( unreserved / pct-encoded / sub-delims )'.

And 'unreserved' is defined in RFC 3986 ( http://tools.ietf.org/html/rfc3986#section-2.3 ) as:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

But the hostname regex pattern allows only '-' and '.'; it omits '_' and '~'.

Please check RFC 3986 and extend the hostname pattern to cover reg-name, as below.

=begin
ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"
=end
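
The effect of the proposed change can be checked in isolation, without touching uri/common.rb. The sketch below anchors both patterns and tests them against whole hostnames. The constants `OLD_HOSTNAME` and `NEW_HOSTNAME` and the helper `full_match?` are illustrative names, not part of the uri library, and the backslashes are doubled on the assumption that the patterns live in double-quoted Ruby strings.

```ruby
# Patterns as quoted in this report (hypothetical constant names).
OLD_HOSTNAME = "(?:[a-zA-Z0-9\\-.]|%\\h\\h)+"
NEW_HOSTNAME = "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"

# True when the pattern matches the entire host string.
def full_match?(pattern, host)
  !!(Regexp.new("\\A#{pattern}\\z") =~ host)
end

['teststring.hello.com', 'test_string.hello.com'].each do |host|
  puts "#{host}: old=#{full_match?(OLD_HOSTNAME, host)} new=#{full_match?(NEW_HOSTNAME, host)}"
end
```

Only the character class changes, so every hostname the old pattern accepted is still accepted by the new one.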


Files

edit_hostname_pattern.patch (152 Bytes) - patch file, neocoin (Sangmin Ryu), 04/09/2013 09:05 PM

Related issues 1 (0 open, 1 closed)

Related to Ruby master - Bug #9974: Regression: URI.parse allows invalid URIs (Rejected)

Updated by naruse (Yui NARUSE) almost 11 years ago

uri.rb is currently based on RFC 2373, and we plan to fix it based on the URL spec.
http://url.spec.whatwg.org/

Updated by neocoin (Sangmin Ryu) almost 11 years ago

Thanks for the feedback.

RFC 2373 covers only IPv6 addressing. It doesn't include the whole URI definition.
( http://tools.ietf.org/html/rfc2373 )

So the RFC 3986 based comment in uri/common.rb is right. Please check.

Updated by naruse (Yui NARUSE) almost 11 years ago

Oops, it is RFC 2396. http://www.ietf.org/rfc/rfc2396.txt

In RFC 2396, the host of the http scheme is defined in section 3.2.2, Server-based Naming Authority.
It says:

  server        = [ [ userinfo "@" ] hostport ]
  userinfo      = *( unreserved | escaped |
                     ";" | ":" | "&" | "=" | "+" | "$" | "," )
  hostport      = host [ ":" port ]
  host          = hostname | IPv4address
  hostname      = *( domainlabel "." ) toplabel [ "." ]
  domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
  toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
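
As a rough illustration of why this grammar rejects underscores, the `hostname`, `domainlabel`, and `toplabel` rules above can be transcribed into a Ruby regex. This is a sketch for reading the ABNF, not the pattern uri.rb actually uses; all constant and method names are hypothetical.

```ruby
# Direct transcription of the RFC 2396 rules quoted above (hypothetical names).
ALPHANUM    = "[a-zA-Z\\d]"
# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = "#{ALPHANUM}(?:(?:#{ALPHANUM}|-)*#{ALPHANUM})?"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL    = "[a-zA-Z](?:(?:#{ALPHANUM}|-)*#{ALPHANUM})?"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
RFC2396_HOSTNAME = Regexp.new("\\A(?:#{DOMAINLABEL}\\.)*#{TOPLABEL}\\.?\\z")

def rfc2396_hostname?(host)
  !!(RFC2396_HOSTNAME =~ host)
end

%w[teststring.hello.com hello-world.example.com test_string.hello.com].each do |host|
  puts "#{host}: #{rfc2396_hostname?(host)}"
end
```

Labels may contain only letters, digits, and interior hyphens under these rules, so any label with '_' fails to match.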

Updated by neocoin (Sangmin Ryu) almost 11 years ago

Yes, you are right. I also checked RFC 2396 (published in August 1998), following the comments in 'uri/common.rb'.
That document is the starting point for the generic URI syntax.
Then, in January 2005, RFC 3986 was published by the RFC 2396 co-authors.
(See also http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Refinement_of_specifications )
As a result, RFC 3986 is the current standard.

I think many web service companies (e.g. DDNS providers, or blog companies that give each user a private address) treat RFC 3986 as the standard.

When I wrote a web crawler in Ruby, I saw that second-level domains (the 'google' part of google.com) generally don't contain
an underscore or tilde. I know DNS hosting services don't permit underscores in second-level domains.
But many third-level domains do contain an underscore (the 'hello_world' part of hello_world.google.com).

So I checked the URI spec in RFC 3986 several years ago, and now I am posting this issue.

The following is from http://tools.ietf.org/html/rfc3986#appendix-A :

Appendix A. Collected ABNF for URI
...
host = IP-literal / IPv4address / reg-name
...
reg-name = *( unreserved / pct-encoded / sub-delims )
...
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

See also:
Python's urlparse module takes RFC 3986 into account.
http://docs.python.org/2/library/urlparse.html

Updated by naruse (Yui NARUSE) almost 10 years ago

  • Related to Bug #9974: Regression: URI.parse allows invalid URIs added

Updated by coldnebo (Larry Kyrala) over 7 years ago

Here is an unobtrusive workaround using the documented capabilities of URI:

module URI
  DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?")
end

I also shared this on stackoverflow (http://stackoverflow.com/a/41048816/555187) because of the number of obtrusive patches floating around there.
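
A variation on the same idea is to build a separate parser instance and leave `URI::DEFAULT_PARSER` alone, which avoids reassigning a constant. The sketch below assumes Ruby 2.2 or later, where the RFC 2396 parser class is available as `URI::RFC2396_Parser`; `UNDERSCORE_HOSTNAME` and `LENIENT_PARSER` are names chosen for this example.

```ruby
require 'uri'

# Hostname pattern copied from the workaround above: RFC 2396 labels plus '_'.
UNDERSCORE_HOSTNAME =
  "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*" \
  "(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?"

# A dedicated parser instance (hypothetical name); the default parser is untouched.
LENIENT_PARSER = URI::RFC2396_Parser.new(:HOSTNAME => UNDERSCORE_HOSTNAME)

uri = LENIENT_PARSER.parse('http://test_string.hello.com')
puts uri.host
```

Callers that need underscore hosts use `LENIENT_PARSER.parse` explicitly, while the rest of the program keeps the stock `URI.parse` behavior.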

Updated by naruse (Yui NARUSE) about 7 years ago

  • Status changed from Open to Closed

URI was upgraded to RFC 3986 in Ruby 2.2.
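
A quick way to confirm this on a given installation is to parse the URL from the original report. This is a minimal check, assuming Ruby 2.2 or later; on older versions the same call raises 'URI::InvalidURIError' as described above.

```ruby
require 'uri'

# The URL from the original report; expected to parse on Ruby 2.2 and later.
uri = URI.parse('http://test_string.hello.com')
puts uri.class
puts uri.host
```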
