Bug #15979
URI.parse does not validate components
Description
URI.parse("https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?")
happily returns a URI::HTTPS object, even though it has an invalid component and cannot be constructed using URI::HTTPS.build.
This is because the parser uses the undocumented initializer, which by default does not validate the components. I would suggest either passing that initializer the flag that enables validation, or having the parser use the build method instead.
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
- Status changed from Open to Rejected
- File uri-parse-validate-15979.patch added
This is not a bug, and not related to validation. The reason for the behavior is that URI.parse uses an RFC 3986 parser, while URI::HTTPS.build uses an RFC 2396 parser. If you use URI::HTTPS.new with an RFC 3986 parser and specify to validate the components, you get a valid URI:
require "uri"

URI::HTTPS.new(
  *URI::RFC3986_PARSER.split(
    "https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?"),
  URI::RFC3986_PARSER, true)
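To make the splat in the call above easier to follow, here is a small sketch of what URI::RFC3986_PARSER.split returns and how the array lines up with the positional arguments of the initializer. The example URI and the local variable names are assumptions chosen for illustration; they are not the URI from the report.

```ruby
require "uri"

# Positional order of the components returned by split, which is also the
# order URI::Generic#initialize (and subclasses such as URI::HTTPS) expects.
component_names = %i[scheme userinfo host port registry path opaque query fragment]

# A deliberately simple URI, used here only for illustration.
parts = URI::RFC3986_PARSER.split("https://user@example.com:8443/foo?bar#baz")

# Pair each component name with the value the parser extracted.
labeled = component_names.zip(parts).to_h
labeled.each { |name, value| puts format("%-9s %p", name, value) }
```

The registry and opaque slots are nil for a hierarchical URI like this one. The remaining two arguments in the call above (the parser and true) are passed after the nine components.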
The issue here is that the hostname you provide in the URI is invalid in RFC 2396 but valid in RFC 3986.
RFC 2396 ABNF:
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
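As a rough check, the RFC 2396 hostname rules above can be transcribed into a regular expression. This is a hand-written sketch of the ABNF, not the pattern the uri library uses internally, and the variable names are made up for this example.

```ruby
# Hand-written transcription of the RFC 2396 hostname rules quoted above.
alphanum    = "[A-Za-z0-9]"
domainlabel = "#{alphanum}(?:(?:#{alphanum}|-)*#{alphanum})?"
toplabel    = "[A-Za-z](?:(?:#{alphanum}|-)*#{alphanum})?"
rfc2396_hostname = /\A(?:#{domainlabel}\.)*#{toplabel}\.?\z/

# Host component of the URI from this report.
reported_host = "-._~%2C!$&'()*+,;="

puts rfc2396_hostname.match?("example.com")  # an ordinary hostname matches
puts rfc2396_hostname.match?(reported_host)  # the reported host does not
```

The reported host already fails on its first character, because a label must begin with an alphanumeric character (and the top label with a letter).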
RFC 3986 ABNF:
host = IP-literal / IPv4address / reg-name
reg-name = *( unreserved / pct-encoded / sub-delims )
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
With the URI provided, the host is -._~%2C!$&'()*+,;=, which is valid according to the RFC 3986 ABNF:
- : unreserved
. : unreserved
_ : unreserved
~ : unreserved
%2C : pct-encoded
! : sub-delims
$ : sub-delims
& : sub-delims
' : sub-delims
( : sub-delims
) : sub-delims
* : sub-delims
+ : sub-delims
, : sub-delims
; : sub-delims
= : sub-delims
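The classification above can be reproduced with a few regular expressions transcribed from the RFC 3986 ABNF. This is a sketch for illustration only; the patterns and variable names are assumptions and are not taken from the uri library.

```ruby
# Hand-written transcriptions of the RFC 3986 rules quoted above.
unreserved  = /[A-Za-z0-9\-._~]/
pct_encoded = /%\h\h/
sub_delims  = /[!$&'()*+,;=]/
reg_name    = /\A(?:#{unreserved}|#{pct_encoded}|#{sub_delims})*\z/

# Host component of the URI from this report.
reported_host = "-._~%2C!$&'()*+,;="

# Split into tokens: a percent-encoded triplet counts as one token,
# everything else is a single character.
tokens = reported_host.scan(/%\h\h|./)

classes = tokens.map do |tok|
  kind =
    case tok
    when /\A#{pct_encoded}\z/ then "pct-encoded"
    when /\A#{unreserved}\z/  then "unreserved"
    when /\A#{sub_delims}\z/  then "sub-delims"
    else "not allowed"
    end
  [tok, kind]
end

classes.each { |tok, kind| puts "#{tok} : #{kind}" }
puts reg_name.match?(reported_host)
```

Every token falls into one of the three allowed classes, so the whole string matches reg-name.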
As to why RFC 3986 is used in some places (parse/join/split) and RFC 2396 in all other places, I believe it is related to backwards compatibility. Previously, there were some issues with [ and ] not being allowed in query parts in RFC 3986 (#10402), but those are now worked around. However, URI::RFC2396_Parser and URI::RFC3986_Parser are not API compatible, so you cannot simply swap one for the other without breaking things.
In case you or someone else is interested in changing the default parser, attached is a minimal patch to make the RFC 3986 parser the default. It passes the URI tests, but I haven't done any testing beyond that. Hopefully it provides a decent starting point.