URI.parse does not validate components
URI.parse will happily return a
URI::HTTPS object, even though it has an invalid component and cannot be constructed using URI::HTTPS.build.
This is because the parser uses the undocumented initializer, which defaults to not validating the components. I would suggest either passing that initializer the flag that enables validation or having the parser use the build method instead.
Updated by jeremyevans0 (Jeremy Evans) 4 months ago
- Status changed from Open to Rejected
- File uri-parse-validate-15979.patch added
This is not a bug, and not related to validation. The reason for the behavior is that
URI.parse uses an RFC 3986 parser, while
URI::HTTPS.build uses an RFC 2396 parser. If you use
URI::HTTPS.new with an RFC 3986 parser and specify to validate the components, you get a valid URI:
URI::HTTPS.new( *URI::RFC3986_PARSER.split( "https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?"), URI::RFC3986_PARSER, true)
The issue here is that the hostname you provide in the URI is invalid in RFC 2396 but valid in RFC 3986.
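A minimal sketch of the difference, assuming a Ruby whose bundled uri library provides both the URI::RFC2396_Parser class and the URI::RFC3986_PARSER constant. The parse_outcome helper is hypothetical and exists only for this illustration; it reports either the class of the parsed URI or the class of the URI error that was raised.

```ruby
require "uri"

# The URI from this issue, copied verbatim.
issue_uri = "https://-._~%2C!$&'()*+,;=:@-._~%2C!$&'()*+,;=:/foo?/-._~%2C!$&'()*+,;=:@/?"

# Hypothetical helper: returns the class name of the parsed URI,
# or the class name of the URI error raised while parsing.
def parse_outcome(parser, string)
  parser.parse(string).class.name
rescue URI::Error => e
  e.class.name
end

outcomes = {
  rfc2396: parse_outcome(URI::RFC2396_Parser.new, issue_uri),
  rfc3986: parse_outcome(URI::RFC3986_PARSER, issue_uri),
}
outcomes.each { |name, result| puts "#{name}: #{result}" }
```

Passing the parser explicitly keeps the comparison independent of whichever parser a given Ruby version uses by default.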
RFC 2396 ABNF:
host        = hostname | IPv4address
hostname    = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
RFC 3986 ABNF:
host        = IP-literal / IPv4address / reg-name
reg-name    = *( unreserved / pct-encoded / sub-delims )
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
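The two grammars can be compared directly by transcribing them into regular expressions. This is only a sketch: the method names are made up for illustration, the regexes are hand-written from the ABNF above rather than taken from the uri library, and the IPv4address and IP-literal alternatives are left out.

```ruby
# RFC 2396: hostname = *( domainlabel "." ) toplabel [ "." ]
def rfc2396_hostname?(host)
  domainlabel = /[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?/
  toplabel    = /[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?/
  /\A(?:#{domainlabel}\.)*#{toplabel}\.?\z/.match?(host)
end

# RFC 3986: reg-name = *( unreserved / pct-encoded / sub-delims )
def rfc3986_reg_name?(host)
  /\A(?:[A-Za-z0-9\-._~]|%\h\h|[!$&'()*+,;=])*\z/.match?(host)
end

# An ordinary hostname, plus the host part of the URI from this issue.
issue_host = "-._~%2C!$&'()*+,;="
["example.com", issue_host].each do |host|
  puts "#{host.inspect}: RFC 2396 hostname? #{rfc2396_hostname?(host)}, " \
       "RFC 3986 reg-name? #{rfc3986_reg_name?(host)}"
end
```

An ordinary hostname satisfies both grammars; only RFC 3986 also admits hosts built from punctuation and percent-escapes.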
With the URI provided, the host is
-._~%2C!$&'()*+,;=, which is valid according to the RFC 3986 ABNF:
-   : unreserved
.   : unreserved
_   : unreserved
~   : unreserved
%2C : pct-encoded
!   : sub-delims
$   : sub-delims
&   : sub-delims
'   : sub-delims
(   : sub-delims
)   : sub-delims
*   : sub-delims
+   : sub-delims
,   : sub-delims
;   : sub-delims
=   : sub-delims
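The breakdown above can be reproduced mechanically. The following is a rough sketch with a hypothetical classify_reg_name method: it splits a host into single characters and percent-escapes, then labels each token with the RFC 3986 rule it matches.

```ruby
# Label every token of a host with the RFC 3986 reg-name rule it matches.
# A token is either a percent-escape ("%" plus two hex digits) or one character.
def classify_reg_name(host)
  host.scan(/%\h\h|./).map do |token|
    kind =
      case token
      when /\A[A-Za-z0-9\-._~]\z/ then "unreserved"
      when /\A%\h\h\z/            then "pct-encoded"
      when /\A[!$&'()*+,;=]\z/    then "sub-delims"
      else                             "not allowed in reg-name"
      end
    [token, kind]
  end
end

classify_reg_name("-._~%2C!$&'()*+,;=").each do |token, kind|
  puts "#{token} : #{kind}"
end
```

A character outside the three rules, such as ":", is reported as not allowed, which is why the colon ends the host and starts the (empty) port in the URI from this issue.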
As to why RFC 3986 is used in some places (parse/join/split) and RFC 2396 is used in all other places, I believe it is related to backwards compatibility. Previously, there were some issues with
] not being allowed in query parts in RFC 3986 (#10402), but those are now worked around. However,
URI::RFC2396_Parser and URI::RFC3986_Parser are not API compatible, so you cannot simply swap one for the other without breaking things.
In case you or someone else is interested in changing the default parser, attached is a minimal patch to make the RFC 3986 parser the default. It passes the URI tests, but I haven't done any testing beyond that. Hopefully it provides a decent starting point.