URI#to_s can serialize to a value that doesn't deserialize to the original
It appears that when we're serializing a URI to string, we don't check/confirm that it is represented in a form that can be deserialized back into the original. I think it's fair to expect that serialize+deserialize produces an object that is the same as the original, only differing perhaps in terms of the unique object identifier. This isn't the case with URI when they are custom built, which might happen a lot, for example in a Rails app that accepts URL inputs from users. Let me attempt a reproduction, using the generic URI
example_url = "example.com"
example_uri = URI(example_url)
Given that no scheme is explicitly set in the URI, it is correctly parsed as generic, with the given
example.com interpreted as the path.
The object returned to is mutable. Since we didn't automatically detect a scheme, let's fix that as well as the hostname.
example_uri.scheme = "https"
example_uri.hostname = example_uri.path
# I've intentionally left path value unchanged, since it helps demonstrate the potential bug.
Given that we have a scheme, an authority, and a path, and given that we format URI according to RFC 3986, one may expect that serializing the URI to string will follow the guidelines of section 3 of the RFC: Syntax Components, which requires a slash separator between the authority (in our case hostname) and the path. It appears that
URI#to_s may not do that if path didn't already have a slash prefix. Which would be fine if we were keeping an invariant that ensured that we never produced bad serialized URI. To return to our
example_uri, serialization produces:
serialized_uri = example_uri.to_s
puts serialized_uri # https://example.comexample.com
This is obviously bad. One would have expected
https://example.com/example.com instead. That is, the slash will be automatically and correctly inserted, just as the double slashes were automatically inserted between the scheme and and the authority.
serialized_uri cannot be deserialized into
example_uri, in fact. Below is an attempt at deserialization and a comparison of the new value to the original:
deserialized_example_uri = URI(serialized_uri)
example_uri.scheme == deserialized_example_uri.scheme # true
example_uri.hostname == deserialized_example_uri.hostname # false (for, example.com =/= example.comexample.com)
example_uri.path == deserialized_example_uri.path # false (for, example.com =/= "")
I believe that the ability to serialize and deserialize an object without losing fidelity is a great thing. I believe even more strongly that we should preserve/maintain an invariant that allows us to always serialize a URI to a format that meets the RFC's specification. Therefore I consider this a bug, and I'd be willing to work on a fix, as my first contribution to Ruby, if enough people consider it a bug too.
Updated by jeremyevans0 (Jeremy Evans) 4 months ago
I agree this is a bug. There's a few ways to fix it:
- Convert the relative path to absolute path when host/port is set.
- Raise if path is relative when setting host/port.
- Change URI#to_s to add a slash for a relative path if there is a host/port.
- Make URI#to_s raise if there is a host/port and a relative path
Of these options, I think option 3 is best. The main problem with option 1 is that information is that uri.path changes. I think if the uri was set with a relative path, then uri.path should be relative. Raising an error is possible by calling
check_path in some additional methods, but it results in weird errors (invalid path even though you aren't trying to set a new path), and won't fix the issue for ftp, where relative paths are allowed. So I don't think option 2 is good either. Option 4 would work, but most code does not expect
to_s to raise.
I submitted a pull request upstream for option 3: https://github.com/ruby/uri/pull/90
Updated by knu (Akinori MUSHA) 4 months ago
It is not appropriate to parse a path string with URI() without specifying any scheme or base URI.
If you know the piece of text is a path in the HTTP protocol, you should say something like this:
url = URI.join("https:", "example.com")
p url #=> #<URI::HTTPS https:/example.com>
url.hostname = "foo"
p url #=> #<URI::HTTPS https://foo/example.com>
p url.to_s #=> "https://foo/example.com"