Bug #8352

closed

URI squeezes a sequence of slashes in merging paths when it shouldn't

Added by knu (Akinori MUSHA) almost 11 years ago. Updated over 6 years ago.

Status: Closed
Target version:
ruby -v: ruby 2.1.0dev (2013-05-01 trunk 40540) [x86_64-freebsd9]
Backport:
[ruby-core:54729]

Description

Neither RFC 2396 (on which the library is currently based) nor RFC 3986 says anything about a sequence of slashes in the path part, except for the parsing rules that apply when a URI (path) starts with two slashes.

It should be perfectly valid to have a slash right after another, and there is no reason to "normalize" a sequence of slashes into a single slash, which uri actually does in merging paths:

URI.parse('http://example.com/foo//bar/')+'.'
=> #<URI::HTTP:0x0000080303d2b0 URL:http://example.com/foo/bar/>

Fixing this may be as easy as changing the regexp in URI::Generic#split_path from %r{/+} to %r{/}, but I wonder how large the impact of the resulting incompatibility would be.
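To make the difference between the two regexps concrete, here is a minimal sketch using plain String#split. The helper names are hypothetical and this is not the library's actual source; it only shows why %r{/+} loses empty segments while %r{/} keeps them.

```ruby
# Hypothetical helpers for illustration only; not the code in uri/generic.rb.

# A run of slashes is treated as ONE separator, so empty segments vanish.
def split_squeezing(path)
  path.split(%r{/+}, -1)
end

# Every slash is a separator, so empty segments survive as "".
def split_preserving(path)
  path.split(%r{/}, -1)
end

path = '/foo//bar/'

squeezed  = split_squeezing(path)   # ["", "foo", "bar", ""]
preserved = split_preserving(path)  # ["", "foo", "", "bar", ""]

puts squeezed.join('/')   # /foo/bar/
puts preserved.join('/')  # /foo//bar/
```

The limit of -1 is used in both helpers so that the trailing empty string (the trailing slash) is kept; without it, String#split drops trailing empty fields.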


Related issues 2 (0 open, 2 closed)

Related to Ruby master - Feature #2542: URI lib should be updated to RFC 3986 (Closed, naruse (Yui NARUSE), 01/01/2010)
Has duplicate Ruby master - Bug #12562: URI merge removes empty segment contrary to RFC 3986 (Closed)

Updated by knu (Akinori MUSHA) almost 11 years ago

s/RFC 2896/RFC 2396/

Updated by naruse (Yui NARUSE) over 9 years ago

  • Description updated (diff)

Updated by knu (Akinori MUSHA) over 6 years ago

  • Subject changed from uri squeezes a sequence of slashes in merging paths when it shouldn't to URI squeezes a sequence of slashes in merging paths when it shouldn't
  • Description updated (diff)
  • Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)

Updated by knu (Akinori MUSHA) over 6 years ago

Addressable::URI (of the addressable gem) properly preserves sequences of slashes in a path, so it is a workaround to use it instead.

I've confirmed that none of net/url in Go, URI in Perl, urlparse.urljoin in Python 2, or java.net.URL in Java performs this kind of unwanted normalization.

The only exception I could find, however, was urllib.parse in Python 3. (!)

% python3
Python 3.6.3 (default, Nov  4 2017, 01:15:26)
[GCC 4.2.1 Compatible FreeBSD Clang 3.8.0 (tags/RELEASE_380/final 262564)] on freebsd11
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin('http://example.com/foo//bar/baz', '.')
'http://example.com/foo/bar/'

I'm not sure if this is an intentional change from Python2, but I believe any slash in the path part should be retained.

Updated by knu (Akinori MUSHA) over 6 years ago

I've also checked the url module of node.js, and it doesn't squeeze slashes either. Its test cases do not include explicit examples of how to deal with sequences of slashes in a path, but some of the expected results of relative path resolution retain a double slash, which means double slashes are not subject to squeezing.

Looking at the WHATWG URL spec, there's no indication that a sequence of slashes in a URL path should be treated specially. A path is simply a "list" of "items" separated by slashes (/, U+002F), and any item can naturally be an empty string. Even when resolving a "double-dot segment" and consequently "removing" a path "item", you are never told to "remove" extra items that are empty.
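To illustrate the point about dot segments, here is a minimal sketch of the remove_dot_segments procedure described in RFC 3986 section 5.2.4 (the RFC's algorithm rather than the WHATWG one, and not the code in the uri library). The method name and the sample paths are chosen here for illustration. Note that nothing in the procedure ever drops an empty segment.

```ruby
# Sketch of RFC 3986 section 5.2.4 (remove_dot_segments); illustration only.
def remove_dot_segments(path)
  input  = path.dup
  output = []
  until input.empty?
    if input.start_with?('../')
      input = input[3..-1]            # A: drop leading "../"
    elsif input.start_with?('./')
      input = input[2..-1]            # A: drop leading "./"
    elsif input.start_with?('/./')
      input = input[2..-1]            # B: "/./" becomes "/"
    elsif input == '/.'
      input = '/'                     # B: trailing "/." becomes "/"
    elsif input.start_with?('/../')
      input = input[3..-1]            # C: "/../" becomes "/" ...
      output.pop                      #    ... and the last segment is removed
    elsif input == '/..'
      input = '/'
      output.pop
    elsif input == '.' || input == '..'
      input = ''                      # D: a lone dot segment disappears
    else
      # E: move the first segment (with its leading slash, if any) to output.
      # For "//bar" this segment is just "/", i.e. an empty item is kept.
      segment = input[%r{\A/?[^/]*}]
      output << segment
      input = input[segment.length..-1]
    end
  end
  output.join
end

puts remove_dot_segments('/foo//bar/.')       # /foo//bar/
puts remove_dot_segments('/foo//bar/../baz')  # /foo//baz
puts remove_dot_segments('/a/b/c/./../../g')  # /a/g
```

In the second call, the double-dot segment removes only "bar"; the empty item before it stays in place.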

So, as you can see, Ruby and Python 3 are the only exceptions. No specification indicates that a sequence of slashes in a URL path should be treated specially, and the majority of library implementations in other languages behave accordingly. I presume there are few programmers who would rely on the current behavior.

Updated by duerst (Martin Dürst) over 6 years ago

knu (Akinori MUSHA) wrote:

I presume there are few programmers who would rely on the current behavior.

I agree that there should be few programmers who would rely on consecutive slashes being collapsed to a single slash. However, I also think it's a bad idea for programmers or users to rely on multiple consecutive slashes being preserved. Using multiple consecutive slashes in a URI is a bad idea.

Updated by phluid61 (Matthew Kerwin) over 6 years ago

duerst (Martin Dürst) wrote:

Using multiple consecutive slashes in a URI is a bad idea.

It definitely doesn't play nicely with dot-segment resolution, but then I wouldn't want to bear the burden of deciding how to resolve that, one way or the other.

In this particular case, I think it is incorrect to automatically remove empty segments, but I also think it's bad to have them in the first place.

What if there was a way for the programmer to explicitly invoke the current behaviour (e.g. by sending a different message), so the side-effect is expected?

Updated by knu (Akinori MUSHA) over 6 years ago

Naruse-san, could you review the attached patch?


Updated by knu (Akinori MUSHA) over 6 years ago

  • Target version set to 2.5

Updated by knu (Akinori MUSHA) over 6 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r61218.


Allow empty path components in a URI [Bug #8352]

  • generic.rb (URI::Generic#merge, URI::Generic#route_to): Fix a bug
    where a sequence of slashes in the path part gets collapsed to a
    single slash. According to the relevant RFCs and WHATWG URL
    Standard, empty path components are simply valid and there is no
    special treatment defined for them, so we just keep them as they
    are.
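As a quick check of the behavior after this change, the example from the description can be re-run. This sketch assumes Ruby 2.5 or later (the target version set above), where the changeset is included; on older versions the empty segment would still be squeezed out.

```ruby
require 'uri'

# Same merge as in the original report; assumes Ruby >= 2.5.
base   = URI.parse('http://example.com/foo//bar/')
merged = base + '.'

puts merged  # http://example.com/foo//bar/
```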

Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago

  • Has duplicate Bug #12562: URI merge removes empty segment contrary to RFC 3986 added