Bug #8352

URI squeezes a sequence of slashes in merging paths when it shouldn't

Added by knu (Akinori MUSHA) over 4 years ago. Updated 3 days ago.

Status:
Open
Priority:
Normal
Target version:
-
ruby -v:
ruby 2.1.0dev (2013-05-01 trunk 40540) [x86_64-freebsd9]
Backport:
[ruby-core:54729]

Description

Neither RFC 2396 (on which the library is currently based) nor RFC 3986 says anything about a sequence of slashes in the path part, apart from the parsing rules for a URI (path) that starts with two slashes.

It should be perfectly valid to have one slash right after another, and there is no reason to "normalize" a sequence of slashes into a single slash, which is what uri actually does when merging paths:

URI.parse('http://example.com/foo//bar/')+'.'
=> #<URI::HTTP:0x0000080303d2b0 URL:http://example.com/foo/bar/>

Fixing this may be as easy as changing the regexp in URI::Generic#split_path from %r{/+} to %r{/}, but I wonder how large the compatibility impact of that change would be.
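To make the difference between the two patterns concrete, here is a minimal sketch that uses plain String#split rather than calling URI::Generic#split_path itself. The variable names are illustrative only, and the -1 limit is an assumption made here so that trailing empty strings are kept.

```ruby
# Illustration only: plain String#split with the two patterns discussed above.
# This does not call URI::Generic#split_path; the names are hypothetical.
path = '/foo//bar/'

# %r{/+} treats a run of slashes as one separator, so the empty segment
# between "foo" and "bar" disappears.
squeezed = path.split(%r{/+}, -1)

# %r{/} treats every slash as a separator, so the empty segment survives.
preserved = path.split(%r{/}, -1)

p squeezed              # ["", "foo", "bar", ""]
p preserved             # ["", "foo", "", "bar", ""]

puts squeezed.join('/')   # /foo/bar/
puts preserved.join('/')  # /foo//bar/
```

Joining the segments back with a single slash reproduces the original path only in the second case.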


Related issues

Related to Ruby trunk - Feature #2542: URI lib should be updated to RFC 3986 (Closed, 2010-01-01)

History

#1 [ruby-core:54730] Updated by knu (Akinori MUSHA) over 4 years ago

s/RFC 2896/RFC 2396/

#2 [ruby-core:66033] Updated by naruse (Yui NARUSE) about 3 years ago

  • Description updated (diff)

#3 Updated by knu (Akinori MUSHA) 4 days ago

  • Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)
  • Description updated (diff)
  • Subject changed from uri squeezes a sequence of slashes in merging paths when it shouldn't to URI squeezes a sequence of slashes in merging paths when it shouldn't

#4 [ruby-core:83784] Updated by knu (Akinori MUSHA) 3 days ago

Addressable::URI (of the addressable gem) properly preserves sequences of slashes in a path, so using it instead is a workaround.

I've confirmed that net/url of Go, URI of Perl, urlparse.urljoin of Python2, and java.net.URL of Java never do this kind of unwanted normalization.

The only exception I could find, however, was urllib.parse of Python3. (!)

% python3
Python 3.6.3 (default, Nov  4 2017, 01:15:26)
[GCC 4.2.1 Compatible FreeBSD Clang 3.8.0 (tags/RELEASE_380/final 262564)] on freebsd11
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin('http://example.com/foo//bar/baz', '.')
'http://example.com/foo/bar/'

I'm not sure if this is an intentional change from Python2, but I believe any slash in the path part should be retained.

#5 [ruby-core:83785] Updated by knu (Akinori MUSHA) 3 days ago

I've also checked the url module of node.js, and it doesn't do this either. Its test cases include no explicit examples of how to deal with sequences of slashes in a path, but several expected results of relative path resolution retain a double slash, which means double slashes are not subject to squeezing.

Looking into the WHATWG URL spec, there's no indication that a sequence of slashes in a URL path should be treated specially. A path is simply a "list" of "items" separated by the slash (/, U+002F), and any item can naturally be an empty string. Even when resolving a "double-dot segment" and consequently "removing" a path "item", you are never told to "remove" extra items that are empty.
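The list-of-items model can be sketched in a few lines of Ruby. This is only an illustration: remove_dot_segments and merge_paths are hypothetical helpers, not part of the uri library, and the sketch assumes the base path contains at least one slash and the reference is a relative path (no scheme, authority, query or leading slash).

```ruby
# Hypothetical helpers, not part of the uri library. They treat a path as a
# list of items separated by "/", where an item may be the empty string.

# Resolve "." and ".." items without ever dropping empty items.
def remove_dot_segments(path)
  segments = path.split('/', -1)   # -1 keeps trailing empty items
  out = []
  segments.each_with_index do |seg, i|
    last = (i == segments.length - 1)
    case seg
    when '.'
      out << '' if last            # a trailing "." leaves a trailing slash
    when '..'
      out.pop if out.length > 1    # never pop the leading root item
      out << '' if last            # a trailing ".." leaves a trailing slash
    else
      out << seg                   # ordinary items, including "", are kept
    end
  end
  out.join('/')
end

# Merge a relative reference onto a base path.
# Assumes base_path contains a slash and ref does not start with one.
def merge_paths(base_path, ref)
  dir = base_path[0..base_path.rindex('/')]  # base up to its last slash
  remove_dot_segments(dir + ref)
end

puts merge_paths('/foo//bar/', '.')      # /foo//bar/
puts merge_paths('/foo//bar/baz', '.')   # /foo//bar/
puts merge_paths('/a/b/c', '../d')       # /a/d
```

The only place an item is removed is the ".." branch, which pops exactly one item; nothing in the sketch inspects whether an item is empty.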

So, as you can see, Ruby and Python3 are the only exceptions. No specification indicates that a sequence of slashes in a URL path should be treated specially, and the majority of library implementations in other languages behave accordingly. I presume few programmers rely on the current behavior.
