Bug #11522

URI::decode returns incorrectly encoded strings

Added by charlieda (Charlie Anderson) over 4 years ago. Updated over 4 years ago.

Target version:
ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]


When a string of Unicode characters is encoded and then decoded, the URI module returns a string whose encoding is invalid.

irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false

I would expect decoded to have a valid encoding, probably UTF-8.
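The decoded bytes in the session above are intact; only the encoding tag is wrong. A minimal sketch of retagging them follows. The helper name percent_decode is hypothetical and only stands in for URI::decode as it behaves in this report (it is not a standard library method), so the snippet does not depend on any particular Ruby version.

```ruby
# Hypothetical stand-in for URI::decode as reported above: it decodes %XX
# escapes into raw bytes and tags the result US-ASCII.
def percent_decode(str)
  str.b.gsub(/%([0-9A-Fa-f]{2})/) { [$1].pack("H2") }
     .force_encoding(Encoding::US_ASCII)
end

decoded = percent_decode("%C5%93%C2%B4%C3%A5")
decoded.valid_encoding?   # false: bytes above 0x7F are not valid US-ASCII

# Retag the same bytes as UTF-8; no bytes are changed.
fixed = decoded.dup.force_encoding(Encoding::UTF_8)
fixed.valid_encoding?     # true
```

This only works when the caller already knows the bytes are UTF-8.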


Updated by charlieda (Charlie Anderson) over 4 years ago

  • Assignee set to akira (akira yamada)

Updated by nobu (Nobuyoshi Nakada) over 4 years ago

It has no hints for encoding.


Updated by usa (Usaku NAKAMURA) over 4 years ago

I agree with you, nobu.
However, the result should be ASCII-8BIT, not US-ASCII.
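A short sketch of why the tag matters: the same bytes are reported as valid when tagged ASCII-8BIT (binary) and as invalid when tagged US-ASCII, because US-ASCII only permits bytes up to 0x7F. The variable names are illustrative.

```ruby
# Two bytes that happen to be the UTF-8 encoding of one character.
raw = "\xC5\x93".b                  # String#b returns an ASCII-8BIT copy

raw.encoding                        # ASCII-8BIT
raw.valid_encoding?                 # true: any byte sequence is valid binary

as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)
as_ascii.valid_encoding?            # false: 0xC5 and 0x93 exceed 0x7F
```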


Updated by nobu (Nobuyoshi Nakada) over 4 years ago

  • Status changed from Open to Rejected

Firstly, URI.unescape is obsolete.
CGI.unescape, which sets the encoding to @@accept_charset, may work for you.
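For illustration, a minimal sketch of the suggested alternative, assuming the cgi library is loadable via require "cgi" and that the accept charset has been left at its default of UTF-8:

```ruby
require "cgi"

escaped = "%C5%93%C2%B4%C3%A5"

decoded = CGI.unescape(escaped)
decoded.encoding          # UTF-8, assuming the default accept charset
decoded.valid_encoding?   # true

# An explicit encoding can be passed as the second argument.
binary = CGI.unescape(escaped, Encoding::ASCII_8BIT)
binary.encoding           # ASCII-8BIT
```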


Updated by duerst (Martin Dürst) over 4 years ago

Nobuyoshi Nakada wrote:

It has no hints for encoding.

In theory, that's correct. In practice, there are several better possibilities.

1) We can add an additional parameter that indicates the encoding.

2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.

3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything other than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape, but 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).
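The three options could be combined roughly as follows. This is only a sketch: unescape_with_fallback is a hypothetical helper, not an existing or proposed URI method, and falling back to ASCII-8BIT when the bytes are not valid in the requested encoding is an assumption made here for illustration. It leaves '+' untouched.

```ruby
# Hypothetical helper, not part of Ruby's standard library.
# Option 1: the caller may pass an encoding.
# Option 2: the encoding defaults to UTF-8.
def unescape_with_fallback(str, encoding = Encoding::UTF_8)
  # Decode %XX escapes into raw bytes; "+" is deliberately not converted.
  bytes = str.b.gsub(/%([0-9A-Fa-f]{2})/) { [$1].pack("H2") }
  candidate = bytes.dup.force_encoding(encoding)
  # Option 3: keep the requested encoding only if the bytes are valid in it;
  # otherwise return the raw bytes tagged ASCII-8BIT (an assumption).
  candidate.valid_encoding? ? candidate : bytes
end

utf8   = unescape_with_fallback("%C5%93+%E2%88%91")  # valid UTF-8, "+" kept
latin1 = unescape_with_fallback("caf%E9")            # 0xE9 alone is not valid UTF-8
```

Here utf8 comes back tagged UTF-8, while latin1 comes back as ASCII-8BIT, leaving the choice of encoding to the caller.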
