Bug #11522
Closed
URI::decode returns incorrectly encoded strings
Description
When given unicode characters to encode and decode, the URI module returns a string with an invalid encoding.
irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false
I would expect decoded to have a valid encoding - probably as UTF-8?
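The session above can be reduced to a smaller sketch that does not depend on URI::decode at all. It assumes only that the decoded bytes are the UTF-8 bytes of the original string, as the output above shows, and it uses a single character ("œ") for brevity. Retagging the string does not change its bytes, only how they are interpreted.

```ruby
# The bytes URI::decode produced for "œ", tagged as US-ASCII as in the report.
decoded = "\xC5\x93".dup.force_encoding(Encoding::US_ASCII)
puts decoded.valid_encoding?   # false: bytes above 0x7F are not valid US-ASCII

# Retag a copy as UTF-8; the underlying bytes are untouched.
utf8 = decoded.dup.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?      # true
puts utf8 == "\u0153"          # true
```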
Updated by charlieda (Charlie Anderson) over 9 years ago
- Assignee set to akira (akira yamada)
Updated by nobu (Nobuyoshi Nakada) over 9 years ago
It has no hints for encoding.
Updated by usa (Usaku NAKAMURA) over 9 years ago
I agree with you, nobu.
But, it should be ASCII-8BIT, not US-ASCII.
Updated by nobu (Nobuyoshi Nakada) over 9 years ago
- Status changed from Open to Rejected
Firstly, URI.unescape is obsolete. CGI.unescape, which sets the encoding to @@accept_charset, may work for you.
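A short sketch of that suggestion, assuming the accept charset is still at its default of UTF-8 and using a shortened input ("œ´å") rather than the full string from the report:

```ruby
require "cgi"

encoded = "%C5%93%C2%B4%C3%A5"   # percent-encoded UTF-8 bytes for "œ´å"
decoded = CGI.unescape(encoded)

puts decoded.encoding == Encoding::UTF_8   # true
puts decoded.valid_encoding?               # true
puts decoded == "\u0153\u00B4\u00E5"       # true
```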
Updated by duerst (Martin Dürst) over 9 years ago
Nobuyoshi Nakada wrote:
It has no hints for encoding.
In theory, that's correct. In practice, there are several better possibilities.
1) We can add an additional parameter that indicates the encoding.
2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.
3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything else than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape. But 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).
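The three ideas could be combined roughly as follows. This is only a sketch: percent_decode is a hypothetical helper, not part of URI or CGI, and falling back to a binary (ASCII-8BIT) string when the bytes are not valid in the requested encoding is an assumption based on the earlier comment in this thread.

```ruby
require "cgi"

# Hypothetical helper, not part of URI or CGI.
# encoding: optional parameter (idea 1), defaulting to UTF-8 (idea 2).
def percent_decode(str, encoding = Encoding::UTF_8)
  # Replace each %XX escape with its raw byte; "+" is left untouched.
  bytes = str.b.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
  # Idea 3: keep the requested encoding only if the bytes are valid in it.
  candidate = bytes.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate : bytes
end

puts percent_decode("%C5%93") == "\u0153"                 # true
puts percent_decode("%C5%93").encoding == Encoding::UTF_8 # true
puts percent_decode("%FF").encoding == Encoding::BINARY   # true: not valid UTF-8
puts percent_decode("a+b")                                # a+b
puts CGI.unescape("a+b")                                  # a b
```

Leaving '+' alone keeps the helper usable for non-query contexts such as mailto URIs; callers that decode query strings would still need to handle '+' themselves.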