Bug #11522: URI::decode returns incorrectly encoding strings - Ruby - Ruby Issue Tracking System

Bug #11522

closed

URI::decode returns incorrectly encoding strings

Added by charlieda (Charlie Anderson) almost 10 years ago. Updated almost 10 years ago.

Status: Rejected
Assignee: akira (akira yamada)
Target version:
ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN

[ruby-core:<unknown>]

Description

When a string of Unicode characters is round-tripped through URI::encode and URI::decode, the result is tagged US-ASCII but contains non-ASCII bytes, so its encoding is invalid.

irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false

I would expect decoded to have a valid encoding, probably UTF-8.

Updated by charlieda (Charlie Anderson) almost 10 years ago

Assignee set to akira (akira yamada)

Updated by nobu (Nobuyoshi Nakada) almost 10 years ago

It has no hints for encoding.

Updated by usa (Usaku NAKAMURA) almost 10 years ago

I agree with you, nobu.
But the result should be ASCII-8BIT, not US-ASCII.
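
The distinction between the two encodings can be checked directly. The following is a minimal sketch that uses the first two bytes of the decoded string from the report; the variable names are illustrative only.

```ruby
# First two bytes of the decoded output in the report ("œ" in UTF-8).
raw = "\xC5\x93".b # String#b returns a copy tagged ASCII-8BIT (binary)

as_binary = raw
as_ascii  = raw.dup.force_encoding(Encoding::US_ASCII)
as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)

puts as_binary.valid_encoding? # any byte sequence is valid as binary
puts as_ascii.valid_encoding?  # bytes >= 0x80 are not valid US-ASCII
puts as_utf8.valid_encoding?   # the same bytes are well-formed UTF-8
puts as_utf8 == "œ"
```

force_encoding only relabels the string and never changes its bytes, which is why the same data can be invalid under one label and valid under another.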

Updated by nobu (Nobuyoshi Nakada) almost 10 years ago

Status changed from Open to Rejected

Firstly, URI.unescape is obsolete.
CGI.unescape, which sets the encoding to @@accept_charset, may work for you.
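
As a rough illustration of that suggestion, the sketch below decodes a prefix of the encoded string from the report with CGI.unescape. It assumes the default accept charset (UTF-8) has not been changed.

```ruby
require 'cgi'

# Prefix of the percent-encoded string from the report ("œ´å").
encoded = "%C5%93%C2%B4%C3%A5"
decoded = CGI.unescape(encoded)

puts decoded.encoding        # tagged with the accept charset, UTF-8 by default
puts decoded.valid_encoding?

# CGI.unescape follows form-encoding rules, so '+' becomes a space.
plus = CGI.unescape("a+b")
puts plus.inspect
```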

Updated by duerst (Martin Dürst) almost 10 years ago

Nobuyoshi Nakada wrote:

It has no hints for encoding.

In theory, that's correct. In practice, there are several better possibilities.

1) We can add an additional parameter that indicates the encoding.
2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.
3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything other than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape, but 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).
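
The three ideas can be combined in a few lines. The sketch below is only an illustration: percent_decode is a hypothetical helper, not part of Ruby's URI or CGI libraries, and falling back to a binary string when the bytes do not fit the requested encoding is an assumption about what "checking" should do.

```ruby
# Hypothetical helper: decodes %XX escapes, leaves '+' untouched.
# encoding: the label to try on the decoded bytes (defaults to UTF-8).
def percent_decode(str, encoding = Encoding::UTF_8)
  # Work on bytes so that partially decoded sequences never raise.
  bytes = str.b.gsub(/%\h\h/) { |m| m[1, 2].hex.chr }

  # Keep the requested label only if the bytes are well-formed in it;
  # otherwise return the raw bytes tagged as binary (ASCII-8BIT).
  candidate = bytes.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate : bytes
end

utf8   = percent_decode("%C5%93")  # well-formed UTF-8
binary = percent_decode("%FF")     # 0xFF alone is never valid UTF-8
mailto = percent_decode("a+b%40example.org")

puts utf8.encoding
puts binary.encoding
puts mailto
```

Because '+' is not translated, a helper of this shape would suit contexts such as mailto URIs, while query strings in http(s) URIs would still need the form-decoding behaviour.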
