Bug #11522
closed
URI::decode returns incorrectly encoded strings
Description
When given Unicode characters to encode and then decode, the URI module returns a string with an invalid encoding.
irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false
I would expect decoded to have a valid encoding - probably as UTF-8?
        
Updated by charlieda (Charlie Anderson) about 10 years ago
- Assignee set to akira (akira yamada)
        
Updated by nobu (Nobuyoshi Nakada) about 10 years ago
It has no hints for encoding.
        
Updated by usa (Usaku NAKAMURA) about 10 years ago
I agree with you, nobu.
But, it should be ASCII-8BIT, not US-ASCII.
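
The distinction between the two encodings can be illustrated with a short sketch. It takes the first two decoded bytes from the report ("\xC5\x93") and labels them three ways; the variable names are illustrative only and not part of any library.

```ruby
# The bytes for "œ" as they appear in the decoded output above.
bytes = "\xC5\x93"

# US-ASCII only covers bytes 0x00-0x7F, so high bytes make the string invalid.
as_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)

# ASCII-8BIT (binary) accepts any byte value, so the same bytes are valid.
as_binary = bytes.dup.force_encoding(Encoding::ASCII_8BIT)

# Labelled as UTF-8, the bytes form one valid character.
as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)

puts as_ascii.valid_encoding?   # false
puts as_binary.valid_encoding?  # true
puts as_utf8.valid_encoding?    # true
```

force_encoding only changes the label on the string; the bytes themselves are identical in all three cases.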
        
Updated by nobu (Nobuyoshi Nakada) about 10 years ago
- Status changed from Open to Rejected
Firstly, URI.unescape is obsolete.
CGI.unescape, which sets the encoding to @@accept_charset, may work for you.
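
A minimal sketch of this suggestion, assuming the default @@accept_charset of UTF-8 has not been changed. The input is the first few percent-encoded characters from the report; the variable names are illustrative.

```ruby
require 'cgi'

# Percent-encoded UTF-8 bytes for "œå∑", taken from the start of the report's string.
encoded = "%C5%93%C3%A5%E2%88%91"

# With no second argument, CGI.unescape tags the result with the default charset.
decoded = CGI.unescape(encoded)

puts decoded.encoding         # UTF-8, assuming the default charset
puts decoded.valid_encoding?

# The target encoding can also be passed explicitly as the second argument.
explicit = CGI.unescape(encoded, "UTF-8")

# Note that CGI.unescape also turns '+' into a space.
plus = CGI.unescape("a+b")
```

The '+' handling matters if the input is not a query string, since a literal plus sign would be lost.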
        
Updated by duerst (Martin Dürst) about 10 years ago
Nobuyoshi Nakada wrote:
It has no hints for encoding.
In theory, that's correct. In practice, there are several better possibilities.
1. We can add an additional parameter that indicates the encoding.
2. We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.
3. We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything else than UTF-8 is virtually 0.

1) and 2) are already done by CGI.unescape. But 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).
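
The three possibilities could be combined roughly as follows. This is only a sketch under stated assumptions: percent_decode is a hypothetical helper, not part of URI or CGI, and falling back to ASCII-8BIT when the bytes do not fit the requested encoding is one possible choice, following the ASCII-8BIT suggestion made earlier in this thread.

```ruby
# Hypothetical helper (not a URI or CGI method) sketching the three ideas:
#   1. an optional encoding parameter,
#   2. UTF-8 as the default,
#   3. a validity check before the encoding label is applied.
def percent_decode(str, encoding = Encoding::UTF_8)
  # Decode %XX escapes on a binary copy; '+' is deliberately left untouched.
  bytes = str.b.gsub(/%\h\h/) { |m| m[1, 2].hex.chr }

  # Label the bytes with the requested encoding only if they are valid in it.
  candidate = bytes.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate : bytes  # fallback stays ASCII-8BIT
end

ok      = percent_decode("%C5%93")                        # valid UTF-8 bytes
mailto  = percent_decode("a+b")                           # '+' is preserved
invalid = percent_decode("%FF")                           # not valid UTF-8
latin1  = percent_decode("%E9", Encoding::ISO_8859_1)     # explicit encoding

puts ok.encoding       # UTF-8
puts invalid.encoding  # ASCII-8BIT (shown as BINARY on newer Rubies)
```

Because the helper never rewrites '+', it suits contexts such as mailto URIs; query strings would still need the '+' to space conversion applied separately.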