Bug #7742
System encoding (Windows-1258) is not recognized by Ruby to convert back to UTF-8
Description
I installed RailsInstaller on Windows 8. After the install completed, the screen switched to
the RailsInstaller configuration in cmd (step 2). I gave the user name DHH Mars and
the email dhhma...@gmail.com. It ran and printed the following message:
C:/RailsInstaller/scripts/config_check.rb:64:in 'exist?': code converter not found Encoding::ConverterNotFoundError
from C:/RailsInstaller/scripts/config_check.rb:64:in 'main'
C:\Sites>
        
           Updated by duerst (Martin Dürst) over 12 years ago
Mars (Hong Ha Dang) wrote:
> C:/RailsInstaller/scripts/config_check.rb:64:in 'exist?': code converter not found
Yes, windows-1258 (for Vietnamese) is currently not supported. The reason is that conversion from windows-1258 to UTF-8 should produce output in Unicode Normalization Form C. As an example, the sequence 0xE3 0xEC (LATIN SMALL LETTER A WITH BREVE followed by COMBINING ACUTE ACCENT) should not be converted to the sequence U+0103 U+0301, but to the single character U+1EAF (LATIN SMALL LETTER A WITH BREVE AND ACUTE).
This means that this bug depends on bug #6351. Unfortunately, I don't have time now to work on that bug; this will have to wait for March, sorry.
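The composition described above can be checked with Ruby's built-in String#unicode_normalize (available since Ruby 2.2). This is only an illustration of the NFC step on the already-decoded code points, not of the missing windows-1258 converter; the variable names are made up for the example.

```ruby
# What a one-to-one conversion of the bytes 0xE3 0xEC would yield:
# U+0103 (a with breve) followed by U+0301 (combining acute accent).
decomposed = "\u0103\u0301"

# NFC composes the pair into a single precomposed character.
composed = decomposed.unicode_normalize(:nfc)

puts composed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
# prints "U+1EAF"
```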
        
           Updated by duerst (Martin Dürst) over 12 years ago
      - Assignee set to duerst (Martin Dürst)
- Target version set to 2.6
        
           Updated by thegcat (Felix Schäfer) almost 12 years ago
We (Planio, https://plan.io) are also in need of Windows-1258 to UTF-8 conversion. Is there anything we can do to help?
        
           Updated by duerst (Martin Dürst) almost 12 years ago
thegcat (Felix Schäfer) wrote:
> We (Planio, https://plan.io) are also in need of Windows-1258 to UTF-8 conversion. Is there anything we can do to help?
As explained above, the problem is with normalization. If you are okay with a version that just does one-to-one conversion, then that can be produced rather quickly (maybe even over the weekend). But most Vietnamese content, e.g. on the Web, is normalized (NFC), and I guess you'd want to have that, too. But then you also have to be careful with respect to round-tripping, because windows-1258->UTF-8 will be .encode('UTF-8', 'windows-1258').to_nfc or so, but backwards conversion would need special code because neither NFC nor NFD can directly be converted to windows-1258.
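To make the round-tripping concern concrete, the following sketch compares the three shapes of the same character using String#unicode_normalize and String#unicode_normalized?. The name windows_1258_shape is only a label for this example: it stands for the code points that correspond one-to-one to the bytes 0xE3 0xEC.

```ruby
nfc = "\u1EAF"                       # precomposed: a with breve and acute
nfd = nfc.unicode_normalize(:nfd)    # fully decomposed form
windows_1258_shape = "\u0103\u0301"  # a-with-breve + combining acute

show = ->(s) { s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ") }

puts "NFC:          #{show.call(nfc)}"
puts "NFD:          #{show.call(nfd)}"
puts "windows-1258: #{show.call(windows_1258_shape)}"

# The one-to-one shape is canonically equivalent to the other two,
# but it is in neither normalization form itself.
puts windows_1258_shape.unicode_normalized?(:nfc) # false
puts windows_1258_shape.unicode_normalized?(:nfd) # false
```

NFD splits the breve off the base letter, while NFC folds the tone mark in, so a backwards converter would have to produce an intermediate shape that neither form delivers directly.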
A slightly more elaborate version would do one-to-one conversion from windows-1258 to UTF-8, but would convert both that kind of data and data in NFC (but not arbitrary non-normalized data) back to windows-1258. Such a converter might be relatively easy to produce, or it might be more difficult; I can't say which off the top of my head.
So if you use a normalization library after conversion, that might work out, but it would be somewhat of a special case. Also, when we later introduce a different (more normalizing) converter, that may be seen as a non-backwards-compatible change.
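As a rough sketch of the "one-to-one conversion, then normalize" workaround: the table and method name below are hypothetical and deliberately incomplete. They cover only ASCII and the two bytes mentioned in this thread, so this is not a usable windows-1258 decoder, only an outline of the two steps.

```ruby
# Hypothetical partial table: just the two non-ASCII bytes from this ticket.
PARTIAL_CP1258 = {
  0xE3 => 0x0103, # LATIN SMALL LETTER A WITH BREVE
  0xEC => 0x0301  # COMBINING ACUTE ACCENT
}.freeze

# Step 1: map each byte to exactly one code point (no normalization).
def cp1258_to_utf8_one_to_one(bytes)
  codepoints = bytes.bytes.map do |b|
    next b if b < 0x80 # ASCII passes through unchanged
    PARTIAL_CP1258.fetch(b) do
      raise ArgumentError, format("byte 0x%02X is not in the partial table", b)
    end
  end
  codepoints.pack("U*") # pack("U*") returns a UTF-8 string
end

raw = "\xE3\xEC".b
one_to_one = cp1258_to_utf8_one_to_one(raw)

# Step 2: normalize afterwards, as a separate step.
normalized = one_to_one.unicode_normalize(:nfc)

puts one_to_one.codepoints.length # 2
puts normalized.codepoints.length # 1
```

Keeping the two steps separate means the caller, not the converter, decides whether the result is normalized, which is the special case the comment above warns about.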
One solution to backwards-compatibility would be to use different encoding labels to differentiate versions of conversion. But this has the problem that in the current state of affairs, it introduces additional "encodings" that are not really different, but just variants produced by different conversions. That's the problem e.g. with the current UTF8-MAC, and I don't want to create more of these.
A more long-term solution would be to introduce a difference between encodings and conversions, where e.g. one could use .encode('windows-1258--non-normalized', 'utf-8') or so to indicate a non-normalized version of conversion. But that would need some more general discussion among the Ruby experts in this field.
So Felix, if you tell me what you need, and we can make sure that it doesn't affect later backwards-compatibility, I might be able to work on something.
        
           Updated by phasis68 (Heesob Park) almost 12 years ago
As far as I know, VISCII (Vietnamese Standard Code for Information Interchange) can round-trip with UTF-8. So the implementation of a converter between VISCII and UTF-8 might be easy.
I am not sure, but if a converter between Windows-1258 and VISCII is possible, Windows-1258 can be supported via VISCII:
Windows-1258 <-> VISCII <-> UTF-8
Anyway, it would be nice if Ruby supported the VISCII encoding.
        
           Updated by duerst (Martin Dürst) almost 12 years ago
phasis68 (Heesob Park) wrote:
> As far as I know, VISCII (Vietnamese Standard Code for Information Interchange) can round-trip with UTF-8. So the implementation of a converter between VISCII and UTF-8 might be easy.
Yes, it should be easy. Can you open a separate ticket? I'll give it a try over the weekend.
> I am not sure, but if a converter between Windows-1258 and VISCII is possible, Windows-1258 can be supported via VISCII.
Conversion between Windows-1258 and VISCII is actually as difficult as conversion between Windows-1258 and NFC-normalized UTF-8, which is the most difficult variant, as I have explained above.
        
           Updated by naruse (Yui NARUSE) almost 8 years ago
      - Target version deleted (2.6)
        
           Updated by JesseJohnson (Jesse Johnson) almost 2 years ago
If I understand correctly, this test case should convert correctly and not raise an Encoding::ConverterNotFoundError:
"\xE3\xEC".force_encoding(Encoding::Windows_1258).encode(Encoding::UTF_8)
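To try the test case on a particular Ruby build without aborting the script, it can be wrapped as below. The list of acceptable outcomes is an assumption drawn from this thread (a normalized result, a one-to-one result, or the reported error); the thread does not settle which of the two conversions a future converter would produce.

```ruby
# The two bytes from the test case, tagged as Windows-1258 (not yet transcoded).
input = "\xE3\xEC".b.force_encoding(Encoding::Windows_1258)

outcome =
  begin
    input.encode(Encoding::UTF_8).codepoints
  rescue Encoding::ConverterNotFoundError => e
    # The behaviour reported in this ticket: no converter is registered.
    e.class
  end

# Assumed possibilities, based on the discussion above.
ACCEPTABLE = [
  [0x1EAF],                        # NFC-normalized conversion
  [0x0103, 0x0301],                # one-to-one conversion
  Encoding::ConverterNotFoundError # converter still missing
].freeze

puts outcome.inspect
```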
        
           Updated by hsbt (Hiroshi SHIBATA) over 1 year ago
      - Status changed from Open to Assigned