Bug #15210: UTF-8 BOM should be removed from String in internal representation - Ruby - Ruby Issue Tracking System

Bug #15210

Updated by nobu (Nobuyoshi Nakada) over 6 years ago

 Hi everyone working on the ruby trunk, 

 I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data. 

 We import some CSV files from PayPal. 
 They now include a BOM in front of their UTF-8 encoded CSV data. 
 This BOM is causing problems. 

 I believe this to be a bug in how byte data is converted to Ruby's internal String representation. 

 There is a workaround, but this needs to be documented: 
 ```ruby 
 IO.read(filename, mode: 'r:BOM|UTF-8')  # filename: path of the file to read 
 ``` 
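 To make the difference concrete, here is a minimal sketch comparing a plain UTF-8 read with a `BOM|UTF-8` read. The temp file and its sample contents are made up for illustration; only `Tempfile` and `File.read` from the standard library are assumed.

 ```ruby
 require 'tempfile'

 bom = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

 # Write a small UTF-8 file that starts with a BOM (hypothetical sample data).
 file = Tempfile.new(['bom_sample', '.csv'])
 file.binmode
 file.write("\xEF\xBB\xBFid,name\n1,eike\n".b)
 file.close

 plain    = File.read(file.path, mode: 'r:UTF-8')      # BOM stays in the String
 stripped = File.read(file.path, mode: 'r:BOM|UTF-8')  # BOM is consumed on open

 puts plain.start_with?(bom)     # the BOM is still the first character
 puts stripped.start_with?(bom)  # the BOM is gone
 puts stripped.encoding

 file.unlink
 ```

 With the plain mode the BOM ends up as the first character of the String, which is the behaviour this report is about.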


 --- 

 But I'm asking to improve the UTF-8 BOM handling: 
 - The BOM is only used for transfer encoding at the byte stream level. 
 - The BOM MUST NOT be part of the String in internal representation. 


 --- 

 BTW: stdlib::CSV chokes on the BOM 
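
 For the CSV case, one way to sidestep the problem today is to let the IO layer consume the BOM and hand the clean text to CSV. This is only a sketch under assumptions: the file contents and header names are invented, and it uses `CSV.parse` on an already-read String rather than any BOM-specific CSV option.

 ```ruby
 require 'csv'
 require 'tempfile'

 # Hypothetical sample export: UTF-8 with a leading BOM.
 file = Tempfile.new(['export', '.csv'])
 file.binmode
 file.write("\xEF\xBB\xBFid,name\n1,eike\n".b)
 file.close

 # Let the IO layer strip the BOM, then parse the clean text.
 text  = File.read(file.path, mode: 'r:BOM|UTF-8')
 table = CSV.parse(text, headers: true)

 puts table.headers.inspect
 puts table[0]['id']

 file.unlink
 ```

 Because the BOM is removed before parsing, the first header can be looked up by its plain name.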

 I'd like to add some code for a workaround: 


 ```ruby 
 class String 

     # delete UTF Byte Order Mark from string 
     # returns self (even if no bom was found, contrary to delete_prefix!) 
     # NOTE: use with care: better remove the bom when reading the file 
     def delete_bom! 
         raise 'encoding is not UTF-8' unless self.encoding == Encoding::UTF_8 
         delete_prefix!("\xEF\xBB\xBF") 
         return self 
     end 


     # returns a copy of string with UTF Byte Order Mark deleted from string 
     def delete_bom 
         dup.delete_bom! 
     end 

 end 
 ``` 
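
 The comment about `delete_prefix!` refers to its return value: it returns `nil` when the prefix is absent, which is why the patch returns `self` explicitly. A small sketch of that difference, using only core String methods and made-up sample strings:

 ```ruby
 bom = "\uFEFF"

 with_bom    = "#{bom}id,name"
 without_bom = "id,name"

 # Bang version: returns the modified receiver, or nil if nothing was removed.
 removed   = with_bom.dup.delete_prefix!(bom)
 unchanged = without_bom.dup.delete_prefix!(bom)

 # Non-bang version: always returns a String, changed or not.
 safe = without_bom.delete_prefix(bom)

 puts removed.inspect
 puts unchanged.inspect
 puts safe.inspect
 ```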


 --- 
 ~eike
