Bug #15210

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago

 Hi everyone working on Ruby trunk, 

 I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data. 

 We import CSV files from PayPal. 
 They now include a BOM at the front of their UTF-8 encoded CSV data. 
 This BOM is causing trouble. 

 I believe this to be a bug in how byte data is converted to Ruby's internal String representation. 

 There is a workaround, but this needs to be documented: 
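
 A minimal sketch of one way to drop the BOM while reading, assuming the data comes from a file: Ruby's `BOM|UTF-8` read mode consumes a leading BOM before the String is built. The temp file and its contents are made up for illustration.

```ruby
require "tempfile"

# Hypothetical sample: a UTF-8 BOM (EF BB BF) followed by CSV text.
file = Tempfile.new(["bom_sample", ".csv"])
file.binmode
file.write("\xEF\xBB\xBF".b + "id,name\n1,Alice\n".b)
file.close

# Plain UTF-8 read: the BOM stays in the String as U+FEFF.
plain = File.read(file.path, encoding: "UTF-8")

# "BOM|UTF-8" read mode: a leading BOM is skipped while reading.
stripped = File.open(file.path, "r:BOM|UTF-8", &:read)

puts plain.start_with?("\uFEFF")     # true
puts stripped.start_with?("\uFEFF")  # false

file.unlink
```

 This only helps when the code controls how the file is opened. Data that arrives as an in-memory String (for example from an HTTP response) still carries the BOM.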


 But I'm asking to improve the UTF BOM handling: 
 - The BOM is only used for transfer encoding at the byte stream level. 
 - The BOM MUST NOT be part of the String in internal representation. 
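
 To make the two points above concrete, here is a small sketch of the current behaviour with made-up sample bytes: when BOM-prefixed bytes are tagged as UTF-8, the BOM survives as an ordinary first character (U+FEFF) and changes length and equality checks.

```ruby
# Hypothetical input: the three BOM bytes followed by "id,name".
raw  = "\xEF\xBB\xBFid,name".b
text = raw.dup.force_encoding(Encoding::UTF_8)

p text.valid_encoding?                        # true
p text.chars.first == "\uFEFF"                # true
p text.length                                 # 8 (7 visible characters plus the BOM)
p text == "id,name"                           # false
p text.delete_prefix("\uFEFF") == "id,name"   # true
```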


 BTW: the stdlib CSV library chokes on the BOM. 
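
 For the CSV case, a hedged sketch: it assumes `CSV.read` forwards the `encoding:` option to the underlying file open, which is how the csv library behaves in the versions I am aware of. The file contents are invented for the example.

```ruby
require "csv"
require "tempfile"

# Hypothetical BOM-prefixed CSV file.
csv_file = Tempfile.new(["bom_csv_sample", ".csv"])
csv_file.binmode
csv_file.write("\xEF\xBB\xBF".b + "id,name\n1,Alice\n".b)
csv_file.close

# Ask for BOM-aware decoding so the first header is "id", not "\uFEFFid".
table = CSV.read(csv_file.path, headers: true, encoding: "BOM|UTF-8")

p table.headers        # ["id", "name"]
p table.first["name"]  # "Alice"

csv_file.unlink
```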

 I'd like to add some code for a workaround: 

 class String 

     # delete UTF Byte Order Mark from string 
     # returns self (even if no bom was found, contrary to delete_prefix!) 
     # NOTE: use with care: better remove the bom when reading the file 
     def delete_bom! 
         raise 'encoding is not UTF-8' unless self.encoding == Encoding::UTF_8 
         delete_prefix!("\uFEFF") 
         return self 
     end 

     # returns a copy of string with UTF Byte Order Mark deleted from string 
     def delete_bom