UTF-8 BOM should be removed from String in internal representation
Hi everyone working on the ruby trunk,
I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.
We import CSV files from PayPal.
They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing problems.
I believe this to be a bug in how byte data is converted to the ruby internal String representation.
There is a workaround, but this needs to be documented:
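The workaround itself is not shown here. One that Ruby does provide, and possibly the one meant, is the "BOM|UTF-8" external encoding in the file open mode, which skips a leading UTF-8 BOM while reading. A minimal sketch, where the helper name read_utf8_without_bom and the sample data are my own assumptions for illustration:

```ruby
require "tempfile"

# Hypothetical helper: read a UTF-8 text file, dropping a leading BOM if present.
def read_utf8_without_bom(path)
  # "BOM|UTF-8" makes Ruby check the start of the stream for a UTF-8 BOM,
  # skip it if found, and tag the resulting String as UTF-8.
  File.open(path, "r:BOM|UTF-8", &:read)
end

# Demo with made-up sample data standing in for a BOM-prefixed CSV export.
Tempfile.create("bom_demo") do |file|
  file.binmode
  file.write("\xEF\xBB\xBF".b + "Date,Amount\n".b)
  file.flush

  content = read_utf8_without_bom(file.path)
  p content.start_with?("\uFEFF")            # => false
  p content.encoding == Encoding::UTF_8      # => true
end
```

This only helps when the data comes from a file or IO that you open yourself; a String that already contains the BOM is not touched.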
But I'm asking to improve the UTF-8 BOM handling:
- The BOM is only used for transfer encoding at the byte stream level.
- The BOM MUST NOT be part of the String in internal representation.
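To illustrate the two points above, here is a small sketch of what happens today when BOM-prefixed bytes are simply tagged as UTF-8: the BOM survives as the character U+FEFF at the start of the String. The helper name bytes_to_utf8 and the sample bytes are assumptions for illustration only:

```ruby
# Hypothetical helper: reinterpret raw bytes as UTF-8 text, the way a plain
# binary read followed by force_encoding would.
def bytes_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

bom = "\uFEFF" # U+FEFF, encoded in UTF-8 as the bytes EF BB BF

# Made-up sample: the three BOM bytes followed by an ASCII header line.
raw  = "\xEF\xBB\xBF".b + "Date,Amount".b
text = bytes_to_utf8(raw)

p text.start_with?(bom)                      # => true
p text.length                                # => 12 (11 visible characters + BOM)
p text == "Date,Amount"                      # => false
p text.delete_prefix(bom) == "Date,Amount"   # => true
```

The BOM is invisible when printed, which is what makes the failed comparison hard to spot.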
BTW: the stdlib CSV library chokes on the BOM.
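A sketch of how this can show up with CSV, assuming a BOM-prefixed String is handed to CSV.parse with headers: true. The sample data and the helper name parse_csv_without_bom are made up for illustration, and the exact behaviour may vary between csv versions, so no output is claimed for the first parse:

```ruby
require "csv"

# Hypothetical helper: strip a leading U+FEFF before handing the data to CSV.
def parse_csv_without_bom(data)
  CSV.parse(data.delete_prefix("\uFEFF"), headers: true)
end

bom  = "\uFEFF"
data = "#{bom}Date,Amount\n2020-01-01,10.00\n" # made-up sample export

# Parsing the data as-is: inspect the first header and a lookup by its
# visible name to see whether the BOM leaked into the header.
table = CSV.parse(data, headers: true)
p table.headers
p table.first["Date"]

# Parsing after removing the BOM.
clean = parse_csv_without_bom(data)
p clean.headers          # => ["Date", "Amount"]
p clean.first["Date"]    # => "2020-01-01"
```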
I'd like to add some code for a workaround:
class String
  # Delete the UTF-8 Byte Order Mark from the string.
  # Returns self (even if no BOM was found, contrary to delete_prefix!).
  # NOTE: use with care: better to remove the BOM when reading the file.
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8
    delete_prefix!("\xEF\xBB\xBF")
    self
  end

  # Returns a copy of the string with the UTF-8 Byte Order Mark deleted.
  def delete_bom
    dup.delete_bom!
  end
end