Bug #15210

closed

UTF-8 BOM should be removed from String in internal representation

Added by foonlyboy (Eike Dierks) almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: docs
Target version: -
[ruby-core:89298]

Description

Hi everyone working on the ruby trunk,

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.

We import some CSV from PayPal.
They now include a BOM in front of their UTF-8 encoded CSV data.
This BOM is causing some trouble.

I believe this to be a bug in how byte data is converted to Ruby's internal String representation.

There is a workaround, but this needs to be documented:

IO.read(path, mode: 'r:BOM|UTF-8')
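
To see what the workaround changes, here is a minimal sketch. The helper name `read_bom_sample` and the sample data are made up for illustration. It uses `File.read`, which is the same method inherited from `IO`, and compares a plain UTF-8 read with a `BOM|UTF-8` read of the same bytes.

```ruby
require 'tempfile'

# Hypothetical helper (not part of Ruby): writes BOM-prefixed UTF-8 bytes
# to a temp file and returns the content read back with the given mode.
def read_bom_sample(mode)
  Tempfile.create(['bom_demo', '.csv']) do |f|
    f.binmode
    f.write("\xEF\xBB\xBF".b + "name,amount\n") # UTF-8 BOM, then sample data
    f.close
    File.read(f.path, mode: mode)
  end
end

plain    = read_bom_sample('r:UTF-8')     # BOM is kept as U+FEFF
stripped = read_bom_sample('r:BOM|UTF-8') # BOM is consumed while reading

puts plain.start_with?("\uFEFF")    # true
puts stripped.start_with?("\uFEFF") # false
puts stripped == "name,amount\n"    # true
```

Both strings are tagged as UTF-8; the only difference is the leading U+FEFF character.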

But I'm asking to improve the UTF-8 BOM handling:

  • The BOM is only used for transfer encoding at the byte stream level.
  • The BOM MUST NOT be part of the String in internal representation.
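
As a sketch of handling the BOM at the byte stream level: Ruby 2.7 and later (as far as I know) provide `IO#set_encoding_by_bom`, which checks a binmode stream for a BOM, consumes it, and sets the external encoding. The helper name `read_via_bom_detection` and the sample bytes below are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: writes the given bytes to a temp file, then reads them
# back through a binmode stream after letting Ruby check for a BOM.
# Returns [detected_encoding_or_nil, content].
def read_via_bom_detection(bytes)
  Tempfile.create('bom_detect') do |f|
    f.binmode
    f.write(bytes)
    f.close
    File.open(f.path, 'rb') do |io|
      detected = io.set_encoding_by_bom # consumes the BOM if one is present
      [detected, io.read]
    end
  end
end

with_bom    = read_via_bom_detection("\xEF\xBB\xBF".b + "name,amount\n")
without_bom = read_via_bom_detection("name,amount\n".b)

p with_bom[0]                        # the detected encoding (UTF-8)
p with_bom[1].start_with?("\uFEFF")  # false, the BOM never reaches the String
p without_bom[0]                     # nil, no BOM was found
```

When no BOM is found the stream stays binary, so the caller still has to decide on an encoding for that case.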

BTW: the stdlib CSV library chokes on the BOM.
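
One way the CSV problem can show up, sketched with made-up sample data and a hypothetical helper `with_bom_csv`: the BOM ends up glued to the first header, so lookups by the expected header name fail. Passing `encoding: 'BOM|UTF-8'` through `CSV.read` avoids that here.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: creates a BOM-prefixed CSV file with sample data
# and yields its path to the block.
def with_bom_csv
  Tempfile.create(['bom_csv', '.csv']) do |f|
    f.binmode
    f.write("\xEF\xBB\xBF".b + "Date,Amount\n2018-10-05,12.50\n")
    f.close
    yield f.path
  end
end

naive = with_bom_csv { |path| CSV.read(path, headers: true, encoding: 'UTF-8') }
clean = with_bom_csv { |path| CSV.read(path, headers: true, encoding: 'BOM|UTF-8') }

p naive.headers.first == "\uFEFFDate" # true, the BOM is part of the header
p naive[0]['Date']                    # nil, lookup by the expected name fails
p clean.headers                       # ["Date", "Amount"]
p clean[0]['Date']                    # "2018-10-05"
```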

I'd like to add some code for a workaround:

class String

  # Delete the UTF-8 Byte Order Mark (U+FEFF) from the start of the string.
  # Returns self (even if no BOM was found, contrary to delete_prefix!).
  # NOTE: use with care: better to remove the BOM when reading the file.
  def delete_bom!
    raise 'encoding is not UTF-8' unless encoding == Encoding::UTF_8
    # "\uFEFF" is always a UTF-8 literal, whatever the source file encoding
    delete_prefix!("\uFEFF")
    self
  end

  # Returns a copy of the string with the UTF-8 Byte Order Mark deleted.
  def delete_bom
    dup.delete_bom!
  end

end

~eike


Related issues

Related to Ruby master - Bug #15908: Detecting BOM with non-UTF encoding (Closed)
