Bug #16402
closed
UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"
Description
$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
        1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)
No error should be raised, just as when matching against a string without a BOM:
$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false
Updated by shyouhei (Shyouhei Urabe) about 5 years ago
- Status changed from Open to Feedback
I bet your locale setting is UTF-8? Hence the error message. You have to be explicit then:
File.read("u.txt", mode: "rb:bom|utf-16")
would give you a correct String instance.
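As a rough illustration of the point above, the sketch below reads the same kind of file twice: once tagged as UTF-8 (what a UTF-8 locale does by default) and once with a BOM-aware mode. It is only a sketch under assumptions: the file content (a BOM followed by "hi") is made up for the example, the mode string uses the documented `BOM|UTF-16LE` spelling rather than the one quoted above, and the extra `encode("UTF-8")` step is added here because an ASCII-only regexp such as /\w+/ is not compatible with a UTF-16LE string.

```ruby
require "tempfile"

# Illustrative file: UTF-16LE BOM (FF FE) followed by "hi" in UTF-16LE.
file = Tempfile.new("u")
file.binmode
file.write("\xFF\xFEh\x00i\x00".b)
file.close # flush to disk; the file remains until unlink

# Tagging the raw bytes as UTF-8 yields a string with an invalid encoding,
# which is what makes the regexp match raise ArgumentError.
naive = File.read(file.path, encoding: "UTF-8")
p naive.valid_encoding? # false

# BOM-aware read: the BOM is consumed and the string is tagged UTF-16LE.
utf16 = File.read(file.path, mode: "rb:BOM|UTF-16LE")
p utf16.encoding

# Transcode to UTF-8 before matching with an ASCII regexp.
text = utf16.encode("UTF-8")
p(/\w+/.match?(text)) # true

file.unlink
```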
Updated by PikachuEXE (Pikachu EXE) about 5 years ago
Thanks for your answer
But I actually encounter this when processing text input from a remote data source
And would not be using File.read
text = HTTPClient.new.get(
  data_feed_url,
  follow_redirect: true,
).tap do |response|
  raise "Unexpected response code: #{response.status}" unless response.ok?
end.body
Updated by shyouhei (Shyouhei Urabe) about 5 years ago
- Status changed from Feedback to Third Party's Issue
- Assignee set to nahi (Hiroshi Nakamura)
PikachuEXE (Pikachu Leung) wrote:
Thanks for your answer
But I actually encounter this when processing text input from a remote data source
And would not be using File.read
Well that's... complicated. There is a lot of debate about how to determine the content type of remote network content. Though not a direct answer, issue #2567 may be interesting to read.
Not sure but maybe HTTPClient provides a way to specify encoding. Can you ask the author?
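Until the library question is settled, one possible workaround is to decode the response body by hand. The sketch below is not HTTPClient API: `BOMS` and `decode_with_bom` are hypothetical names introduced here, the body is a hard-coded stand-in for real response bytes, and only UTF-8 and UTF-16 BOMs are checked (UTF-32 is ignored, and a body without a BOM is assumed to be UTF-8).

```ruby
# Hypothetical lookup table: BOM bytes => encoding they announce.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
}.freeze

# Hypothetical helper: strip a leading BOM if present, tag the bytes with
# the matching encoding (or the fallback), and return a valid UTF-8 string.
def decode_with_bom(raw, fallback: Encoding::UTF_8)
  bytes = raw.b # binary copy, so the caller's string is not modified
  bom, encoding = BOMS.find { |prefix, _enc| bytes.start_with?(prefix) }
  bytes = bytes.byteslice(bom.bytesize, bytes.bytesize) if bom
  bytes.force_encoding(encoding || fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
       .scrub # also covers the no-op UTF-8 to UTF-8 case
end

# Stand-in for a response body: UTF-16LE BOM followed by "hi".
body = "\xFF\xFEh\x00i\x00".b
text = decode_with_bom(body)
p text
p(/\w+/.match?(text))
```

Replacing invalid bytes instead of raising is a design choice for this sketch; a stricter caller might prefer to let the encoding error surface.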
Updated by PikachuEXE (Pikachu EXE) about 5 years ago
Submitted a question to httpclient on https://github.com/nahi/httpclient/issues/413