Bug #16402

UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"

Added by PikachuEXE (Pikachu Leung) about 2 months ago. Updated about 2 months ago.

Status:
Third Party's Issue
Priority:
Normal
Target version:
-
ruby -v:
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
[ruby-core:96118]

Description

$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt 
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
    1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)

No error should be raised, just as when matching against the same string without the BOM:

$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false

History

Updated by shyouhei (Shyouhei Urabe) about 2 months ago

  • Status changed from Open to Feedback

I bet your locale setting is UTF-8? Hence the error message. You have to be explicit then: File.read("u.txt", mode: "rb:bom|utf-16") would give you a correct String instance.
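
For illustration, here is a minimal sketch of the explicit-encoding read suggested above. It spells the mode as "rb:BOM|UTF-16LE" rather than "rb:bom|utf-16", on the assumption that the file is little-endian as in the report, and it writes to a temporary directory that is made up for the example. The first read forces UTF-8 so the result does not depend on the locale.

```ruby
require "tmpdir"

# Same four bytes as in the report: a UTF-16LE BOM followed by U+0100.
path = File.join(Dir.mktmpdir, "u.txt")
File.binwrite(path, "\xff\xfe\x00\x01")

# Reading the bytes as UTF-8 (what a UTF-8 locale does by default)
# yields a String whose bytes are not valid UTF-8.
raw = File.read(path, encoding: "UTF-8")
p raw.valid_encoding?

# Reading with an explicit BOM-aware encoding consumes the BOM and
# tags the remaining bytes as UTF-16LE.
text = File.read(path, mode: "rb:BOM|UTF-16LE")
p text.encoding
p text.bytes

# The default /\w+/ is not compatible with UTF-16LE strings, so
# transcode to UTF-8 before matching.
utf8 = text.encode("UTF-8")
p utf8.valid_encoding?
```

Note that the workaround in the description, File.read("u.txt")[2..-1], only drops the two BOM bytes; the remaining bytes are still interpreted as UTF-8 rather than decoded as UTF-16LE.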

Updated by PikachuEXE (Pikachu Leung) about 2 months ago

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
and would not be using File.read.

text = HTTPClient.new.get(
  data_feed_url,
  follow_redirect: true,
).tap do |response|
  raise "Unexpected response code: #{response.status}" unless response.ok?
end.body
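
When the bytes arrive as a String rather than from a file, one possible approach is to inspect the leading bytes yourself. The sketch below is only an illustration: decode_with_bom and BOM_ENCODINGS are hypothetical names, not part of HTTPClient or the Ruby standard library, and falling back to UTF-8 when no BOM is present is an assumption.

```ruby
# Hypothetical helper, not provided by HTTPClient or Ruby itself.
# The UTF-32 entries come first because the UTF-32LE BOM begins with
# the same two bytes as the UTF-16LE BOM.
BOM_ENCODINGS = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE],
].freeze

def decode_with_bom(body, fallback: Encoding::UTF_8)
  bytes = body.b # binary copy, so the caller's string is left untouched
  bom, encoding = BOM_ENCODINGS.find { |prefix, _| bytes.start_with?(prefix) }
  payload = bom ? bytes.byteslice(bom.bytesize..-1) : bytes
  # Tag the payload with the detected (or assumed) encoding, then
  # transcode to UTF-8 so that ordinary regexes can match it.
  payload.force_encoding(encoding || fallback).encode(Encoding::UTF_8)
end

text = decode_with_bom("\xFF\xFEh\x00i\x00")
p text
p(/\w+/.match?(text))
```

If no BOM is found and the bytes are not valid in the fallback encoding, the result is still a broken String, so checking valid_encoding? before matching may be worthwhile.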

Updated by shyouhei (Shyouhei Urabe) about 2 months ago

  • Assignee set to nahi (Hiroshi Nakamura)
  • Status changed from Feedback to Third Party's Issue

PikachuEXE (Pikachu Leung) wrote:

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
and would not be using File.read.

Well, that's... complicated. There is a lot of debate about how to determine the content type of remote network content. Though not a direct answer, issue #2567 may be interesting to read.

Not sure, but maybe HTTPClient provides a way to specify the encoding. Can you ask the author?
