Bug #16402: UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8" - Ruby - Ruby Issue Tracking System

Bug #16402

closed

UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"

Added by PikachuEXE (Pikachu EXE) over 6 years ago. Updated over 6 years ago.

Status:

Third Party's Issue

Assignee:

nahi (Hiroshi Nakamura)

Target version:

ruby -v:

ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]

Backport:

2.5: UNKNOWN, 2.6: UNKNOWN

[ruby-core:96118]

Description

$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt 
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)

No error should be raised, just as when matching against the same string without the BOM:

$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false
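The same behaviour can be reproduced without a file. The following is a minimal sketch, assuming the bytes end up tagged as UTF-8 (which is what File.read does under a UTF-8 locale); the variable names are illustrative only. It shows that the string is simply not valid UTF-8 while the BOM bytes are present, and becomes valid once they are sliced off.

```ruby
# Same four bytes as u.txt, tagged as UTF-8 (assumption: UTF-8 default external encoding).
data = "\xff\xfe\x00\x01".b.force_encoding(Encoding::UTF_8)

# 0xFF and 0xFE can never appear in well-formed UTF-8.
p data.valid_encoding?

# Matching a regexp against a string with a broken encoding raises ArgumentError.
error = begin
  /\w+/.match?(data)
  nil
rescue ArgumentError => e
  e
end
p error&.message

# Dropping the first two characters leaves "\x00\x01", which is plain ASCII.
tail = data[2..-1]
p tail.valid_encoding?
p(/\w+/.match?(tail))
```

Checking valid_encoding? before matching is one way to detect this situation early instead of rescuing the exception.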

Updated by shyouhei (Shyouhei Urabe) over 6 years ago
#1 [ruby-core:96119]

Status changed from Open to Feedback

I bet your locale setting is UTF-8, hence the error message: the file contents are tagged as UTF-8, and the BOM bytes are not valid UTF-8. You have to be explicit about the encoding then. File.read("u.txt", mode: "rb:bom|utf-16") would give you a correct String instance.
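As a rough illustration of reading with an explicit encoding, here is a sketch that uses a variant of the mode string above, spelled "rb:BOM|UTF-16LE". The helper name read_bom_text and the sample file contents are assumptions made for the example, not part of the report. The BOM is consumed on open, the string comes back tagged as UTF-16LE, and it is then transcoded to UTF-8 so that an ordinary regexp can be applied.

```ruby
require "tmpdir"

# Hypothetical helper: read a BOM-prefixed UTF-16 file and return UTF-8 text.
# "b" is needed because UTF-16 is not ASCII-compatible.
def read_bom_text(path)
  File.read(path, mode: "rb:BOM|UTF-16LE").encode(Encoding::UTF_8)
end

utf8 = Dir.mktmpdir do |dir|
  path = File.join(dir, "u.txt")
  # BOM (FF FE) followed by "hi" encoded as UTF-16LE.
  File.binwrite(path, "\xff\xfeh\x00i\x00")
  read_bom_text(path)
end

p utf8.encoding
p(/\w+/.match?(utf8))
```

Transcoding to UTF-8 is a design choice for the example; the UTF-16LE string could also be matched directly against a regexp whose encoding is compatible with it.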

Updated by PikachuEXE (Pikachu EXE) over 6 years ago
#2 [ruby-core:96124]

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
so I would not be using File.read.

text = HTTPClient.new.get(
  data_feed_url,
  follow_redirect: true,
).tap do |response|
  raise "Unexpected response code: #{response.status}" unless response.ok?
end.body

Updated by shyouhei (Shyouhei Urabe) over 6 years ago
#3 [ruby-core:96125]

Status changed from Feedback to Third Party's Issue
Assignee set to nahi (Hiroshi Nakamura)

PikachuEXE (Pikachu Leung) wrote:

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
so I would not be using File.read.

Well, that's... complicated. There is a lot of debate about how to determine the content type of content fetched over the network. Though not a direct answer, issue #2567 may be interesting to read.

I'm not sure, but maybe HTTPClient provides a way to specify the encoding. Can you ask the author?
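Whatever the HTTP library offers, the bytes of a response body can also be decoded by hand. The sketch below is only one possible approach: decode_with_bom and BOMS are hypothetical names, not part of Ruby or HTTPClient. It handles the UTF-8 and UTF-16 BOMs only, and it assumes that a body without a BOM is UTF-8.

```ruby
# Hypothetical table of byte order marks (UTF-32 is deliberately left out).
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
}.freeze

# Hypothetical helper: strip a leading BOM, tag the remaining bytes with the
# encoding it announces, and transcode to UTF-8.
# fallback is an assumption used when no BOM is found.
def decode_with_bom(bytes, fallback: Encoding::UTF_8)
  raw = bytes.b # binary copy, the caller's string is left untouched
  bom, encoding = BOMS.find { |marker, _| raw.start_with?(marker) }
  raw = raw.byteslice(bom.bytesize..-1) if bom
  raw.force_encoding(encoding || fallback).encode(Encoding::UTF_8)
end

# Stand-in for a response body: BOM followed by "hi" in UTF-16LE.
body = "\xFF\xFEh\x00i\x00".b
text = decode_with_bom(body)

p text.encoding
p(/\w+/.match?(text))
```

A body that carries no BOM and is not valid in the fallback encoding still needs separate handling, for example by checking valid_encoding? on the result.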

Updated by PikachuEXE (Pikachu EXE) over 6 years ago
#4 [ruby-core:96147]

Submitted a question to httpclient on https://github.com/nahi/httpclient/issues/413
