Bug #13216

Possible unexpected behaviour reading string starting with a byte order mark

Added by gabrielgiordano (Gabriel Giordano) almost 3 years ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
[ruby-core:79542]

Description

Comparing symbols may behave unexpectedly when the source string starts with a byte order mark. Tested with Ruby 2.4.0.

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes'
239
187
191
105
100

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.bytes'
105
100

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym'
id

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym'
id

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym == :id' 
false

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym == :id'
true

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï

History

Updated by shyouhei (Shyouhei Urabe) almost 3 years ago

  • Description updated (diff)

Hello.

Gabriel Giordano wrote:

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes'
239
187
191
105
100

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.bytes'

105
100

These two are as expected, aren't they?

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym'
id

I think it's the puts method that eats the BOM.

% echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym.to_s.dump'
"\uFEFFid"

This symbol actually includes U+FEFF, which is normally invisible in the middle of a string.
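Because the BOM does not render, printing the string is not a reliable check. A minimal sketch of how it could be detected instead, assuming the input is treated as UTF-8 (the variable names are illustrative, and the byte array stands in for what the echo command sends on stdin):

```ruby
# Same bytes as the echo command: EF BB BF (UTF-8 BOM) followed by "id".
with_bom = [0xEF, 0xBB, 0xBF, 0x69, 0x64].pack("C*").force_encoding(Encoding::UTF_8)
plain    = "id"

puts with_bom.length                 # 3 characters: U+FEFF, "i", "d"
puts plain.length                    # 2
puts with_bom.start_with?("\uFEFF")  # true
p with_bom.codepoints                # [65279, 105, 100]
```

Checking length or codepoints exposes the extra character even when the terminal shows both strings as "id".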

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym'
id

This is OK I believe.

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym == :id'

false

Given the symbol generated by reading stdin does contain U+FEFF, this is natural.
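If the BOM is not wanted in the symbol, one option is to drop it before calling to_sym. The following is only a sketch: strip_bom is a hypothetical helper, not part of Ruby's core API, and the input is assumed to be UTF-8.

```ruby
# Hypothetical helper: remove a single leading U+FEFF, if present.
def strip_bom(str)
  str.sub(/\A\uFEFF/, "")
end

# Same bytes as the echo command: EF BB BF (UTF-8 BOM) followed by "id".
raw = [0xEF, 0xBB, 0xBF, 0x69, 0x64].pack("C*").force_encoding(Encoding::UTF_8)

sym_with_bom    = raw.to_sym
sym_without_bom = strip_bom(raw).to_sym

puts sym_with_bom == :id       # false
puts sym_without_bom == :id    # true
puts sym_with_bom.to_s.dump    # "\uFEFFid"
```

Anchoring the pattern with \A removes only a leading BOM, so a U+FEFF appearing later in the string is left alone.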

$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym == :id'

true

No problem here.

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï

This IS weird. Smells like a bug to me.


So all but the last one are working as expected (at least as far as I can tell). The last one needs more inspection.

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago

Shyouhei Urabe wrote:

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï

This IS weird. Smells like a bug to me.

Not a bug.

pack("U") packs just one codepoint, and U+00EF is LATIN SMALL LETTER I WITH DIAERESIS, which is exactly what is printed.

$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U*")'
ï»¿id
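The difference between the pack directives can be sketched as follows. The variable names are illustrative; the last line shows force_encoding as one way to reinterpret the raw bytes as UTF-8, which is a different operation from pack("U*").

```ruby
bytes = [0xEF, 0xBB, 0xBF, 0x69, 0x64]  # EF BB BF (UTF-8 BOM) followed by "id"

first_only = bytes.pack("U")    # consumes only the first integer: U+00EF
per_byte   = bytes.pack("U*")   # treats every integer as its own code point
as_utf8    = bytes.pack("C*").force_encoding(Encoding::UTF_8)  # raw bytes read as UTF-8

hex = ->(s) { s.codepoints.map { |c| format("U+%04X", c) } }

p hex.call(first_only)  # ["U+00EF"]
p hex.call(per_byte)    # ["U+00EF", "U+00BB", "U+00BF", "U+0069", "U+0064"]
p hex.call(as_utf8)     # ["U+FEFF", "U+0069", "U+0064"]
```

The "U" directive takes integers as Unicode code points, not as bytes of an already encoded string, so the three BOM bytes never recombine into U+FEFF.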

Updated by jeremyevans0 (Jeremy Evans) 4 months ago

  • Status changed from Open to Closed
