Bug #13216
Closed
Possible unexpected behaviour reading string starting with a byte order mark
Description
Maybe the comparison between symbols has an unexpected behaviour. Tested with ruby 2.4.0
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes'
239
187
191
105
100
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.bytes'
105
100
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym'
id
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym'
id
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym == :id'
false
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym == :id'
true
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï
Updated by shyouhei (Shyouhei Urabe) about 7 years ago
- Description updated (diff)
Hello.
Gabriel Giordano wrote:
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes'
239
187
191
105
100
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.bytes'
105
100
These two are as expected, aren't they?
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym'
id
I think it's the puts method that eats the BOM.
% echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym.to_s.dump'
"\uFEFFid"
This symbol actually includes U+FEFF, which is normally invisible in the middle of a string.
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym'
id
This is OK I believe.
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.to_sym == :id'
false
Given the symbol generated by reading stdin does contain U+FEFF, this is natural.
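The comparison above can be reproduced without a shell pipe. This is only a minimal sketch: the variable names are made up for illustration, and stripping a leading U+FEFF with `sub` is one possible workaround, not something proposed in this thread.

```ruby
# "\uFEFFid" has the same bytes as the piped input: EF BB BF 69 64.
with_bom = "\uFEFFid"

puts with_bom.bytes.inspect   # [239, 187, 191, 105, 100]
puts with_bom.length          # 3 (U+FEFF, "i", "d")

# The symbol keeps the invisible U+FEFF, so it is a different symbol from :id.
puts with_bom.to_sym == :id   # false

# Assumed workaround: drop a single leading U+FEFF before converting.
stripped = with_bom.sub(/\A\uFEFF/, "")
puts stripped.to_sym == :id   # true
```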
$ echo -n -e 'id' | ruby -e 'puts STDIN.read.to_sym == :id'
true
No problem here.
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï
This IS weird. Smells like a bug to me.
So all but the last one are working as expected (at least it seems so to me). The last one needs more inspection.
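For the cases above where U+FEFF ends up inside the symbol, Ruby can also skip the BOM while reading if the input is a file opened with the `BOM|UTF-8` encoding specifier. The sketch below assumes the data comes from a file rather than from STDIN as in the report; the temporary file and variable names are illustrative only.

```ruby
require "tempfile"

raw = nil
bom_aware = nil

Tempfile.create("bom-demo") do |f|
  # Write the same five bytes as the report: EF BB BF 69 64.
  f.binmode
  f.write("\xEF\xBB\xBFid".b)
  f.flush

  # Plain UTF-8 read: the BOM stays in the string as U+FEFF.
  raw = File.read(f.path, encoding: "UTF-8")

  # "BOM|UTF-8" tells Ruby to consume a leading BOM, if there is one.
  bom_aware = File.read(f.path, mode: "r:BOM|UTF-8")
end

puts raw.to_sym == :id        # false
puts bom_aware.to_sym == :id  # true
```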
Updated by nobu (Nobuyoshi Nakada) about 7 years ago
Shyouhei Urabe wrote:
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U")'
ï
This IS weird. Smells like a bug to me.
Not a bug.
pack("U") packs just one codepoint, and U+00EF is LATIN SMALL LETTER I WITH DIAERESIS, which is exactly what is printed.
$ echo -n -e '\xEF\xBB\xBFid' | ruby -e 'puts STDIN.read.bytes.pack("U*")'
ï»¿id
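The difference between the two directives can be checked directly on the byte values from the report. This is a small sketch with illustrative variable names; the `pack("C*")` line is an addition not discussed in the thread, shown only to contrast code points with raw bytes.

```ruby
bytes = [0xEF, 0xBB, 0xBF, 0x69, 0x64]  # 239, 187, 191, 105, 100

# "U" consumes only the first integer and encodes it as code point U+00EF.
one = bytes.pack("U")
puts one == "\u00EF"            # true

# "U*" treats every integer as a code point, so 0xEF, 0xBB and 0xBF become
# three separate Latin-1 range characters rather than a single U+FEFF.
all = bytes.pack("U*")
puts all.codepoints.inspect     # [239, 187, 191, 105, 100]
puts all.bytes.length           # 8 (C3 AF C2 BB C2 BF 69 64)

# To rebuild the original string from its bytes, pack them as raw bytes
# ("C*") and then label the result as UTF-8.
original = bytes.pack("C*").force_encoding(Encoding::UTF_8)
puts original == "\uFEFFid"     # true
```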
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
- Status changed from Open to Closed