Bug #7282 (closed)
Invalid UTF-8 from emoji allowed through silently
Description
On my system, where the default encoding is UTF-8, the following should not parse:
ruby-2.0.0 -e 'p "Hello, \x96 world!"'
But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:
system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"
system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"
Nor does character-walking:
system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!
system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!
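To make the character-walking result easier to inspect, here is a minimal sketch that collects the fragments `each_char` yields which are not valid in the string's encoding. `invalid_chars` is a hypothetical helper name, not something from this report, and the sample is built with `force_encoding` so that it does not depend on the script's source encoding.

```ruby
# Hypothetical helper: walk the string character by character and return
# [fragment, index] pairs for every fragment that is not validly encoded.
def invalid_chars(str)
  str.each_char.each_with_index.reject { |ch, _index| ch.valid_encoding? }
end

# Tag the sample as UTF-8 explicitly; \x96 is a lone continuation byte.
sample = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

invalid_chars(sample).each do |ch, index|
  # Print the offending bytes in hex along with the character index.
  puts "index #{index}: bytes #{ch.bytes.map { |b| format('%02X', b) }.join(' ')}"
end
```

Walking the string does not raise, but each yielded fragment can still be checked individually with `valid_encoding?`.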
Nor does []:
system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[7]'
"\x96"
system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[8]'
" "
system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
"\x96"
system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[8]'
" "
But the malformed String does get caught by transcoding to UTF-16:
system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
    from -e:1:in `<main>'
system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
    from -e:1:in `<main>'
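The transcoding behaviour above can be wrapped into a check, sketched here under the assumption that a real conversion (UTF-8 to UTF-16) is what triggers byte validation. `transcodes_cleanly?` is a hypothetical helper name introduced for illustration only.

```ruby
# Hypothetical helper: attempt a real transcode and report whether it succeeds.
# Encoding to a different encoding forces Ruby to decode every byte sequence.
def transcodes_cleanly?(str, target = Encoding::UTF_16)
  str.encode(target)
  true
rescue Encoding::InvalidByteSequenceError
  false
end

# Tag the sample as UTF-8 explicitly so the check does not depend on the
# script's source encoding.
sample = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

puts transcodes_cleanly?(sample)           # malformed input
puts transcodes_cleanly?("Hello, world!")  # well-formed input
```

This is a workaround rather than a fix: it detects the malformed string only after it has already been accepted by the parser.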
Or by doing a simple regexp match:
system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
    from -e:1:in `match'
    from -e:1:in `<main>'
system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
    from -e:1:in `match'
    from -e:1:in `<main>'
And of course I am ignoring the fact that it should never have parsed to begin with.
This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.
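For anyone reproducing this, the string's validity can also be queried directly with `String#valid_encoding?`, which makes the same-encoding `encode` no-op easy to see. This is a minimal sketch; `malformed?` is a hypothetical helper name, and the sample is tagged UTF-8 with `force_encoding` rather than relying on the default source encoding.

```ruby
# Hypothetical helper: true when the bytes are not valid in the tagged encoding.
def malformed?(str)
  !str.valid_encoding?
end

# Tag the sample as UTF-8 explicitly; \x96 is a lone continuation byte.
sample = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

puts sample.encoding.name                # encoding the string is tagged with
puts malformed?(sample)                  # validity of the literal as parsed
puts malformed?(sample.encode("UTF-8"))  # validity after a same-encoding encode
```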