Bug #16842
closed
`inspect` prints the UTF-8 character U+0085 (NEXT LINE) verbatim even though it is not printable
Description
The UTF-8 character U+0085 (NEXT LINE) is not printable, but inspect
prints the character verbatim (within double quotes):
0x85.chr(Encoding::UTF_8).match?(/\p{print}/) # => false
0x85.chr(Encoding::UTF_8).inspect
#=> "\"<U+0085>\""   # <U+0085> stands for the raw, unescaped NEXT LINE character
My understanding is that inspect does not print non-printable characters verbatim:
"\n".match?(/\p{print}/) # => false
"\n".inspect #=> "\"\\n\""
while printable characters are:
"a".match?(/\p{print}/) # => true
"a".inspect # => "\"a\""
I ran the following script, and found that U+0085 is the only character within the range U+0000 to U+FFFF that behaves like this.
def verbatim?(char)
  !char.inspect.start_with?(%r{\"\\[a-z]})
end

def printable?(char)
  char.match?(/\p{print}/)
end

(0x0000..0xffff).each do |i|
  begin
    char = i.chr(Encoding::UTF_8)
  rescue RangeError
    next
  end
  puts '%#x' % i unless verbatim?(char) == printable?(char)
end
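For comparing results across Ruby versions, the same check can be sketched as a method that returns the mismatching code points instead of printing them. The name `inspect_mismatches` is made up for illustration, and pinning `Encoding.default_external` to UTF-8 is an assumption (the output of inspect can vary with that setting):

```ruby
# Hypothetical helper: returns the code points in `range` whose inspect
# output (escaped vs. verbatim) disagrees with /\p{print}/.
def inspect_mismatches(range = 0x0000..0xffff)
  saved = Encoding.default_external
  Encoding.default_external = Encoding::UTF_8 # assumption: UTF-8 locale
  range.select do |i|
    char = begin
      i.chr(Encoding::UTF_8)
    rescue RangeError
      next false # surrogate code points cannot be converted; skip them
    end
    verbatim = !char.inspect.start_with?(%r{\"\\[a-z]})
    verbatim != char.match?(/\p{print}/)
  end
ensure
  # Restore the caller's setting even if something above raised.
  Encoding.default_external = saved
end

puts inspect_mismatches.map { |i| '%#x' % i }.join(', ')
```

On a Ruby that has this bug, the list would be expected to contain only 0x85; on a fixed Ruby it would be expected to be empty.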
Updated by jeremyevans0 (Jeremy Evans) almost 4 years ago
- Status changed from Open to Assigned
- Assignee set to duerst (Martin Dürst)
Behavior here seems to be dependent on the encoding:
$ LC_ALL=C ruby -e "p 0x85.chr(Encoding::UTF_8).inspect.b"
"\"\\u0085\""
$ LC_ALL=en_US.UTF-8 ruby -e "p 0x85.chr(Encoding::UTF_8).inspect.b"
"\"\xC2\x85\""
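The locale effect can also be reproduced inside one process. This is a minimal sketch, assuming that inspect escapes characters that do not fit the default external encoding (which Ruby derives from the locale at startup); `with_default_external` is a hypothetical helper, and U+00E9 is used instead of U+0085 so the result does not depend on whether the fix is present:

```ruby
# Hypothetical helper: run a block with Encoding.default_external
# temporarily replaced, then restore the previous value.
def with_default_external(encoding)
  saved = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = saved
end

e_acute = 0xE9.chr(Encoding::UTF_8) # U+00E9, a printable non-ASCII character

ascii = with_default_external(Encoding::US_ASCII) { e_acute.inspect }
utf8  = with_default_external(Encoding::UTF_8)    { e_acute.inspect }

p ascii.b # escaped form, as with LC_ALL=C
p utf8.b  # raw UTF-8 bytes between the quotes, as with a UTF-8 locale
```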
I've submitted a pull request to fix the behavior, though the implementation is rather crude: https://github.com/ruby/ruby/pull/4229
@duerst (Martin Dürst) Is there a better fix that handles the Unicode properties differently?
Updated by naruse (Yui NARUSE) almost 4 years ago
U+0085 is categorized as Print in Ruby because Oniguruma has historically treated it that way.
https://moriyoshi.hatenablog.com/entry/20090307/1236410006
I'm neutral about the change, but I would like it to include a detailed comment or a link to this ticket.
Updated by jeremyevans (Jeremy Evans) over 2 years ago
- Status changed from Assigned to Closed
Applied in changeset git|49517b3bb436456407e0ee099c7442f3ab5ac53d.
Fix inspect for unicode codepoint 0x85
This is an inelegant hack, by manually checking for this specific
code point in rb_str_inspect. Some testing indicates that this is
the only code point affected.
It's possible a better fix would be inside of lower-level encoding
code, such that rb_enc_isprint would return false and not true for
codepoint 0x85.
Fixes [Bug #16842]