Bug #21528
SyntaxError#message may have broken encoding with multibyte source under Prism
Description
Since the introduction of Prism, when parsing Ruby source code that contains multibyte characters, SyntaxError#message can sometimes have invalid encoding.
Here is a reproducible example:
begin
  RubyVM::InstructionSequence.compile(<<~CODE, nil, nil, 1)
    if a
    # 0000000000000ああああああ
    #
  CODE
rescue SyntaxError => e
  $e = e
  puts e.message # string contains a multibyte character that is cut off mid-byte. \xE3
  # <compiled>:3: syntax errors found
  # 1 | if a
  # > 2 | # 0000000000000あああああ\xE3 ...
  # | ^ expected an `end` to close the conditional clause
  # > 3 | #
  # | ^ unexpected end-of-input, assuming it is closing the parent top level context
  puts e.message.valid_encoding? #=> expected true, but got false
end
This appears to be caused by the truncation step in Prism's error message generation, which does not take multibyte character boundaries into account.
See: The truncation logic around prism_compile.c L10696-L10709
I'm not sure how to correctly fix it due to lack of knowledge about safe byte truncation.
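For illustration only, here is a minimal Ruby sketch of what boundary-aware truncation could look like. The real fix would have to be made in C inside prism_compile.c, and I have not confirmed how that code chooses its cut point, so `truncate_bytes` and the 31-byte limit are hypothetical and chosen only to reproduce the `\xE3` tail shown above.

```ruby
# Hypothetical helper (not part of Prism or CRuby): keep at most max_bytes
# bytes of str without splitting a multibyte character.
def truncate_bytes(str, max_bytes)
  return str if str.bytesize <= max_bytes

  cut = str.byteslice(0, max_bytes)
  # Back off one byte at a time until the prefix is valid again.
  # For valid UTF-8 input this removes at most 3 bytes.
  cut = cut.byteslice(0, cut.bytesize - 1) until cut.valid_encoding? || cut.empty?
  cut
end

line  = "# " + "0" * 13 + "あ" * 6   # 33 bytes: 15 ASCII + 6 * 3
naive = line.byteslice(0, 31)        # cuts the sixth あ after its first byte
safe  = truncate_bytes(line, 31)     # backs off to the previous boundary

puts naive.valid_encoding?
puts safe.valid_encoding?
puts safe.bytesize
```

The same idea in C would be to move the cut position backwards while the byte at that position is a UTF-8 continuation byte (`10xxxxxx`), or to ask the source encoding for character widths if the source is not UTF-8.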
I discovered this issue through irb, which attempts to display source code even when it contains syntax errors. Because irb uses SyntaxError#message, it raised an ArgumentError: invalid byte sequence in UTF-8. See: https://github.com/ruby/irb/blob/f60dfa8549f746f69e9a6d160604a7a4974ffac1/lib/irb/ruby-lex.rb#L255-L256
If this is considered an irb issue, I already have a patch for IRB that handles it.
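As a rough idea of what a consumer-side workaround might look like, the sketch below scrubs the message before any further string processing. This is not the IRB patch mentioned above; `safe_error_message` is a hypothetical helper, and the only thing it relies on is `String#scrub` and `String#valid_encoding?`.

```ruby
# Hypothetical helper (not IRB's actual code): return the error message with
# any invalid byte sequences replaced, so later string operations do not raise.
def safe_error_message(error)
  message = error.message
  message.valid_encoding? ? message : message.scrub("?")
end

# Simulate a message that was truncated in the middle of a character.
broken = ("# " + "あ" * 6).byteslice(0, 18)  # 5 full あ plus one stray byte
err = SyntaxError.new(broken)

puts err.message.valid_encoding?
puts safe_error_message(err).valid_encoding?
```

This only hides the symptom on the caller side. The message would still lose part of a character, so fixing the truncation itself seems preferable.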