Bug #21528
open
SyntaxError#message may have broken encoding with multibyte source under Prism
Description
Since the introduction of Prism, SyntaxError#message can contain an invalid byte sequence when the parsed Ruby source contains multibyte characters.
Here is a reproducible example:
begin
  RubyVM::InstructionSequence.compile(<<~CODE, nil, nil, 1)
    if a
    # 0000000000000ああああああ
    #
  CODE
rescue SyntaxError => e
  $e = e
  puts e.message # the message contains a multibyte character that is cut off mid-character, leaving a dangling \xE3
  # <compiled>:3: syntax errors found
  #   1 | if a
  # > 2 | # 0000000000000あああああ\xE3 ...
  #     | ^ expected an `end` to close the conditional clause
  # > 3 | #
  #     | ^ unexpected end-of-input, assuming it is closing the parent top level context
  puts e.message.valid_encoding? #=> expected true, but got false
end
This appears to be caused by the truncation step in Prism's error message generation, which does not take multibyte character boundaries into account.
See: The truncation logic around prism_compile.c L10696-L10709
I'm not sure how to correctly fix it due to lack of knowledge about safe byte truncation.
I discovered this issue through irb, which attempts to display source code even when it contains syntax errors. Because irb uses SyntaxError#message, it raised an ArgumentError: invalid byte sequence in UTF-8. See: https://github.com/ruby/irb/blob/f60dfa8549f746f69e9a6d160604a7a4974ffac1/lib/irb/ruby-lex.rb#L255-L256
If this is considered an irb issue, I already have a patch for IRB that handles it.
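For illustration only, a caller-side workaround could look roughly like the sketch below. It is not the irb patch mentioned above, and safe_error_message is a hypothetical helper name; it only assumes that String#scrub is acceptable for replacing the dangling bytes before the message is processed further.

```ruby
# Hypothetical helper, not taken from irb: returns the exception message with
# any invalid byte sequences replaced, so later string operations do not raise.
def safe_error_message(error)
  message = error.message
  # Only scrub when needed; String#scrub replaces invalid bytes with the given string.
  message.valid_encoding? ? message : message.scrub("?")
end

# Simulate a message that ends with a dangling lead byte (\xE3), as in the report.
broken = "syntax errors found \xE3".dup.force_encoding(Encoding::UTF_8)
cleaned = safe_error_message(SyntaxError.new(broken))
cleaned.valid_encoding? #=> true
```

This hides the symptom on the consumer side only; the message produced by Prism would still be invalid for every other caller.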
Updated by byroot (Jean Boussier) 19 days ago
> I'm not sure how to correctly fix it due to lack of knowledge about safe byte truncation.
If that helps, I did something similar in ruby/json: https://github.com/ruby/json/blob/3090a63a956c30e6d30d93fc9667deccd5e31327/ext/json/ext/parser/parser.c#L456-L462 / https://github.com/ruby/json/commit/e144793b7226c2df75c414749d6f87ab7fcf4dce
It's not perfect as it doesn't consider grapheme clusters, but at least it ensures the included snippet is valid UTF-8.
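The general idea of truncating on a character boundary can be sketched in Ruby as follows. This is an illustration of the technique, not the code from ruby/json or Prism (both are written in C), and truncate_utf8 is a hypothetical helper name. It assumes the input is valid UTF-8 and, like the approach above, ignores grapheme clusters.

```ruby
# Hypothetical helper: truncate a UTF-8 string to at most max_bytes bytes
# without cutting a multibyte character in half.
def truncate_utf8(str, max_bytes)
  return str if str.bytesize <= max_bytes

  cut = max_bytes
  # UTF-8 continuation bytes look like 0b10xxxxxx. If the first dropped byte is
  # a continuation byte, the cut is inside a character, so back up to its lead byte.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end

# 15 ASCII bytes followed by six 3-byte characters, as in the reported source line.
line = "# 0000000000000" + "\u3042" * 6
line.byteslice(0, 31).valid_encoding?   #=> false (naive cut leaves a dangling \xE3)
truncate_utf8(line, 31).valid_encoding? #=> true  (cut moved back to byte 30)
```

The result may be up to three bytes shorter than the requested limit, which seems an acceptable trade-off for a diagnostic snippet.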
Updated by Earlopain (Earlopain _) 19 days ago
Something like this perhaps https://github.com/ruby/ruby/pull/14094. Also doesn't consider grapheme clusters.
I believe truncation from the left is irrelevant here, since the method is only supposed to be called with valid UTF-8 (guarded at the two relevant method calls of pm_parse_errors_format).
Updated by ko1 (Koichi Sasada) 2 days ago
- Assignee set to prism