Project

General

Profile

Actions

Bug #18294

closed

error when parsing regexp comment

Added by thyresias (Thierry Lambert) 11 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [i386-mingw32]
[ruby-core:105972]

Description

The following code generates the error "too short escaped multibyte character"

_re = /
  foo  # \M-ca
/x

Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?

Updated by duerst (Martin Dürst) 11 months ago

thyresias (Thierry Lambert) wrote:

The following code generates the error "too short escaped multibyte character"

_re = /
  foo  # \M-ca
/x

Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?

I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.

Updated by thyresias (Thierry Lambert) 11 months ago

duerst (Martin Dürst) wrote in #note-1:

I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.

You have the same issue with this code, where it knows from the start this is an extended regexp, so I guess you explanation does not hold:

_re = /(?x)
  foo  # \M-ca
/
ruby

Updated by janosch-x (Janosch Müller) 9 months ago

this affects:

  • all String escapes that can be invalid (\x, \u, \u{...}, \M, \C, \c)
  • only invalid escapes (e.g. \x7F is fine)
  • no Regexp-specific escapes such as \p{...}, \g<...>, \k<...>
  • Regexp literals (SyntaxError) and Regexp::new (RegexpError)
  • end-of-line comments as well as comment groups (these don't require x-mode)
  • all Ruby versions

to give an example that is maybe a bit less edge-casy:

/ C:\\[a-z]{5} # e.g. C:\users /x
# =>                      ^
# => invalid Unicode escape (SyntaxError)

the comment handling in regparse.c could probably be changed fairly easily, it only happens here and here. i could take this on with a few pointers.

edit: i think the RegexpError when using Regexp::new is raised by rb_reg_preprocess returning 0, before the string is even passed to the Onigmo parsing code in regparse.c, so it's not yet known at this point which part of the data is a comment and which isn't.

i'm also wondering if the flags here mean that escape sequences in Regexp literals are actually pre-processed by Ruby's main parser? this would make a fix much more involved.

Updated by matz (Yukihiro Matsumoto) 8 months ago

I admit this is a bug and it should be fixed. But implementation-wise, it's difficult to fix. Considering the (small) impact of the bug, its priority is low.
We will fix it but it could take a long time.

Matz.

Updated by thyresias (Thierry Lambert) 8 months ago

I understand it is not easy to fix, and I sure can live with it.
ありがとうMatzさん ^_^

Updated by jeremyevans0 (Jeremy Evans) 6 months ago

I've submitted a pull request to fix this: https://github.com/ruby/ruby/pull/5721

The basic approach is skip the parse.y checks for regexps, because regexp does the same checks. Modify the regexp code to pass the regexp options to a couple of internal functions, and in the function that handles the unescaping, recognize # and ignore characters until the end of the line. This becomes complicated, because # is allowed as a regular, non-comment character, inside a character class []. So attempt to handle that.

Additionally, I found this issue is not limited to extended regexps, it affects all regexps using (?# comments. So try to handle those comments as well by ignoring content inside them.

I'm not sure the pull request handles all cases, and I'm also not sure it doesn't introduce regressions. It would great if a more knowledgeable committer could review it.

The patch is kind of a hack. A better fix would be to integrate the unescaping code with the regexp parsing code, instead of trying to unescape in a first pass, before parsing in a later pass. This would allow the unescaping to be aware of the regexp parser state, making it simple to avoid unescaping when inside a regexp comment.

Actions #7

Updated by jeremyevans (Jeremy Evans) 4 months ago

  • Status changed from Open to Closed

Applied in changeset git|ec3542229b29ec93062e9d90e877ea29d3c19472.


Ignore invalid escapes in regexp comments

Invalid escapes are handled at multiple levels. The first level
is in parse.y, so skip invalid unicode escape checks for regexps
in parse.y.

Make rb_reg_preprocess and unescape_nonascii accept the regexp
options. In unescape_nonascii, if the regexp is an extended
regexp, when "#" is encountered, ignore all characters until the
end of line or end of regexp.

Unfortunately, in extended regexps, you can use "#" as a non-comment
character inside a character class, so also parse "[" and "]"
specially for extended regexps, and only skip comments if "#" is
not inside a character class. Handle nested character classes as well.

This issue doesn't just affect extended regexps, it also affects
"(#?" comments inside all regexps. So for those comments, scan
until trailing ")" and ignore content inside.

I'm not sure if there are other corner cases not handled. A
better fix would be to redesign the regexp parser so that it
unescaped during parsing instead of before parsing, so you already
know the current parsing state.

Fixes [Bug #18294]

Co-authored-by: Nobuyoshi Nakada

Actions

Also available in: Atom PDF