Bug #18294

error when parsing regexp comment

Added by thyresias (Thierry Lambert) 3 months ago. Updated 29 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [i386-mingw32]
[ruby-core:105972]

Description

The following code generates the error "too short escaped multibyte character"

_re = /
  foo  # \M-ca
/x

Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?
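A minimal sketch of the report above, using Regexp.new with plain strings so that the file itself always parses. `try_compile` is a hypothetical helper, not part of Ruby. The result for the original pattern may differ between Ruby versions, so no output is claimed for it; the two variants follow the workarounds described above (removing or doubling the backslash).

```ruby
# Hypothetical helper: compile an extended-mode pattern and report the outcome.
# Returns :ok on success, or the RegexpError message on failure.
def try_compile(source, options = Regexp::EXTENDED)
  Regexp.new(source, options)
  :ok
rescue RegexpError => e
  e.message
end

original = 'foo  # \M-ca'     # pattern text: foo  # \M-ca
no_slash = 'foo  # M-ca'      # backslash removed
doubled  = 'foo  # \\\\M-ca'  # pattern text: foo  # \\M-ca

puts "original: #{try_compile(original)}"
puts "no_slash: #{try_compile(no_slash)}"
puts "doubled:  #{try_compile(doubled)}"
```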

Updated by duerst (Martin Dürst) 2 months ago

thyresias (Thierry Lambert) wrote:

The following code generates the error "too short escaped multibyte character"

_re = /
  foo  # \M-ca
/x

Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?

I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.

Updated by thyresias (Thierry Lambert) 2 months ago

duerst (Martin Dürst) wrote in #note-1:

I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.

You have the same issue with this code, where it knows from the start that this is an extended regexp, so I guess your explanation does not hold:

_re = /(?x)
  foo  # \M-ca
/

Updated by janosch-x (Janosch Müller) 29 days ago

this affects:

  • all String escapes that can be invalid (\x, \u, \u{...}, \M, \C, \c)
  • only invalid escapes (e.g. \x7F is fine)
  • no Regexp-specific escapes such as \p{...}, \g<...>, \k<...>
  • Regexp literals (SyntaxError) and Regexp::new (RegexpError)
  • end-of-line comments as well as comment groups (these don't require x-mode)
  • all Ruby versions
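The list above can be probed with a small script. This is only a sketch: `comment_compiles?` is a hypothetical helper, the sample escapes are picked for illustration, and the results for the invalid ones depend on the Ruby version in use, so none are stated here.

```ruby
# Hypothetical probe: put `comment` into an end-of-line comment of an
# extended-mode pattern and report whether Regexp.new accepts it.
def comment_compiles?(comment)
  Regexp.new("foo # #{comment}", Regexp::EXTENDED)
  true
rescue RegexpError
  false
end

# Single-quoted strings keep the backslashes as written.
samples = [
  '\x',        # incomplete hex escape
  '\u',        # incomplete Unicode escape
  '\M-ca',     # the escape from the original report
  '\x7F',      # valid String escape
  '\p{Alpha}'  # Regexp-specific escape
]

samples.each do |comment|
  puts format('%-10s compiles: %s', comment, comment_compiles?(comment))
end
```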

to give an example that is maybe a bit less edge-casy:

/ C:\\[a-z]{5} # e.g. C:\users /x
# =>                      ^
# => invalid Unicode escape (SyntaxError)
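As a workaround sketch for the example above, doubling the backslash inside the comment (as noted in the description) keeps the literal parseable. The constant name `USERS_DIR` is made up for illustration.

```ruby
# The backslash in the comment is doubled, so it is no longer read as
# the start of a \u escape.
USERS_DIR = /
  C:\\[a-z]{5}   # e.g. C:\\users
/x

puts USERS_DIR.match?('C:\users')
```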

the comment handling in regparse.c could probably be changed fairly easily, it only happens here and here. i could take this on with a few pointers.

edit: i think the RegexpError when using Regexp::new is raised by rb_reg_preprocess returning 0, before the string is even passed to the Onigmo parsing code in regparse.c, so it's not yet known at this point which part of the data is a comment and which isn't.

i'm also wondering if the flags here mean that escape sequences in Regexp literals are actually pre-processed by Ruby's main parser? this would make a fix much more involved.
