Feature #18838
closedAvoid swallowing regexp escapes in the lexer
Description
According to Regexp#source
documentation:
Returns the original string of the pattern.
/ab+c/ix.source #=> "ab+c"
Note that escape sequences are retained as is.
/\x20\+/.source #=> "\\x20\\+"
It works well but backslash (/) is processed in different way by different regexp literal forms.
Examples:
/\//.source # => "/"
%r/\//.source # => "/"
%r{\/}.source # => "\\/"
Expected result - in all the cases result is the same.
Moreover as documentation states - escape sequences are retained as is
. So I would say that only %r{}
works properly.
The issue was reported here https://github.com/oracle/truffleruby/issues/2569.
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
- Tracker changed from Bug to Feature
- Subject changed from Regexp#source behaves inconsistently with / to Avoid swallowing regexp escapes in the lexer
- ruby -v deleted (
3.0.3) - Backport deleted (
2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN)
In the /\//
and %r/\//
cases, the regexp source is transformed from \/
to /
in the lexer (tokadd_string
) before it even hits the parser, let alone the regexp engine. From a regexp perspective, /\//
and %r/\//
are treated as Regexp.new('/')
, and %r{\/}
as Regexp.new('\/')
.
Regexp#source should provide the source of the regexp, not necessarily the source as given in the source code. The statement escape sequences are retained as is
refers to Regexp escape sequences, and the \
in /\//
and %r/\//
is not a regexp escape sequence, but a lexer escape sequence (similar to %/\//
or %s/\//
). This issue is not related to /
specifically, it occurs for most terminators: %r,\,,.source # => ","
Note that in cases where escaping would actually change regexp behavior, the lexer doesn't swallow the escape character: %r$\$$.source # => "\\$"
It's fairly simple to remove this behavior from the lexer just by deleting code:
diff --git a/parse.y b/parse.y
index 167f064b31..523d5a85b3 100644
--- a/parse.y
+++ b/parse.y
@@ -7130,19 +7130,6 @@ tokadd_mbchar(struct parser_params *p, int c)
return c;
}
-static inline int
-simple_re_meta(int c)
-{
- switch (c) {
- case '$': case '*': case '+': case '.':
- case '?': case '^': case '|':
- case ')': case ']': case '}': case '>':
- return TRUE;
- default:
- return FALSE;
- }
-}
-
static int
parser_update_heredoc_indent(struct parser_params *p, int c)
{
@@ -7277,10 +7264,6 @@ tokadd_string(struct parser_params *p,
}
}
- if (c == term && !simple_re_meta(c)) {
- tokadd(p, c);
- continue;
- }
pushback(p, c);
if ((c = tokadd_escape(p, enc)) < 0)
return -1;
However, it breaks 3 tests in test_regexp.rb
: test_source_unescaped
, test_source
, and test_equal
. It also breaks a couple of specs, Literal Regexps supports escaping characters when used as a terminator
and Regexp#source will remove escape characters
.
Since the current behavior is clearly by design in the tests and specs, I can safely conclude this is not a bug, or at most, it is a minor documentation bug (I'll update the documentation). Switching to feature request. I'll add this the list of tickets to review at the next developer meeting, since while I'm not in favor of making the change, I do think this issue warrants discussion.
Updated by jeremyevans (Jeremy Evans) over 2 years ago
- Status changed from Open to Closed
Applied in changeset git|596f4b0d3ab8bc2559a52396d3a29ce62e6a3694.
Document that Regexp#source does not retain lexer escapes
Related to [Feature #18838]
Updated by mame (Yusuke Endoh) over 2 years ago
Agreed with Jeremy. There is the same "inconsistency" in a string literal, but this is clearly by design.
puts 'foo\'bar' #=> foo'bar
puts %q'foo\'bar' #=> foo'bar
puts %q{foo\'bar} #=> foo\'bar
Updated by andrykonchin (Andrew Konchin) over 2 years ago
Thank you for clarification.