Project

General

Profile

Actions

Feature #18838

closed

Avoid swallowing regexp escapes in the lexer

Added by andrykonchin (Andrew Konchin) almost 2 years ago. Updated almost 2 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:<unknown>]

Description

According to Regexp#source documentation:

Returns the original string of the pattern.
/ab+c/ix.source #=> "ab+c"

Note that escape sequences are retained as is.
/\x20\+/.source  #=> "\\x20\\+"

It works well but backslash (/) is processed in different way by different regexp literal forms.

Examples:

/\//.source # => "/"
%r/\//.source # => "/"
%r{\/}.source # => "\\/"

Expected result - in all the cases result is the same.

Moreover as documentation states - escape sequences are retained as is. So I would say that only %r{} works properly.

The issue was reported here https://github.com/oracle/truffleruby/issues/2569.

Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from Regexp#source behaves inconsistently with / to Avoid swallowing regexp escapes in the lexer
  • ruby -v deleted (3.0.3)
  • Backport deleted (2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN)

In the /\// and %r/\// cases, the regexp source is transformed from \/ to / in the lexer (tokadd_string) before it even hits the parser, let alone the regexp engine. From a regexp perspective, /\// and %r/\// are treated as Regexp.new('/'), and %r{\/} as Regexp.new('\/').

Regexp#source should provide the source of the regexp, not necessarily the source as given in the source code. The statement escape sequences are retained as is refers to Regexp escape sequences, and the \ in /\// and %r/\// is not a regexp escape sequence, but a lexer escape sequence (similar to %/\// or %s/\//). This issue is not related to / specifically, it occurs for most terminators: %r,\,,.source # => ","

Note that in cases where escaping would actually change regexp behavior, the lexer doesn't swallow the escape character: %r$\$$.source # => "\\$"

It's fairly simple to remove this behavior from the lexer just by deleting code:

diff --git a/parse.y b/parse.y
index 167f064b31..523d5a85b3 100644
--- a/parse.y
+++ b/parse.y
@@ -7130,19 +7130,6 @@ tokadd_mbchar(struct parser_params *p, int c)
     return c;
 }

-static inline int
-simple_re_meta(int c)
-{
-    switch (c) {
-      case '$': case '*': case '+': case '.':
-      case '?': case '^': case '|':
-      case ')': case ']': case '}': case '>':
-       return TRUE;
-      default:
-       return FALSE;
-    }
-}
-
 static int
 parser_update_heredoc_indent(struct parser_params *p, int c)
 {
@@ -7277,10 +7264,6 @@ tokadd_string(struct parser_params *p,
                       }
                     }

-                   if (c == term && !simple_re_meta(c)) {
-                       tokadd(p, c);
-                       continue;
-                   }
                    pushback(p, c);
                    if ((c = tokadd_escape(p, enc)) < 0)
                        return -1;

However, it breaks 3 tests in test_regexp.rb: test_source_unescaped, test_source, and test_equal. It also breaks a couple of specs, Literal Regexps supports escaping characters when used as a terminator and Regexp#source will remove escape characters.

Since the current behavior is clearly by design in the tests and specs, I can safely conclude this is not a bug, or at most, it is a minor documentation bug (I'll update the documentation). Switching to feature request. I'll add this the list of tickets to review at the next developer meeting, since while I'm not in favor of making the change, I do think this issue warrants discussion.

Actions #2

Updated by jeremyevans (Jeremy Evans) almost 2 years ago

  • Status changed from Open to Closed

Applied in changeset git|596f4b0d3ab8bc2559a52396d3a29ce62e6a3694.


Document that Regexp#source does not retain lexer escapes

Related to [Feature #18838]

Updated by mame (Yusuke Endoh) almost 2 years ago

Agreed with Jeremy. There is the same "inconsistency" in a string literal, but this is clearly by design.

puts   'foo\'bar'  #=> foo'bar
puts %q'foo\'bar'  #=> foo'bar
puts %q{foo\'bar}  #=> foo\'bar

Updated by andrykonchin (Andrew Konchin) almost 2 years ago

Thank you for clarification.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0