Bug #14367
closed
Wrong interpretation of backslash C in regexp literals
Description
The following Ruby code returns nil.
% LC_ALL=C ruby -ve 'p(/\c\xFF/ =~ "\c\xFF")'
ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
nil
Is this intentional?
Updated by Hanmac (Hans Mackowiak) about 7 years ago
the problem is this:
/\c\xFF/.source == "\\c\\xFF"
which is already escaped
you might want this:
/#{"\c\xFF"}/ == /ƒ/
or use this:
Regexp.compile("\c\xFF")
PS: is it correct that I get this?
"\c\xFF" == "\x9F" #=> true
EDIT: this works
/\x9F/ =~ "\c\xFF" #=> 0
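The equality above is consistent with a control escape being computed by masking the operand byte with 0x9F, so that `\xFF` becomes `\x9F`. A minimal sketch of that arithmetic, where `control` is a hypothetical helper introduced only for illustration and not part of Ruby:

```ruby
# Hypothetical helper (not a Ruby API): models a control escape as a
# bitwise AND with 0x9F, which clears bits 5 and 6 of the operand byte.
def control(byte)
  byte & 0x9F
end

masked_ff = control(0xFF)   # 0x9F
masked_7f = control(0x7F)   # 0x1F

# Compare the model against what the string literal actually produces.
literal_bytes = "\c\xFF".bytes

puts format("control(0xFF) = 0x%02X", masked_ff)
puts format("control(0x7F) = 0x%02X", masked_7f)
puts "string literal agrees: #{literal_bytes == [masked_ff]}"
```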
Updated by shyouhei (Shyouhei Urabe) about 7 years ago
Hanmac (Hans Mackowiak) wrote:
the problem is this:
/\c\xFF/.source == "\\c\\xFF"
No, I believe that isn't the problem. For instance /\c\x7F/ works.
% LC_ALL=C ruby -ve 'p(/\c\x7F/ =~ "\c\x7F")'
ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]
0
EDIT: this works
/\x9F/ =~ "\c\xFF" #=> 0
Yeah, that's why I titled this issue a "wrong interpretation of backslash C in regexp literals". This is about /...\c.../.
Updated by shyouhei (Shyouhei Urabe) almost 5 years ago
Can I have any answer for my question ("Is this intentional?")?
Updated by naruse (Yui NARUSE) over 4 years ago
It looks like \c\xff is handled inconsistently between regexps and Ruby strings:
% LC_ALL=C ruby -ve 'p (/\c\xff/ =~ "\x1f")'
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
0
Updated by jeremyevans0 (Jeremy Evans) over 3 years ago
The behavior appears not to be intentional. This is a bug related to the fact that Ruby uses a recursive algorithm for strings (read_escape) but not for regexps (tokadd_escape). I've submitted a pull request to have control/meta handling for regexps use the same recursive algorithm used for strings, which fixes this issue: https://github.com/ruby/ruby/pull/4495
Updated by jeremyevans (Jeremy Evans) over 3 years ago
- Status changed from Open to Closed
Applied in changeset git|11ae581a4a7f5d5f5ec6378872eab8f25381b1b9.
Fix handling of control/meta escapes in literal regexps
Ruby uses a recursive algorithm for handling control/meta escapes
in strings (read_escape). However, the equivalent code for regexps
(tokadd_escape) did not use a recursive algorithm. Due to this,
handling of control/meta escapes in regexp did not have the same
behavior as in strings, leading to behavior such as the following
returning nil:
/\c\xFF/ =~ "\c\xFF"
Switch the code for handling \c, \C and \M in literal regexps to
use the same code as for strings (read_escape), to keep behavior
consistent between the two.
Fixes [Bug #14367]
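To illustrate what "recursive" means here, the following is a toy model written in Ruby. It is not the parser's actual C code: the method names `read_escape_model`, `read_operand_model`, and `escape_byte` are hypothetical, and only `\xHH`, `\c`, and `\M-` with ASCII operands are handled. The point of the sketch is that the operand of `\c` or `\M-` may itself be an escape, which is resolved first and then masked.

```ruby
require "strscan"

# Toy model only; names are hypothetical and this is not Ruby's parser.
# Reads one escape (scanner positioned just after the backslash) and
# returns the resulting byte value.
def read_escape_model(s)
  if s.scan(/x(\h{1,2})/)
    s[1].hex                          # \xHH: literal byte value
  elsif s.scan(/c/)
    return 0x7F if s.scan(/\?/)       # \c? is DEL
    read_operand_model(s) & 0x9F      # control: clear bits 5 and 6
  elsif s.scan(/M-/)
    read_operand_model(s) | 0x80      # meta: set the high bit
  else
    raise ArgumentError, "escape not covered by this sketch"
  end
end

# The operand is either a plain character or, recursively, another escape.
def read_operand_model(s)
  return read_escape_model(s) if s.scan(/\\/)
  ch = s.getch
  raise ArgumentError, "unexpected end of input" if ch.nil?
  ch.ord
end

# Convenience wrapper: takes source text such as '\c\xFF'.
def escape_byte(src)
  s = StringScanner.new(src)
  raise ArgumentError, "expected a backslash" unless s.scan(/\\/)
  read_escape_model(s)
end

puts format("0x%02X", escape_byte('\c\xFF'))   # 0x9F
puts format("0x%02X", escape_byte('\M-\cA'))   # 0x81
```

A non-recursive reader would instead pass `\c` and `\xFF` along as two separate escapes, which is one way the regexp and string results can diverge.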
Updated by nobu (Nobuyoshi Nakada) over 3 years ago
I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something in encodings other than US-ASCII.
$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context
Updated by jeremyevans0 (Jeremy Evans) over 3 years ago
nobu (Nobuyoshi Nakada) wrote in #note-7:
I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something in encodings other than US-ASCII.
$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context
The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?
Updated by jeremyevans0 (Jeremy Evans) over 3 years ago
jeremyevans0 (Jeremy Evans) wrote in #note-8:
nobu (Nobuyoshi Nakada) wrote in #note-7:
I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something in encodings other than US-ASCII.
$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context
The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?
My previous statement was incorrect. The reason it worked before is that \c behavior in regexps was wrong and did not result in the 8-bit character it should have. If you used a character resulting in a high bit, you did get the same error:
$ LANG=en_US.UTF-8 ruby -vce '/\M-a/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: too short escaped multibyte character: /\M-a/
-e:1: warning: possibly useless use of a literal in void context
You would also get an error if you created a regexp using a string instead of using a literal regexp:
$ LANG=en_US.UTF-8 ruby -ve '/#{s="\c\xff"}/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: warning: possibly useless use of a literal in void context
-e:1:in `<main>': invalid multibyte character (ArgumentError)
So I don't think anything is broken on UTF-8 (or other encodings). Before, it should have raised an error and it didn't because the incorrect algorithm resulted in the wrong character. Now it raises an error as it should.
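The same distinction can be sketched at runtime with Regexp.new: a lone 0x9F byte is not a valid UTF-8 character, while it is a valid character in a binary string. This is only an illustration under assumptions: the exact exception class may differ between Ruby versions, so the sketch rescues both RegexpError and ArgumentError.

```ruby
# Building a UTF-8 regexp from an invalid byte sequence is expected to fail.
# Which exception class is raised is an assumption; both are rescued here.
utf8_result =
  begin
    Regexp.new("\x9F")
    :compiled
  rescue RegexpError, ArgumentError
    :rejected
  end

# In a binary (ASCII-8BIT) string every byte is a valid character,
# so the same byte compiles and matches.
binary_regexp = Regexp.new("\x9F".b)
match_index = binary_regexp =~ "\x9F".b

p utf8_result
p match_index
```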
Updated by mame (Yusuke Endoh) about 3 years ago
- Related to Bug #18449: Bug in 3.1 regexp literals with \c added