Bug #14367
Closed
Wrong interpretation of backslash C in regexp literals
Added by shyouhei (Shyouhei Urabe) over 8 years ago. Updated about 5 years ago.
Description
Updated by Hanmac (Hans Mackowiak) over 8 years ago
#1
[ruby-core:84904]
Updated by shyouhei (Shyouhei Urabe) over 8 years ago
#2
[ruby-core:84905]
Hanmac (Hans Mackowiak) wrote:
the problem is this:
No, I believe that isn't the problem. For instance /\c\x7F/ works.
% LC_ALL=C ruby -ve 'p(/\c\x7F/ =~ "\c\x7F")'
ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]
0
EDIT: this works
Yeah, that's why I titled this issue a "wrong interpretation of backslash C in regexp literals". This is about /...\c.../.
Updated by shyouhei (Shyouhei Urabe) about 6 years ago
#3
[ruby-core:97994]
Can I have any answer for my question ("Is this intentional?")?
Updated by naruse (Yui NARUSE) about 6 years ago
#4
[ruby-core:98181]
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
#5
[ruby-core:103807]
The behavior appears not to be intentional. This is a bug related to the fact that Ruby uses a recursive algorithm for strings (read_escape) but not for regexps (tokadd_escape). I've submitted a pull request to have control/meta handling for regexps use the same recursive algorithm used for strings, which fixes this issue: https://github.com/ruby/ruby/pull/4495
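To illustrate why recursion matters here, the following is a minimal Ruby sketch of a recursive escape decoder. It is not the parser's C code: the helper names (decode_escape, decode_control, decode_operand) are hypothetical, and the masks (0x9f for control, 0x80 for meta, 0x7f for \c?) are my reading of how string escapes behave, so treat them as assumptions. The point is that the operand of \c, \C- or \M- may itself be an escape, which has to be decoded first before the mask is applied.

```ruby
# Hypothetical helpers for illustration only; not Ruby's parser code.
# Each method returns [byte_value, next_position].
# `src` is the text that follows a leading backslash.
def decode_escape(src, pos = 0)
  raise ArgumentError, "unterminated escape" if src[pos].nil?

  case src[pos]
  when "x"                                  # \xHH: one or two hex digits
    hex = src[pos + 1, 2][/\A\h{1,2}/] or raise ArgumentError, "invalid hex escape"
    [hex.to_i(16), pos + 1 + hex.length]
  when "c"                                  # \cX
    decode_control(src, pos + 1)
  when "C"                                  # \C-X
    raise ArgumentError, "expected '-' after \\C" unless src[pos + 1] == "-"
    decode_control(src, pos + 2)
  when "M"                                  # \M-X
    raise ArgumentError, "expected '-' after \\M" unless src[pos + 1] == "-"
    byte, nxt = decode_operand(src, pos + 2)
    [byte | 0x80, nxt]                      # meta sets the high bit (assumed mask)
  else                                      # any other character stands for itself
    [src[pos].ord & 0xff, pos + 1]
  end
end

def decode_control(src, pos)
  return [0x7f, pos + 1] if src[pos] == "?" # \c? is DEL
  byte, nxt = decode_operand(src, pos)
  [byte & 0x9f, nxt]                        # control mask (assumed)
end

# The recursive step: an operand that starts with a backslash is itself
# an escape and is decoded before the caller applies its mask.
def decode_operand(src, pos)
  raise ArgumentError, "unterminated escape" if src[pos].nil?
  return decode_escape(src, pos + 1) if src[pos] == "\\"
  [src[pos].ord & 0xff, pos + 1]
end

# Single-quoted strings keep the backslashes literal.
printf("0x%02X\n", decode_escape('c\xFF').first)   # 0x9F
printf("0x%02X\n", decode_escape('c\x7F').first)   # 0x1F
printf("0x%02X\n", decode_escape('M-\C-a').first)  # 0x81
```

A non-recursive reader that only copies the characters after \c into the token buffer never computes 0xFF & 0x9f, which is consistent with the mismatch between regexp and string literals described above.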
Updated by jeremyevans (Jeremy Evans) about 5 years ago
#6
- Status changed from Open to Closed
Applied in changeset git|11ae581a4a7f5d5f5ec6378872eab8f25381b1b9.
Fix handling of control/meta escapes in literal regexps
Ruby uses a recursive algorithm for handling control/meta escapes
in strings (read_escape). However, the equivalent code for regexps
(tokadd_escape) did not use a recursive algorithm. Due to this,
handling of control/meta escapes in regexps did not have the same
behavior as in strings, leading to behavior such as the following
returning nil:
Switch the code for handling \c, \C and \M in literal regexps to
use the same code as for strings (read_escape), to keep behavior
consistent between the two.
Fixes [Bug #14367]
Updated by nobu (Nobuyoshi Nakada) about 5 years ago
#7
[ruby-core:103814]
Agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
#8
[ruby-core:103815]
nobu (Nobuyoshi Nakada) wrote in #note-7:
Agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.
The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
#9
[ruby-core:103836]
jeremyevans0 (Jeremy Evans) wrote in #note-8:
nobu (Nobuyoshi Nakada) wrote in #note-7:
Agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something on encodings other than US-ASCII.
The previous behavior also ended up with a regexp which matches an 8-bit character, so maybe Ruby should have given the same error before? Alternatively, I can revert if that is better?
My previous statement was incorrect. The reason it worked before is that \c behavior in regexps was wrong and did not result in the 8-bit character it should have. If you used a character resulting in a high bit, you did get the same error:
$ LANG=en_US.UTF-8 ruby -vce '/\M-a/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: too short escaped multibyte character: /\M-a/
-e:1: warning: possibly useless use of a literal in void context
You would also get an error if you created a regexp using a string instead of using a literal regexp:
$ LANG=en_US.UTF-8 ruby -ve '/#{s="\c\xff"}/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: warning: possibly useless use of a literal in void context
-e:1:in `<main>': invalid multibyte character (ArgumentError)
So I don't think anything is broken on UTF-8 (or other encodings). Before, it should have raised an error and it didn't because the incorrect algorithm resulted in the wrong character. Now it raises an error as it should.
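As a rough check of the reasoning above, this small Ruby sketch computes the bytes that \c\xff and \M-a should evaluate to and asks whether each byte is valid on its own in UTF-8. The masks (& 0x9f for control, | 0x80 for meta) are assumptions about the escape semantics, and the variable names are only illustrative. It does not build any regexp, so it says nothing about the exact error a given Ruby version reports.

```ruby
# Assumed masks: control = byte & 0x9f, meta = byte | 0x80.
control_ff = 0xff & 0x9f      # byte expected for \c\xff
meta_a     = "a".ord | 0x80   # byte expected for \M-a

report = [control_ff, meta_a].map do |byte|
  binary = [byte].pack("C")                                     # ASCII-8BIT string
  utf8   = [byte].pack("C").force_encoding(Encoding::UTF_8)     # same byte, tagged UTF-8
  [byte, utf8.valid_encoding?, binary.valid_encoding?]
end

report.each do |byte, utf8_ok, binary_ok|
  printf("0x%02X utf8_valid=%s binary_valid=%s\n", byte, utf8_ok, binary_ok)
end
# 0x9F utf8_valid=false binary_valid=true
# 0xE1 utf8_valid=false binary_valid=true
```

Both bytes have the high bit set and neither is a complete UTF-8 sequence by itself, which fits the "too short escaped multibyte character" and "invalid multibyte character" messages quoted above.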
Updated by mame (Yusuke Endoh) over 4 years ago
#10
- Related to Bug #18449: Bug in 3.1 regexp literals with \c added