Bug #14367

closed

Wrong interpretation of backslash C in regexp literals

Added by shyouhei (Shyouhei Urabe) over 3 years ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
[ruby-core:84900]
Tags:

Description

The following Ruby code returns nil.

% LC_ALL=C ruby -ve 'p(/\c\xFF/ =~ "\c\xFF")'
ruby 2.6.0dev (2018-01-16 trunk 61875) [x86_64-darwin15]
nil

Is this intentional?

Updated by Hanmac (Hans Mackowiak) over 3 years ago

the problem is this:

/\c\xFF/.source == "\\c\\xFF"

which is already escaped

you might want this:

/#{"\c\xFF"}/ == /ƒ/

or use this:

Regexp.compile("\c\xFF")

PS: Is it correct that I get this?

"\c\xFF" ==  "\x9F" #=> true

EDIT: this works

/\x9F/ =~ "\c\xFF" #=> 0
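The `"\c\xFF" == "\x9F"` result above is what you would expect if a string literal computes `\cX` by masking the byte value of `X` with `0x9F`, with `\c?` special-cased as DEL. A minimal sketch of that arithmetic follows; `control_byte` is a hypothetical helper written for illustration, not part of Ruby or its parser.

```ruby
# Simplified model of how a string literal computes \cX.
# This is an assumption for illustration, not Ruby's actual parser code.
def control_byte(byte)
  return 0x7F if byte == "?".ord   # "\c?" is DEL
  byte & 0x9F                      # clear bits 5 and 6, keep the rest
end

# Compare the model with what string literals actually produce.
p "\c\xFF".bytes == [control_byte(0xFF)]
p "\c\x7F".bytes == [control_byte(0x7F)]
p "\ca".bytes    == [control_byte("a".ord)]
```

Because the mask keeps bit 7, `\c` applied to a byte above `0x7F` still yields a non-ASCII byte, which is why `\c\xFF` behaves differently from `\c\x7F` once encodings come into play.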

Updated by shyouhei (Shyouhei Urabe) over 3 years ago

Hanmac (Hans Mackowiak) wrote:

the problem is this:

/\c\xFF/.source == "\\c\\xFF"

No, I believe that isn't the problem. For instance /\c\x7F/ works.

% LC_ALL=C ruby -ve 'p(/\c\x7F/ =~ "\c\x7F")'
ruby 2.0.0p648 (2015-12-16 revision 53162) [universal.x86_64-darwin15]
0

EDIT: this works

/\x9F/ =~ "\c\xFF" #=> 0

Yeah, that's why I titled this issue a "wrong interpretation of backslash C in regexp literals". This is about /...\c.../.

Updated by shyouhei (Shyouhei Urabe) about 1 year ago

Can I get an answer to my question ("Is this intentional?")?

Updated by naruse (Yui NARUSE) about 1 year ago

It looks like an inconsistency between how regexp literals and Ruby string literals handle \c\xff:

%  LC_ALL=C ruby -ve 'p (/\c\xff/ =~ "\x1f")'
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
0

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

The behavior appears not to be intentional. This is a bug related to the fact that Ruby uses a recursive algorithm for strings (read_escape) but not for regexps (tokadd_escape). I've submitted a pull request to have control/meta handling for regexps use the same recursive algorithm used for strings, which fixes this issue: https://github.com/ruby/ruby/pull/4495
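To make "recursive algorithm" concrete, here is a minimal sketch of an escape reader in which `\c`, `\C-` and `\M-` first read their operand (which may itself be an escape) and only then apply their bit operation. It is a simplified model, not the `read_escape` code in Ruby's parser: the method names are hypothetical, and only `\xHH`, `\cX`, `\C-X` and `\M-X` are covered.

```ruby
require "strscan"

# Hypothetical, simplified model of a recursive escape reader.
# It is not Ruby's read_escape; names and structure are illustrative.

# Reads one character, which may itself be a backslash escape.
def read_char_sketch(scanner)
  raise ArgumentError, "unexpected end of input" if scanner.eos?
  scanner.skip(/\\/) ? read_escape_sketch(scanner) : scanner.getch.ord
end

# Reads the part of an escape that follows the backslash.
def read_escape_sketch(scanner)
  if scanner.scan(/x([0-9A-Fa-f]{1,2})/)
    scanner[1].hex
  elsif scanner.skip(/c|C-/)
    # "\c?" is DEL; otherwise recurse, then mask to a control byte.
    scanner.skip(/\?/) ? 0x7F : read_char_sketch(scanner) & 0x9F
  elsif scanner.skip(/M-/)
    # Recurse first, then set the high (meta) bit.
    read_char_sketch(scanner) | 0x80
  else
    raise ArgumentError, "escape not covered by this sketch"
  end
end

def decode_escape(source)
  read_char_sketch(StringScanner.new(source))
end

# Single quotes keep the backslashes literal, so the sketch sees raw source text.
p decode_escape('\c\xFF')  == "\c\xFF".bytes.first
p decode_escape('\M-\C-a') == "\M-\C-a".bytes.first
```

A non-recursive reader would treat `\c` and the following `\xFF` as two separate tokens, so the mask would never be applied to the decoded value of the inner escape.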

Updated by jeremyevans (Jeremy Evans) about 1 month ago

  • Status changed from Open to Closed

Applied in changeset git|11ae581a4a7f5d5f5ec6378872eab8f25381b1b9.


Fix handling of control/meta escapes in literal regexps

Ruby uses a recursive algorithm for handling control/meta escapes
in strings (read_escape). However, the equivalent code for regexps
(tokadd_escape) did not use a recursive algorithm. Due to this,
handling of control/meta escapes in regexp did not have the same
behavior as in strings, leading to behavior such as the following
returning nil:

/\c\xFF/ =~ "\c\xFF"

Switch the code for handling \c, \C and \M in literal regexps to
use the same code as for strings (read_escape), to keep behavior
consistent between the two.

Fixes [Bug #14367]

Updated by nobu (Nobuyoshi Nakada) about 1 month ago

I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something for encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

nobu (Nobuyoshi Nakada) wrote in #note-7:

I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something for encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

The previous behavior also ended up with a regexp that matches an 8-bit character, so maybe Ruby should have raised the same error before? Alternatively, I can revert if that is better.

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

jeremyevans0 (Jeremy Evans) wrote in #note-8:

nobu (Nobuyoshi Nakada) wrote in #note-7:

I agree that the previous behavior might not be intentional, but 11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 also seems to break something for encodings other than US-ASCII.

$ LANG=en_US.UTF-8 ./ruby -vce '/\c\xFF/'
ruby 3.1.0dev (2021-05-13T01:55:43Z master 11ae581a4a) [x86_64-darwin19]
-e:1: invalid multibyte escape: /\x9F/
-e:1: warning: possibly useless use of a literal in void context

The previous behavior also ended up with a regexp that matches an 8-bit character, so maybe Ruby should have raised the same error before? Alternatively, I can revert if that is better.

My previous statement was incorrect. The reason it worked before is that \c behavior in regexps was wrong and did not result in the 8-bit character it should have. If you used an escape that did result in a character with the high bit set, you already got a similar error:

$ LANG=en_US.UTF-8 ruby -vce '/\M-a/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: too short escaped multibyte character: /\M-a/
-e:1: warning: possibly useless use of a literal in void context

You would also get an error if you created a regexp using a string instead of using a literal regexp:

$ LANG=en_US.UTF-8 ruby -ve '/#{s="\c\xff"}/'
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-openbsd]
-e:1: warning: possibly useless use of a literal in void context
-e:1:in `<main>': invalid multibyte character (ArgumentError)

So I don't think anything is broken on UTF-8 (or other encodings). Before, it should have raised an error and it didn't because the incorrect algorithm resulted in the wrong character. Now it raises an error as it should.
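The encoding dependence can be checked without a regexp literal by building the pattern from the single byte `0x9F` directly. This is a sketch under the assumption that `Regexp.new` rejects a string that is not valid in its own encoding. The exact error class and message may differ between Ruby versions, so the code reports what it sees instead of asserting it.

```ruby
# The single byte that "\c\xFF" produces (0x9F), built from raw bytes so the
# example does not depend on the script's source encoding.
byte_string = [0x9F].pack("C")

binary = byte_string.b                                   # ASCII-8BIT copy
utf8   = byte_string.dup.force_encoding(Encoding::UTF_8) # same byte, tagged UTF-8

# A lone 0x9F byte is a continuation byte, so it is not valid UTF-8.
p utf8.valid_encoding?

# Binary data: every byte is a character, so the pattern compiles and matches.
binary_match = Regexp.new(binary) =~ binary
p binary_match

# UTF-8: report what this Ruby does instead of assuming an exact error class.
utf8_result =
  begin
    Regexp.new(utf8)
    :compiled
  rescue StandardError => e
    e.class
  end
p utf8_result
```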
