Bug #20039: Matching US-ASCII string to copied UTF-8 Regexp causes invalid multibyte character error - Ruby - Ruby Issue Tracking System

Added by dbrown9@gmail.com (Dustin Brown) almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 3.3.0dev (2023-12-03 master 85bc80a)
Backport: 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED

[ruby-core:115588]

Description

Matching a US-ASCII string to a UTF-8 encoded regexp with multibyte characters works as expected.

re = Regexp.new("\u2018".encode("UTF-8"))
"".encode("US-ASCII").match?(re) 

=> false

However, if that regexp is used to initialize a new regexp, the match fails with an "invalid multibyte character" error.

re = Regexp.new("\u2018".encode("UTF-8"))
"".encode("US-ASCII").match?(Regexp.new(re))

=> ArgumentError: regexp preprocess failed: invalid multibyte character

After a bunch of digging, I discovered that this error was due to the fixed encoding flag not being copied over from the original regexp. This pull request addresses the issue by copying the fixed encoding and no encoding flags during reg_copy.
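One way to observe the missing flag from Ruby code is `Regexp#fixed_encoding?`. The sketch below is only an illustration: `describe_match` is a hypothetical helper, not something from the pull request, and it rescues the error so the script runs on both affected and fixed builds. On an affected build the copy would be expected to report `false` for `fixed_encoding?` and to raise on match; with the fix applied, both regexps should behave the same.

```ruby
# Hypothetical helper (not from the issue): matches an empty US-ASCII
# string against `re` and returns the result, or the error text if the
# match raises.
def describe_match(re)
  "".encode("US-ASCII").match?(re)
rescue StandardError => e
  "#{e.class}: #{e.message}"
end

# A UTF-8 regexp containing a multibyte character (U+2018).
original = Regexp.new("\u2018".encode("UTF-8"))
# A new regexp initialized from the original one.
copy = Regexp.new(original)

puts "original fixed_encoding?: #{original.fixed_encoding?}"
puts "copy     fixed_encoding?: #{copy.fixed_encoding?}"
puts "original match: #{describe_match(original)}"
puts "copy     match: #{describe_match(copy)}"
```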

Ref: https://github.com/ruby/ruby/pull/9120

Updated by dbrown9@gmail.com (Dustin Brown) almost 2 years ago
#1

Description updated (diff)

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago
#2

Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED

Updated by dbrown9@gmail.com (Dustin Brown) almost 2 years ago
#3

Status changed from Open to Closed

Applied in changeset git|d89280e8bf6496aa83326b5f9c293724bd1cc1e9.

Copy encoding flags when copying a regex [Bug #20039]

:bug: Fixes Bug #20039

When a Regexp is initialized with another Regexp, we simply copy the
properties from the original. However, the flags on the original were
not being copied correctly. This caused an issue when the original had
multibyte characters and was being compared with an ASCII string.
Without the forced encoding flag (KCODE_FIXED) transferred on to the
new Regexp, the comparison would fail. See the included test for an
example.
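The test shipped with the changeset is not reproduced on this page. As a rough stand-in, the following sketch checks the property the commit message describes: that copying a regexp keeps its fixed-encoding state. The helper name `copy_keeps_fixed_encoding?` is an assumption made for illustration and is not part of the patch.

```ruby
# Hypothetical helper (not part of the actual patch or its test):
# true when copying `re` via Regexp.new keeps its fixed-encoding state.
def copy_keeps_fixed_encoding?(re)
  Regexp.new(re).fixed_encoding? == re.fixed_encoding?
end

# An ASCII-only regexp has no fixed encoding, so there is no flag to lose.
ascii_only = /abc/
# A regexp with a multibyte character has its encoding fixed to UTF-8.
multibyte = Regexp.new("\u2018".encode("UTF-8"))

puts copy_keeps_fixed_encoding?(ascii_only)
puts copy_keeps_fixed_encoding?(multibyte)
```

On a build that includes the changeset, the second line should print `true`; on an affected build it would be expected to print `false`.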

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
