Bug #21709: Inconsistent encoding by Regexp.escape

Added by thyresias (Thierry Lambert) about 19 hours ago. Updated about 12 hours ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x64-mingw-ucrt]
[ruby-core:123894]

Description

%w(foo être).each do |s|
  puts "string: #{s.inspect} -> #{s.encoding}"
  puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end

Output:

string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8

The result should always match the encoding of the argument.

Updated by jeremyevans0 (Jeremy Evans) about 17 hours ago #1 [ruby-core:123895]

  • Status changed from Open to Feedback

This is not a bug; it is deliberate behavior for ASCII-only strings in rb_reg_quote (the internal function called by Regexp.escape):

    if (ascii_only) {
        rb_enc_associate(tmp, rb_usascii_encoding());
    }
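
The rule in that C snippet can be restated in Ruby as a rough sketch. The helper name predicted_escape_encoding is hypothetical (it is not part of Ruby), and the prediction is only an assumption drawn from the snippet above and the output in the report:

```ruby
# Hypothetical helper: predicts the encoding Regexp.escape is expected to
# return, based on the ascii_only check quoted from rb_reg_quote.
def predicted_escape_encoding(str)
  str.ascii_only? ? Encoding::US_ASCII : str.encoding
end

# "\u00EAtre" is "être"; the \u escape keeps the literal UTF-8 regardless of
# the source encoding.
["foo", "a.b", "\u00EAtre"].each do |s|
  escaped = Regexp.escape(s)
  puts "#{s.inspect}: predicted #{predicted_escape_encoding(s)}, actual #{escaped.encoding}"
end
```

If the predicted and actual encodings ever differ on a given Ruby version, the sketch is wrong for that version, not the interpreter.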

US-ASCII strings will be automatically converted to UTF-8 if necessary:

("foo".encode("US-ASCII") + "\u1234").encoding
# => #<Encoding:UTF-8>
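
As a small illustration of where that automatic conversion applies and where it does not, the sketch below uses Encoding.compatible? to ask which encoding a combination would get. The variable names and the ISO-8859-1 string are illustrative assumptions, not taken from the report:

```ruby
ascii  = "foo".encode("US-ASCII")        # ASCII-only content
utf8   = "\u1234"                        # non-ASCII, UTF-8
latin1 = "\u00E9".encode("ISO-8859-1")   # non-ASCII, ISO-8859-1

# An ASCII-only string combines with a UTF-8 string in either order.
puts "compatible?(ascii, utf8): #{Encoding.compatible?(ascii, utf8).inspect}"
puts "ascii + utf8: #{(ascii + utf8).encoding}"
puts "utf8 + ascii: #{(utf8 + ascii).encoding}"

# Two non-ASCII strings in different encodings cannot be combined.
puts "compatible?(latin1, utf8): #{Encoding.compatible?(latin1, utf8).inspect}"
begin
  latin1 + utf8
rescue Encoding::CompatibilityError => e
  puts "CompatibilityError: #{e.message}"
end
```

The conversion depends on the content being ASCII-only, not just on the US-ASCII label.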

Does this behavior cause any problems in your application?

Updated by thyresias (Thierry Lambert) about 16 hours ago #2 [ruby-core:123896]

> Does this behavior cause any problems in your application?

Yes:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Updated by jeremyevans0 (Jeremy Evans) about 15 hours ago #3 [ruby-core:123897]

  • Status changed from Feedback to Open

thyresias (Thierry Lambert) wrote in #note-2:

> Does this behavior cause any problems in your application?
>
> Yes:
>
> search_text = "foo"
> s_search = Regexp.escape(search_text)
> re_prefix = /\p{In_Arabic}.+ /
> s_search.prepend re_prefix.source
> _re = /^#{s_search}|(?<=– |: )#{s_search}/ #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)

Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with Regexp.escape. You can trigger the issue without Regexp.escape:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8

Triggering it seems to require specifying a Unicode property inside an interpolated string that isn't in UTF-8.

You get a different error without that Unicode character at the end:

re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/

Using Regexp.new instead of a literal Regexp may work around the issue:

search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
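
Another possible workaround, offered only as a sketch, is to re-tag the escaped string with the argument's encoding before interpolating it. The helper name escape_keeping_encoding is hypothetical and not part of Ruby. Whether this avoids the RegexpError in the literal Regexp above is an assumption that was not verified in this thread:

```ruby
# Hypothetical helper: returns Regexp.escape(str) tagged with str's encoding.
# Assumption: Regexp.escape only switches to US-ASCII when the content is
# ASCII-only, so re-tagging the bytes with the original encoding is valid.
def escape_keeping_encoding(str)
  escaped = Regexp.escape(str)
  return escaped if escaped.encoding == str.encoding

  # dup is defensive: it avoids mutating a string we did not create ourselves.
  escaped.dup.force_encoding(str.encoding)
end

search_text = "foo".encode(Encoding::UTF_8)
s_search = escape_keeping_encoding(search_text)
puts "#{s_search.inspect} -> #{s_search.encoding}"
```

force_encoding only changes the encoding label and leaves the bytes untouched, which is why it is limited here to the ASCII-only case.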

Updated by thyresias (Thierry Lambert) about 13 hours ago #4 [ruby-core:123898]

Ok for the workaround, but don't you think all this is inconsistent?
For me, it's a bug, not a feature. ^_^

Updated by jeremyevans0 (Jeremy Evans) about 12 hours ago #5 [ruby-core:123899]

thyresias (Thierry Lambert) wrote in #note-4:

> Ok for the workaround, but don't you think all this is inconsistent?
> For me, it's a bug, not a feature. ^_^

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in Regexp.escape.

In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having Regexp.escape return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:

# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>

# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>

# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>