Bug #19455
Closed
Ruby 3.2: wrong Regexp encoding with non-ASCII comments
Description
Non-ASCII characters in comments and comment groups no longer trigger the correct Regexp#encoding
on Ruby 3.2:
# ruby 3.1
/#a/x.encoding # => #<Encoding:US-ASCII> # OK
/(?#a)/.encoding # => #<Encoding:US-ASCII> # OK
/#ü/x.encoding # => #<Encoding:UTF-8> # OK
/(?#ü)/.encoding # => #<Encoding:UTF-8> # OK
# ruby 3.2
/#a/x.encoding # => #<Encoding:US-ASCII> # OK
/(?#a)/.encoding # => #<Encoding:US-ASCII> # OK
/#ü/x.encoding # => #<Encoding:US-ASCII> # BUG
/(?#ü)/.encoding # => #<Encoding:US-ASCII> # BUG
/#ü/x.inspect # => "/#\\xC3\\xBC/x"
/(?#ü)/.inspect # => "/(?#\\xC3\\xBC)/"
# bug is hidden if there are non-ascii chars outside comments
/ü#ü/x.encoding # => #<Encoding:UTF-8>
/ü(?#ü)/.encoding # => #<Encoding:UTF-8>
I think these changes might be the cause: https://github.com/ruby/ruby/commit/ec3542229b29ec93062e9d90e877ea29d3c19472#diff-c3675fa319803b2f5a775defa40694edb9a761baa3a54fa78e1fdef8f918cc7cR2837-R2890
Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago
I'm not sure that this is a bug. If all non-comment characters considered in the regexp are in the US-ASCII range, it seems reasonable for US-ASCII to be used as the regexp encoding. I'll add this ticket to the next developer meeting and see what other committers think.
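For reference, the rule described here for non-comment characters can be observed with a minimal sketch like the one below. It assumes that building the patterns with Regexp.new from escaped strings is an acceptable stand-in for the literals in the report (this keeps the snippet independent of the source file's encoding); the variable names are illustrative only.

```ruby
# Patterns are built from escaped strings, so the result does not depend
# on the encoding of the file this snippet is saved in.
ascii_only = Regexp.new("a")       # every character is in the US-ASCII range
non_ascii  = Regexp.new("\u00FC")  # U+00FC is the u-umlaut used in the report

# Collect each pattern's encoding and whether that encoding is fixed.
report = [ascii_only, non_ascii].map do |re|
  [re.encoding, re.fixed_encoding?]
end

report.each { |encoding, fixed| puts "#{encoding} fixed=#{fixed}" }
# prints:
#   US-ASCII fixed=false
#   UTF-8 fixed=true
```

A pattern with only US-ASCII characters is not pinned to an encoding, while one with a non-ASCII character outside a comment is fixed to UTF-8. The open question in this ticket is only whether characters inside comments should count as well.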
Updated by mame (Yusuke Endoh) almost 2 years ago
@janosch-x Do you have any specific problem with this change? For example, a string that used to match no longer matches, or vice versa.
Updated by janosch-x (Janosch Müller) almost 2 years ago
I don't have a problem with this myself, and the matching behavior is not affected as far as I can tell.
notable behavioral differences are:
- /#ü/x.source == '#ü' used to be true but is now false. This might break some tests or metaprogramming (not very likely IMO).
- /#{/#ü/x.source}/ now raises ArgumentError (invalid multibyte character).
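For code that reuses Regexp#source on an affected 3.2 build, one possible workaround is to retag the source string before comparing or embedding it. The sketch below is only an illustration: `retag_as_utf8` is a hypothetical helper, not part of Ruby, and it assumes the affected source is a string of UTF-8 bytes tagged as US-ASCII, which is what the comparison result above suggests. The mislabeled string is simulated so the snippet behaves the same on any Ruby version.

```ruby
# Hypothetical helper (not part of Ruby): return the source retagged as UTF-8
# when it is invalid in its current encoding but its bytes are valid UTF-8.
def retag_as_utf8(source)
  return source if source.valid_encoding?

  candidate = source.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : source
end

# Simulated result of the affected behavior (an assumption, see above):
# the UTF-8 bytes of "#" + U+00FC, tagged as US-ASCII.
mislabeled = "#\u00FC".b.force_encoding(Encoding::US_ASCII)

fixed = retag_as_utf8(mislabeled)
puts mislabeled.valid_encoding?  # false
puts fixed.encoding              # UTF-8
puts fixed == "#\u00FC"          # true
```

The helper leaves valid strings untouched, and leaves strings that are not valid UTF-8 either. It only changes the encoding tag and never the bytes. Whether the retagged source can then be interpolated into a new regexp on an affected build has not been checked here.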
Updated by mame (Yusuke Endoh) almost 2 years ago
Discussed at the dev meeting. @matz (Yukihiro Matsumoto) said he would prefer 3.1 behavior if possible (but not high priority). @nobu (Nobuyoshi Nakada) said he would take a look.
Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago
I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/7592
Updated by jeremyevans (Jeremy Evans) over 1 year ago
- Status changed from Open to Closed
Applied in changeset git|a8ba1ddd78544b4bda749051d44f7b2a8a0ec5ff.
Use UTF-8 encoding for literal extended regexps with UTF-8 characters in comments
Fixes [Bug #19455]
Updated by nagachika (Tomoyuki Chikanaga) over 1 year ago
- Backport changed from 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: REQUIRED
Updated by nagachika (Tomoyuki Chikanaga) over 1 year ago
- Backport changed from 3.0: DONTNEED, 3.1: DONTNEED, 3.2: REQUIRED to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONE
ruby_3_2 be09d77b966c7bcc77957927f16cefe66b365495 merged revision(s) a8ba1ddd78544b4bda749051d44f7b2a8a0ec5ff.