Misc #20407
closed
Question about applying encoding modifier to an interpolated Regexp
Description
I am wondering how Regexp encoding modifiers (u, s, e, n) interfere in encoding negotiation of parts/fragments in an interpolated Regexp literal.
Example #1
# encoding: us-ascii
# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
Example #2
# encoding: utf-8
# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
In the examples above, the e modifier changes the Regexp's encoding in only one case: when the Regexp's encoding would be US-ASCII without the modifier:
# encoding: us-ascii
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
# encoding: utf-8
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
The e modifier doesn't change the Regexp's final encoding in any of the other cases, where the Regexp's encoding without the modifier is either the source file encoding or ASCII-8BIT.
Looking at the following example:
# encoding: us-ascii
# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding # ASCII-8BIT
# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding # ASCII-8BIT
we can notice that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but doesn't in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier could be applied to the Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation and not to the whole result after negotiation.
Could you please clarify how it works?
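The assumption above can be written down as a small model. This is only a sketch: Fragment and modelled_encoding are hypothetical names invented for illustration, and the code does not claim to reflect how MRI is implemented. It tags non-ASCII literal fragments with the modifier's encoding first, and only then negotiates across all fragments.

```ruby
# Hypothetical model of the assumption above (not MRI's implementation).
# A fragment is either literal Regexp source or an interpolated string.
Fragment = Struct.new(:text, :literal)

def modelled_encoding(fragments, modifier_encoding = nil)
  # ASCII-only fragments are compatible with everything, so only
  # fragments containing non-ASCII bytes take part in the negotiation.
  non_ascii = fragments.reject { |f| f.text.ascii_only? }

  # Step 1 (assumed): the modifier is applied to literal fragments only,
  # before negotiation. Interpolated strings keep their own encoding.
  candidates = non_ascii.map do |f|
    (f.literal && modifier_encoding) ? modifier_encoding : f.text.encoding
  end.uniq

  # Step 2 (assumed): negotiate. With no non-ASCII fragment at all, fall
  # back to the modifier, then to US-ASCII.
  return modifier_encoding || Encoding::US_ASCII if candidates.empty?
  if candidates.size > 1
    raise ArgumentError, "incompatible fragment encodings: #{candidates.inspect}"
  end

  candidates.first
end

# First case: /\xc2\xa1 #{ "a" }\xc2\xa1/e
first = [
  Fragment.new("\xc2\xa1 ".b, true),
  Fragment.new("a", false),
  Fragment.new("\xc2\xa1".b, true)
]

# Third case: /a #{ "\xc2\xa1".b } b/e
third = [
  Fragment.new("a ", true),
  Fragment.new("\xc2\xa1".b, false),
  Fragment.new(" b", true)
]

p modelled_encoding(first, Encoding::EUC_JP)
p modelled_encoding(third, Encoding::EUC_JP)
```

Under this model the first case yields EUC-JP and the third stays binary, which agrees with the results listed above. That agreement supports the assumption but does not prove it.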
Updated by andrykonchin (Andrew Konchin) about 1 year ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) about 1 year ago
- Related to Misc #20406: Question about Regexp encoding negotiation added
Updated by Eregon (Benoit Daloze) 12 months ago
- Related to Bug #20466: Interpolated regular expressions have different encoding than interpolated strings added
Updated by naruse (Yui NARUSE) 11 months ago
I checked the related source code, especially rb_reg_preprocess_dregexp. It wrongly calls rb_reg_preprocess, overwriting fixed_enc instead of inheriting it.
It seems that it should raise an error if the resulting encoding of the regexp is other than EUC-JP in this case.
(The US-ASCII case should also raise an error or show a warning, compared with //n's behavior.)
I'm still wondering whether we should fix this issue, because there is a trade-off between compatibility and the merit of this improvement.
Updated by nobu (Nobuyoshi Nakada) 11 months ago
I think:
- If a Regexp source string contains non-US-ASCII chars, the source string encoding is honored.
- If the source string contains US-ASCII chars only, falls back to
  a. an encoding option if given.
  b. US-ASCII.
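The rule above can be sketched as a small function. The name proposed_regexp_encoding and its parameters are hypothetical, introduced only to illustrate this comment. The sketch models the rule as stated here, not necessarily what the interpreter currently does.

```ruby
# Hypothetical helper modelling the rule stated above; it is not part of
# Ruby's API. `source` is the whole Regexp source string after
# interpolation, `option_encoding` is the encoding implied by a modifier
# (for example Encoding::EUC_JP for /e), or nil when no modifier is given.
def proposed_regexp_encoding(source, option_encoding = nil)
  # Non-US-ASCII chars present: the source string encoding is honored.
  return source.encoding unless source.ascii_only?

  # US-ASCII chars only: fall back to the option if given, else US-ASCII.
  option_encoding || Encoding::US_ASCII
end

win1251 = "a \xd4 c".dup.force_encoding("windows-1251")

p proposed_regexp_encoding("a b c")
p proposed_regexp_encoding("a b c", Encoding::EUC_JP)
p proposed_regexp_encoding(win1251, Encoding::EUC_JP)
```

Under this rule, an ASCII-only source picks up the modifier's encoding, while a source with non-ASCII characters keeps its own encoding whether or not a modifier is given.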
Updated by matz (Yukihiro Matsumoto) 11 months ago
I think encoding modifiers for Regexp should be deprecated (and gradually removed), although the bug should be fixed anyway.
Matz.
Updated by naruse (Yui NARUSE) 10 months ago
Since this feature is not widely used and will not be widely used, how about keeping this as is?
After a while, this feature should be removed, like $KCODE and other deprecated encoding features.
Updated by matz (Yukihiro Matsumoto) 10 months ago
- Status changed from Open to Closed
It seems to be reasonable. Accepted.
Matz.