Misc #20407
Closed
Question about applying encoding modifier to an interpolated Regexp
Description
I am wondering how Regexp encoding modifiers (u, s, e, n) affect the encoding negotiation between the parts/fragments of an interpolated Regexp literal.
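For reference, here is a minimal sketch of what each modifier does on a plain, non-interpolated, ASCII-only literal, which is the baseline the interpolated cases below are compared against. The helper name modifier_encodings is hypothetical and only used for illustration.

```ruby
# Hypothetical helper (not part of the issue): reports the encoding that each
# modifier gives to an ASCII-only Regexp literal without interpolation.
def modifier_encodings
  {
    none: /a/.encoding,   # no modifier
    u: /a/u.encoding,     # UTF-8
    e: /a/e.encoding,     # EUC-JP
    s: /a/s.encoding,     # Windows-31J
    n: /a/n.encoding      # "no encoding"; ASCII-only source stays US-ASCII
  }
end

modifier_encodings.each { |modifier, enc| puts "#{modifier}: #{enc}" }
```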
Example #1
# encoding: us-ascii
# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
Example #2
# encoding: utf-8
# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
In the examples above, the e modifier changes the Regexp's encoding in only one case, when the Regexp's encoding would be US-ASCII without the modifier:
# encoding: us-ascii
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
# encoding: utf-8
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
In all the other cases, where the Regexp's encoding without the modifier is either a file source encoding or ASCII-8BIT, the e modifier doesn't change the Regexp's final encoding.
Looking at the following example:
# encoding: us-ascii
# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding # ASCII-8BIT
# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding # ASCII-8BIT
we can notice that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but doesn't in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier could be applied to the Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation and not to the whole result after negotiation.
Could you please clarify how it works?
Updated by andrykonchin (Andrew Konchin) 10 months ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) 9 months ago
- Related to Misc #20406: Question about Regexp encoding negotiation added
Updated by Eregon (Benoit Daloze) 9 months ago
- Related to Bug #20466: Interpolated regular expressions have different encoding than interpolated strings added
Updated by naruse (Yui NARUSE) 8 months ago
I checked the related source code, especially rb_reg_preprocess_dregexp. It wrongly calls rb_reg_preprocess, overwriting fixed_enc instead of inheriting it.
It seems to raise an error if the resulting encoding of the regexp is other than EUC-JP in this case.
(The US-ASCII case should also raise an error or show a warning, compared with //n's behavior.)
I'm still wondering whether we should fix this issue, because there is a trade-off between compatibility and the merit of this improvement.
Updated by nobu (Nobuyoshi Nakada) 7 months ago
I think:
- If a Regexp source string contains non-US-ASCII chars, the source string encoding is honored.
- If the source string contains US-ASCII chars only, falls back to
  a. an encoding option if given.
  b. US-ASCII.
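The rule described above can be sketched as a small pure-Ruby model. This is only an illustration of the description in this comment, not MRI's actual implementation; the method name described_regexp_encoding and its parameters are hypothetical.

```ruby
# Hypothetical model of the rule described above (not MRI's implementation).
# source:     the Regexp source string after interpolation
# option_enc: the Encoding implied by a modifier (e.g. Encoding::EUC_JP for /e),
#             or nil when no modifier is given
def described_regexp_encoding(source, option_enc = nil)
  # Non-US-ASCII content: the source string's own encoding is honored.
  return source.encoding unless source.ascii_only?

  # US-ASCII content only: fall back to the option, then to US-ASCII.
  option_enc || Encoding::US_ASCII
end

cyrillic = "a \xD4 c".b.force_encoding(Encoding::Windows_1251)
ascii    = "a b c".encode(Encoding::Windows_1251)
binary   = "a \xC2\xA1 c".b

puts described_regexp_encoding(cyrillic, Encoding::EUC_JP)
puts described_regexp_encoding(ascii, Encoding::EUC_JP)
puts described_regexp_encoding(ascii)
puts described_regexp_encoding(binary, Encoding::EUC_JP)
```

Under this model, the e modifier only takes effect for an ASCII-only source, which is consistent with the results reported in the description.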
Updated by matz (Yukihiro Matsumoto) 7 months ago
I think encoding modifiers for Regexp should be deprecated (and gradually removed), although the bug should be fixed anyway.
Matz.
Updated by naruse (Yui NARUSE) 6 months ago
Since this feature is not widely used and will not be widely used, how about we keep this as is?
After a while, this feature should be removed, like $KCODE and other deprecated encoding features.
Updated by matz (Yukihiro Matsumoto) 6 months ago
- Status changed from Open to Closed
It seems to be reasonable. Accepted.
Matz.