Misc #20407
closed
Question about applying encoding modifier to an interpolated Regexp
Description
I am wondering how the Regexp encoding modifiers (u, s, e, n) affect encoding negotiation between the fragments of an interpolated Regexp literal.
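For reference, here is a minimal sketch of what the modifiers do on a literal without interpolation. The pattern is ASCII-only, so any result other than US-ASCII comes from the modifier. The `modifier_encodings` hash is just an illustrative name, and the n result is printed rather than asserted.

```ruby
# ASCII-only literals: the encoding can only come from the modifier.
modifier_encodings = {
  "none" => /a/.encoding,
  "u"    => /a/u.encoding,
  "s"    => /a/s.encoding,
  "e"    => /a/e.encoding,
  "n"    => /a/n.encoding,
}

# Print one line per modifier so the results can be compared side by side.
modifier_encodings.each do |modifier, encoding|
  puts "#{modifier}: #{encoding}"
end
```

The question below is about how these per-modifier encodings combine with the encodings of interpolated fragments.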
Example #1
# encoding: us-ascii
# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
Example #2
# encoding: utf-8
# windows-1251: Ф - 0xD4
# Unicode: Ф - U+0424
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding # ASCII-8BIT
In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the encoding would be US-ASCII without the modifier:
# encoding: us-ascii
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
# encoding: utf-8
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
In all the other cases the e modifier doesn't change the Regexp's final encoding, whether the encoding without the modifier is a source file encoding or ASCII-8BIT.
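The single case where the modifier makes a difference can be isolated in a small script. This is only a sketch of the observation, not an explanation of the mechanism. The variable names are illustrative, and `Regexp#fixed_encoding?` is printed as an extra data point that the examples above don't show.

```ruby
# An ASCII-only fragment that is merely tagged as Windows-1251.
ascii_fragment = "b".encode("Windows-1251")

without_modifier = /a #{ascii_fragment} c/
with_modifier    = /a #{ascii_fragment} c/e

# Show the negotiated encoding and whether it is marked as fixed.
[without_modifier, with_modifier].each do |re|
  puts "#{re.inspect}: encoding=#{re.encoding}, fixed_encoding?=#{re.fixed_encoding?}"
end
```

Since every fragment here is ASCII-only, none of them seems to force an encoding on the result, which may be why the modifier gets to decide.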
Looking at the following example:
# encoding: us-ascii
# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding # ASCII-8BIT
# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding # ASCII-8BIT
we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier is applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, rather than to the whole result after negotiation.
Could you please clarify how it works?