Misc #20407

closed

Question about applying encoding modifier to an interpolated Regexp

Added by andrykonchin (Andrew Konchin) 8 months ago. Updated 5 months ago.

Status: Closed
Assignee: -
[ruby-core:117431]

Description

I am wondering how the Regexp encoding modifiers (u, s, e, n) affect the encoding negotiation between the fragments of an interpolated Regexp literal.
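For reference, here is a minimal sketch of what the modifiers do on plain, non-interpolated, ASCII-only literals, which is the baseline the interpolated cases are compared against. The `baseline` hash is only an illustrative name. The n modifier is left out of the sketch.

```ruby
# Encodings of plain (non-interpolated) ASCII-only literals under each modifier.
# `baseline` is an illustrative name, not part of any API.
baseline = {
  "none" => /a/,
  "u"    => /a/u,
  "s"    => /a/s,
  "e"    => /a/e,
}

# Print the resulting encoding and whether it is fixed for each literal.
baseline.each do |modifier, re|
  puts format("%-4s %-12s fixed_encoding?=%s", modifier, re.encoding, re.fixed_encoding?)
end
```

Without a modifier, an ASCII-only literal is US-ASCII and not fixed. With u, s, or e, the encoding is fixed to UTF-8, Windows-31J, or EUC-JP respectively.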

Example #1

# encoding: us-ascii

# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

Example #2

# encoding: utf-8

# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the encoding would be US-ASCII without the modifier:

# encoding: us-ascii

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

# encoding: utf-8

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

The e modifier doesn't change the Regexp's final encoding in any of the other cases, where the encoding without the modifier is either the source file encoding or ASCII-8BIT.
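One possible reading of this pattern, offered as an assumption rather than a confirmed explanation, is that an ASCII-only fragment does not pin the encoding while a non-ASCII fragment does. The sketch below probes this with `Regexp.new` instead of literal interpolation, so it shows related behavior, not necessarily the same code path. `regexp_encoding_info` is a hypothetical helper name.

```ruby
# Hypothetical helper: report how Regexp.new encodes a given source string.
def regexp_encoding_info(source)
  re = Regexp.new(source)
  { encoding: re.encoding, fixed: re.fixed_encoding? }
end

# ASCII-only content tagged as Windows-1251.
ascii_fragment = "b".encode("Windows-1251")
# Non-ASCII content in Windows-1251 (0xD4 is the letter Ф).
non_ascii_fragment = "\xD4".dup.force_encoding("Windows-1251")

ascii_info     = regexp_encoding_info(ascii_fragment)
non_ascii_info = regexp_encoding_info(non_ascii_fragment)

puts "ASCII-only fragment: #{ascii_info}"
puts "non-ASCII fragment:  #{non_ascii_info}"
```

The ASCII-only source yields a US-ASCII Regexp that is not fixed, while the non-ASCII source yields a Windows-1251 Regexp with a fixed encoding. That is consistent with a modifier being able to override only the first kind.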

Looking at the following example:

# encoding: us-ascii

# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding                                 # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding              # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding                                     # ASCII-8BIT

# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding                                # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding             # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding                                    # ASCII-8BIT

we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier is applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, rather than to the whole result after negotiation.

Could you please clarify how it works?


Related issues: 2 (2 open, 0 closed)

Related to Ruby master - Misc #20406: Question about Regexp encoding negotiation (Open)
Related to Ruby master - Bug #20466: Interpolated regular expressions have different encoding than interpolated strings (Open)