Misc #20407: Question about applying encoding modifier to an interpolated Regexp - Ruby master - Ruby Issue Tracking System

Misc #20407

open

Question about applying encoding modifier to an interpolated Regexp

Added by andrykonchin (Andrew Konchin) about 2 months ago. Updated 2 days ago.

Status: Open

Assignee:

[ruby-core:117431]

Description

I am wondering how the Regexp encoding modifiers (u, s, e, n) affect encoding negotiation between the fragments of an interpolated Regexp literal.
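For reference, here is a minimal sketch of what the modifiers do on a literal without interpolation, assuming an ASCII-only pattern. The method name modifier_literals and the hash keys are hypothetical and used only for this illustration:

```ruby
# Returns ASCII-only Regexp literals keyed by their encoding modifier.
# The method name and hash keys are hypothetical, for illustration only.
def modifier_literals
  {
    "none" => /a/,
    "u"    => /a/u,
    "s"    => /a/s,
    "e"    => /a/e,
    "n"    => /a/n,
  }
end

# Print the encoding and the fixed-encoding flag for each literal.
modifier_literals.each do |modifier, re|
  puts "#{modifier}: encoding=#{re.encoding} fixed_encoding?=#{re.fixed_encoding?}"
end
```

With u, s, and e the literal reports UTF-8, Windows-31J, and EUC-JP respectively and fixed_encoding? is true, while an ASCII-only literal without a modifier is US-ASCII. The question in this issue is how that interacts with interpolated fragments.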

Example #1

# encoding: us-ascii

# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

Example #2

# encoding: utf-8

# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the encoding would be US-ASCII without the modifier:

# encoding: us-ascii

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

# encoding: utf-8

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

In all the other cases, where the Regexp's encoding without the modifier is the encoding of the interpolated string (Windows-1251, UTF-8, or ASCII-8BIT), the e modifier doesn't change the Regexp's final encoding.
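The single case that the e modifier changes is also the only one whose interpolated fragment is ASCII-only. The following is a minimal sketch for checking that property of each fragment. It uses Regexp.new on the fragment alone, so it does not exercise the interpolated-literal code path this issue asks about, and the method name fragment_report and the labels are hypothetical:

```ruby
# Describes the interpolated fragments used in the examples above.
# The method name and labels are hypothetical, for illustration only.
def fragment_report
  fragments = {
    "windows-1251, non-ASCII"  => "\xd4".b.force_encoding("windows-1251"),
    "windows-1251, ASCII-only" => "b".encode("windows-1251"),
    "UTF-8, non-ASCII"         => "\u0424",
    "binary, non-ASCII"        => "\xc2\xa1".b,
  }

  fragments.transform_values do |s|
    {
      encoding:   s.encoding,             # encoding tag of the fragment
      ascii_only: s.ascii_only?,          # true if every byte is 7-bit ASCII
      regexp_new: Regexp.new(s).encoding, # encoding of a Regexp built from it
    }
  end
end

fragment_report.each { |label, info| puts "#{label}: #{info}" }
```

Only the "b" fragment is ASCII-only, and a Regexp built from it alone is US-ASCII even though the string itself is tagged Windows-1251. This is an observation about the fragments, not a description of how negotiation is implemented.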

Looking at the following example:

# encoding: us-ascii

# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding                                 # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding              # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding                                     # ASCII-8BIT

# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding                                # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding             # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding                                    # ASCII-8BIT

we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier is applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, rather than to the whole result after negotiation.

Could you please clarify how it works?

Related issues 2 (2 open — 0 closed)

Updated by andrykonchin (Andrew Konchin) about 2 months ago

Description updated (diff)

Updated by andrykonchin (Andrew Konchin) about 1 month ago

Description updated (diff)

Updated by Eregon (Benoit Daloze) about 1 month ago

Related to Misc #20406: Question about Regexp encoding negotiation added

Updated by Eregon (Benoit Daloze) 15 days ago

Related to Bug #20466: Interpolated regular expressions have different encoding than interpolated strings added

#7 [ruby-core:117903]

Updated by naruse (Yui NARUSE) 2 days ago

I checked the related source code, in particular rb_reg_preprocess_dregexp. It wrongly calls rb_reg_preprocess in a way that overwrites fixed_enc instead of inheriting it.

It seems it should raise an error if the resulting encoding of the regexp is anything other than EUC-JP in this case.
(The US-ASCII case should also raise an error or show a warning, in line with //n's behavior.)

I'm still wondering whether we should fix this issue, because there is a trade-off between compatibility and the merit of this improvement.
