Misc #20407: Question about applying encoding modifier to an interpolated Regexp - Ruby - Ruby Issue Tracking System

Added by andrykonchin (Andrew Konchin) over 2 years ago. Updated about 2 years ago.

Status:

Closed

Assignee:

[ruby-core:117431]

Description

I am wondering how the Regexp encoding modifiers (u, s, e, n) affect encoding negotiation among the fragments of an interpolated Regexp literal.

Example #1

# encoding: us-ascii

# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

Example #2

# encoding: utf-8

# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the Regexp's encoding would be US-ASCII without the modifier:

# encoding: us-ascii

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

# encoding: utf-8

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

The e modifier does not change the Regexp's final encoding in any of the other cases, where the encoding without the modifier is something other than US-ASCII (Windows-1251, UTF-8, or ASCII-8BIT).

Looking at the following example:

# encoding: us-ascii

# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding                                 # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding              # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding                                     # ASCII-8BIT

# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding                                # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding             # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding                                    # ASCII-8BIT

we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier may be applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, rather than to the whole result after negotiation.

Could you please clarify how it works?
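
For anyone reproducing the results above, the following is a minimal sketch of a helper that reports both a Regexp's encoding and whether that encoding is fixed. The name regexp_encoding_info is an assumption made for illustration and is not part of the original report; Regexp#encoding and Regexp#fixed_encoding? are core methods. The last call prints whatever the running Ruby decides for an interpolated literal, so no output is claimed for it.

```ruby
# Hypothetical helper (illustrative name): summarize how a Regexp's
# encoding ended up, using only core Regexp methods.
def regexp_encoding_info(re)
  { encoding: re.encoding, fixed: re.fixed_encoding? }
end

# Non-interpolated, ASCII-only literals give a baseline for comparison.
p regexp_encoding_info(/a b c/)    # no modifier: US-ASCII, not fixed
p regexp_encoding_info(/a b c/e)   # e modifier: EUC-JP, fixed
p regexp_encoding_info(/a b c/u)   # u modifier: UTF-8, fixed

# Interpolated literal with a modifier: the case this issue asks about.
p regexp_encoding_info(/a #{ "b".encode("windows-1251") } c/e)
```

Comparing the fixed flag between the baseline and the interpolated cases may help show whether the modifier was honored after negotiation.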

Related issues 2 (2 open, 0 closed)

Updated by andrykonchin (Andrew Konchin) over 2 years ago
#1

Description updated

Updated by andrykonchin (Andrew Konchin) over 2 years ago
#2

Description updated

Updated by andrykonchin (Andrew Konchin) over 2 years ago
#3

Description updated

Updated by andrykonchin (Andrew Konchin) over 2 years ago
#4

Description updated

Updated by Eregon (Benoit Daloze) over 2 years ago
#5

Related to Misc #20406: Question about Regexp encoding negotiation added

Updated by Eregon (Benoit Daloze) about 2 years ago
#6

Related to Bug #20466: Interpolated regular expressions have different encoding than interpolated strings added

Updated by naruse (Yui NARUSE) about 2 years ago
#7 [ruby-core:117903]

I checked the related source code, in particular rb_reg_preprocess_dregexp. It wrongly calls rb_reg_preprocess in a way that overwrites fixed_enc instead of inheriting it.

It seems it should raise an error if the resulting encoding of the regexp is anything other than EUC-JP in this case.
(The US-ASCII case should also raise an error or show a warning, by analogy with //n's behavior.)

I'm still wondering whether we should fix this issue, because there is a trade-off between compatibility and the merit of this improvement.
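
One possible reading of the stricter behavior suggested above can be sketched in plain Ruby. This is a hypothetical model, not the C implementation: the function name strict_fixed_encoding, its arguments, and the error message are assumptions made for illustration, and the sketch does not model the US-ASCII case mentioned in parentheses.

```ruby
# Hypothetical model: with a fixed encoding from a modifier (e.g. /e),
# reject any fragment whose non-ASCII content is in a different encoding.
def strict_fixed_encoding(fragments, fixed_enc)
  fragments.each do |fragment|
    next if fragment.ascii_only?            # ASCII-only text conflicts with nothing here
    next if fragment.encoding == fixed_enc  # already in the fixed encoding
    raise RegexpError,
          "fragment encoded as #{fragment.encoding} conflicts with fixed #{fixed_enc}"
  end
  fixed_enc
end

euc_fragment  = "\xc2\xa1".dup.force_encoding("EUC-JP")
utf8_fragment = "\u0424"

# Accepted: the only non-ASCII fragment is already EUC-JP.
puts strict_fixed_encoding(["a ", euc_fragment, " b"], Encoding::EUC_JP).name

# Rejected: a non-ASCII UTF-8 fragment under a fixed EUC-JP encoding.
begin
  strict_fixed_encoding(["a ", utf8_fragment, " c"], Encoding::EUC_JP)
rescue RegexpError => e
  puts "rejected: #{e.message}"
end
```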

Updated by nobu (Nobuyoshi Nakada) about 2 years ago
#8 [ruby-core:118197]

I think:

1. If a Regexp source string contains non-US-ASCII chars, the source string encoding is honored.
2. If the source string contains US-ASCII chars only, it falls back to
   a. an encoding option if given.
   b. US-ASCII.
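
The rule proposed above can be written down as a short Ruby function. This is only a sketch of the stated rule, not Ruby's actual implementation; the name proposed_regexp_encoding and its parameters are assumptions made for illustration.

```ruby
# Hypothetical model of the proposed rule.
#   source          - the complete Regexp source String
#   option_encoding - the Encoding implied by a modifier (u, s, e, n), or nil
def proposed_regexp_encoding(source, option_encoding = nil)
  # Non-US-ASCII content: the source string's own encoding is honored.
  return source.encoding unless source.ascii_only?

  # US-ASCII content only: fall back to the option, then to US-ASCII.
  option_encoding || Encoding::US_ASCII
end

ascii_1251 = "a b c".encode("windows-1251")  # ASCII-only, tagged Windows-1251
cyrillic   = "a \u0424 c"                    # UTF-8, contains U+0424

puts proposed_regexp_encoding(ascii_1251).name                    # US-ASCII
puts proposed_regexp_encoding(ascii_1251, Encoding::EUC_JP).name  # EUC-JP
puts proposed_regexp_encoding(cyrillic, Encoding::EUC_JP).name    # UTF-8
```

Under this model the first two calls match the US-ASCII and EUC-JP results reported in the description for the ASCII-only Windows-1251 fragment.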

Updated by matz (Yukihiro Matsumoto) about 2 years ago
#9 [ruby-core:118224]

I think encoding modifiers for Regexp should be deprecated (and gradually removed), although the bug should be fixed anyway.

Matz.

Updated by naruse (Yui NARUSE) about 2 years ago
#10 [ruby-core:118549]

Since this feature is not widely used and will not become widely used, how about we keep this as is?
After a while, this feature should be removed, like $KCODE and other deprecated encoding features.

Updated by matz (Yukihiro Matsumoto) about 2 years ago
#11 [ruby-core:118550]

Status changed from Open to Closed

It seems to be reasonable. Accepted.

Matz.

Related to Ruby - Misc #20406: Question about Regexp encoding negotiation (Open)
Related to Ruby - Bug #20466: Interpolated regular expressions have different encoding than interpolated strings (Open)
