Misc #20407

closed

Question about applying encoding modifier to an interpolated Regexp

Added by andrykonchin (Andrew Konchin) 8 months ago. Updated 5 months ago.

Status: Closed
Assignee: -
[ruby-core:117431]

Description

I am wondering how the Regexp encoding modifiers (u, s, e, n) interact with the encoding negotiation between the parts/fragments of an interpolated Regexp literal.
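
For reference, here is a minimal sketch of what the modifiers do on a plain, non-interpolated, ASCII-only literal, where no negotiation between fragments is involved. The helper name `literal_encodings` is hypothetical and only groups the results; the n modifier is left out of this sketch.

```ruby
# Hypothetical helper (not part of Ruby): collects the encoding reported by
# ASCII-only, non-interpolated literals with and without an encoding modifier.
def literal_encodings
  {
    none: /a/.encoding,   # no modifier
    u:    /a/u.encoding,  # UTF-8 modifier
    s:    /a/s.encoding,  # Windows-31J modifier
    e:    /a/e.encoding,  # EUC-JP modifier
  }
end

literal_encodings.each { |modifier, enc| puts "#{modifier}: #{enc}" }
# none: US-ASCII
# u: UTF-8
# s: Windows-31J
# e: EUC-JP

# A modifier also marks the Regexp as having a fixed encoding.
p((/a/).fixed_encoding?)   # false
p((/a/e).fixed_encoding?)  # true
```

The interpolated cases below are the ones where the result is less obvious.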

Example #1

# encoding: us-ascii

# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

Example #2

# encoding: utf-8

# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the encoding would be US-ASCII without the modifier:

# encoding: us-ascii

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

# encoding: utf-8

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

And the e modifier doesn't change the Regexp's final encoding in any of the other cases, where the Regexp's encoding without the modifier is either a file source encoding or ASCII-8BIT.

Looking at the following example:

# encoding: us-ascii

# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding                                 # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding              # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding                                     # ASCII-8BIT

# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding                                # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding             # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding                                    # ASCII-8BIT

we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier is applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, rather than to the whole result after negotiation.

Could you please clarify how it works?


Related issues 2 (2 open, 0 closed)

Related to Ruby master - Misc #20406: Question about Regexp encoding negotiation (Open)
Related to Ruby master - Bug #20466: Interpolated regular expressions have different encoding than interpolated strings (Open)
Actions #5

Updated by Eregon (Benoit Daloze) 8 months ago

  • Related to Misc #20406: Question about Regexp encoding negotiation added
Actions #6

Updated by Eregon (Benoit Daloze) 7 months ago

  • Related to Bug #20466: Interpolated regular expressions have different encoding than interpolated strings added

Updated by naruse (Yui NARUSE) 7 months ago

I checked the related source code, in particular rb_reg_preprocess_dregexp. It wrongly calls rb_reg_preprocess in a way that overwrites fixed_enc instead of inheriting it.

It seems it should raise an error if the resulting encoding of the regexp is anything other than EUC-JP in this case.
(The US-ASCII case should also raise an error or show a warning, by comparison with the behavior of //n.)

I'm still wondering whether we should fix this issue, because there is a trade-off between compatibility and the merit of this improvement.

Updated by nobu (Nobuyoshi Nakada) 6 months ago

I think:

  1. If a Regexp source string contains non-US-ASCII chars, the source string encoding is honored.
  2. If the source string contains only US-ASCII chars, it falls back to
    a. an encoding option if given.
    b. US-ASCII.
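
The rule above can be restated as a small sketch. The method name `expected_regexp_encoding` and its parameters are hypothetical; this is a model of the two steps as written, not Ruby's actual implementation, and it does not attempt to model negotiation between individual fragments.

```ruby
# Hypothetical model of the rule described above; NOT Ruby's implementation.
#
# source          - the full Regexp source as a String
# option_encoding - the Encoding implied by a modifier (e.g. EUC-JP for /e),
#                   or nil when no modifier is given
def expected_regexp_encoding(source, option_encoding = nil)
  # 1. Non-US-ASCII characters present: the source string's encoding is honored.
  return source.encoding unless source.ascii_only?

  # 2. US-ASCII characters only: fall back to the modifier's encoding
  #    if given (a), otherwise to US-ASCII (b).
  option_encoding || Encoding::US_ASCII
end

ascii_only = "a b c".encode("windows-1251")
non_ascii  = "a \xd4 c".dup.force_encoding("windows-1251")

puts expected_regexp_encoding(ascii_only)                    # US-ASCII
puts expected_regexp_encoding(ascii_only, Encoding::EUC_JP)  # EUC-JP
puts expected_regexp_encoding(non_ascii, Encoding::EUC_JP)   # Windows-1251
```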

Updated by matz (Yukihiro Matsumoto) 6 months ago

I think encoding modifiers for Regexp should be deprecated (and gradually removed), although the bug should be fixed anyway.

Matz.

Updated by naruse (Yui NARUSE) 5 months ago

Since this feature is not widely used and will not become widely used, how about keeping this as is?
After a while, this feature should be removed, like $KCODE and other deprecated encoding features.

Updated by matz (Yukihiro Matsumoto) 5 months ago

  • Status changed from Open to Closed

That seems reasonable. Accepted.

Matz.
