Bug #20504


Interpolated string literal in regexp encoding handling

Added by kddnewton (Kevin Newton) about 2 months ago. Updated about 1 month ago.


There is some very odd behavior here, and I'm not sure whether it is intentional, so I'm looking for guidance. In this example:

# encoding: us-ascii

interp = "\x80"
regexp = /#{interp}/

the regexp variable is an ASCII-8BIT regular expression with the byte interpolated into the middle. However, if you inline that interpolation:

# encoding: us-ascii

regexp = /#{"\x80"}/

you get a syntax error, saying it's an invalid multi-byte character. I'm not sure what the rule is here, as it seems inconsistent. Is this the correct behavior?

I would prefer if it would create an ascii-8bit regular expression like the first example, which would be consistent.
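For anyone who wants to compare the two cases in a single run, here is a rough sketch. It assumes that eval treats the encoding of the source string as the source encoding, so forcing US-ASCII stands in for the magic comment. The eval_us_ascii helper is hypothetical and exists only for this illustration.

```ruby
# Hypothetical helper: evaluate a snippet as if its file started with
# `# encoding: us-ascii`, by forcing the encoding of the source string.
def eval_us_ascii(source)
  eval(source.dup.force_encoding(Encoding::US_ASCII))
end

# Case 1: the byte reaches the regexp through a local variable.
# Single quotes keep \x80 and #{...} literal until eval parses them.
via_variable = eval_us_ascii('interp = "\x80"; /#{interp}/')

# Case 2: the same string literal inlined into the interpolation.
# Affected versions reject this at parse time, so capture the error.
inlined =
  begin
    eval_us_ascii('/#{"\x80"}/')
  rescue SyntaxError => e
    e
  end

puts "variable: #{via_variable.encoding}"
if inlined.is_a?(Regexp)
  puts "inlined:  #{inlined.encoding}"
else
  puts "inlined:  SyntaxError (#{inlined.message})"
end
```

On a Ruby that shows the inconsistency, the second line reports a SyntaxError while the first reports a binary regexp.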

Updated by Eregon (Benoit Daloze) about 2 months ago

Agreed, the current behavior breaks referential transparency and unexpectedly analyzes string literals inside interpolated parts.
This adds confusion and, I would think, has no value in real-world uses of interpolated regexps, because it only causes an error where there would otherwise be none.

So I think this is a bug and the implementation should not analyze those parts and consequently the behavior should be the same as with the extra local variable.

Updated by Eregon (Benoit Daloze) about 2 months ago

  • Tracker changed from Misc to Bug
  • Backport set to 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN

Updated by kddnewton (Kevin Newton) about 2 months ago

I'm fine with it analyzing the string literals, I would just prefer it take the same codepath as the interpolated variable case, in which it would produce an ascii-8bit regular expression as opposed to raising an error.

Updated by mame (Yusuke Endoh) about 1 month ago

Discussed at the dev meeting, and @matz (Yukihiro Matsumoto) said /#{"\x80"}/ should not raise a SyntaxError but return a binary encoded regexp object.
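As an illustration of what a "binary encoded regexp object" looks like in practice, here is a small sketch. It builds the regexp with Regexp.new from a binary string instead of a literal, which is an assumption made so that it also runs on versions that still raise for the inlined form.

```ruby
# Build a regexp from a one-byte binary string (String#b returns an
# ASCII-8BIT copy), standing in for the result of /#{"\x80"}/.
binary_regexp = Regexp.new("\x80".b)

# Encoding::BINARY is an alias of Encoding::ASCII_8BIT.
puts binary_regexp.encoding == Encoding::BINARY

# A binary regexp matches the raw byte inside a binary string.
puts binary_regexp.match?("a\x80b".b)
```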

