Bug #20504
Closed
Interpolated string literal in regexp encoding handling
Description
There is some very odd behavior here, and I'm not sure whether it is intentional, so I'm looking for guidance. Given this code:
# encoding: us-ascii
interp = "\x80"
regexp = /#{interp}/
the regexp variable is an ascii-8bit regular expression with the byte interpolated into the middle. However, if you inline that interpolation:
# encoding: us-ascii
regexp = /#{"\x80"}/
you get a syntax error, saying it's an invalid multi-byte character. I'm not sure what the rule is here, as it seems inconsistent. Is this the correct behavior?
I would prefer if it would create an ascii-8bit regular expression like the first example, which would be consistent.
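For reference, the variable case can be checked with a short script. This is only a sketch: it uses String#b to build the binary string explicitly, which is an assumption standing in for the "# encoding: us-ascii" magic comment, so that the snippet does not depend on the source encoding of the file it is pasted into.

```ruby
# Assumption: "\x80".b stands in for the "\x80" literal written under
# `# encoding: us-ascii`; both give a one-byte ASCII-8BIT (binary) string.
interp = "\x80".b

# Interpolating through a local variable builds the regexp at runtime.
regexp = /#{interp}/

# The regexp takes on the binary encoding of the interpolated string.
puts interp.encoding == Encoding::ASCII_8BIT
puts regexp.encoding == Encoding::ASCII_8BIT

# It matches the raw byte inside another binary string.
puts regexp.match?("a\x80b".b)
```

The comparison is against the Encoding::ASCII_8BIT constant rather than a printed name, since the display name of that encoding differs between Ruby versions.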
Updated by Eregon (Benoit Daloze) 10 months ago
Agreed, the current behavior breaks referential transparency and unexpectedly analyzes string literals inside interpolated parts.
This leads to extra confusion and I would think has no value in real-world usages of interpolated regexps (because it causes an error instead of none).
So I think this is a bug: the implementation should not analyze those parts, and the behavior should be the same as with the extra local variable.
Updated by Eregon (Benoit Daloze) 10 months ago
- Tracker changed from Misc to Bug
- Backport set to 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
Updated by kddnewton (Kevin Newton) 10 months ago
I'm fine with it analyzing the string literals, I would just prefer it take the same codepath as the interpolated variable case, in which it would produce an ascii-8bit regular expression as opposed to raising an error.
Updated by mame (Yusuke Endoh) 9 months ago
Discussed at the dev meeting, and @matz (Yukihiro Matsumoto) said /#{"\x80"}/ should not raise a SyntaxError but return a binary encoded regexp object.
Updated by nobu (Nobuyoshi Nakada) 2 months ago
- Status changed from Open to Closed
Applied in changeset git|6bbb470dc77a671c67411a5d3a2564bd0a665a9c.
[Bug #20504] Move dynamic regexp concatenation to iseq compiler