Project

General

Profile

Actions

Feature #18788

closed

Support passing Regexp options as String to Regexp.new

Added by janosch-x (Janosch Müller) about 1 month ago. Updated 7 days ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108590]

Description

Current situation

Regexp.new takes an integer as second argument which needs to be ORed together from multiple constants:

Regexp.new('foo', Regexp::IGNORECASE | Regexp::MULTILINE | Regexp::EXTENDED) # => /foo/imx

Any other non-nil value is treated as i flag:

Regexp.new('foo', Object.new) # => /foo/i

Suggestion

Regexp.new should support passing the regexp flags not only as an Integer, but also as a String, like so:

Regexp.new('foo', 'i')   # => /foo/i
Regexp.new('foo', 'imx') # => /foo/imx

# edge cases
Regexp.new('foo', 'iii') # => /foo/i
Regexp.new('foo', '')    # => /foo/

# unsupported flags should probably emit a warning
Regexp.new('foo', 'jmq') # => /foo/m
Regexp.new('foo', '-m')  # => /foo/m

Reasons

  1. The constants are a bit cumbersome to use, particularly when building the regexp from variable data:
def make_regexp(regexp_body, opt_string)
  opt_int = 0
  opt_int |= Regexp::IGNORECASE if opt_string.include?('i')
  opt_int |= Regexp::MULTILINE  if opt_string.include?('m')
  opt_int |= Regexp::EXTENDED   if opt_string.include?('x')

  Regexp.new(regexp_body, opt_int)
end
  1. Passing a String is already silently accepted, and people might get the wrong impression that it works:
Regexp.new('foo', 'i') # => /foo/i

... but it doesn't really work:

Regexp.new('foo', 'x') # => /foo/i

Backwards compatibility

This change would not be fully backwards compatible.

Code that relies on the second argument being a String which does not contain "i" in order to make the Regexp case insensitive would break.

Note: originally I suggested supporting Symbols in the same way as Strings, but removed that in light of the discussion.

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

+1, I've often wanted that.

Updated by sawa (Tsuyoshi Sawada) about 1 month ago

I think this is a good idea, but I also think Regexp.new is not used so often so it does not matter if we have such feature or not. Most of the time, a regex is created using a regex literal. What is the use case of Regexp.new?

Updated by janosch-x (Janosch Müller) about 1 month ago

I agree that the use cases are fairly limited. They are not totally uncommon, though.

Regexp.new is mostly needed for recursion on Ruby and metaprogramming, and is used this way e.g. in rubocop, ruby_parser, and parser.

Sometimes it is used to build Regexps based on input, e.g. in capybara, prawn, and psych. There might also be a few CMSes that allow admins to type in validation patterns.

Many people are also seemingly unaware that Regexp literals support interpolation, or maybe they just find this interpolation hard to read. Either way, they often use Regexp.new instead, passing it an interpolated or concatenated String or Regexp.union output, as can be seen e.g. in css_parser, haml, net-ssh, or uri. (css_parser actually uses the only coincidentally working "i" as a second argument.)

In some other cases, Regexp.new is used to avoid a SyntaxError or a warning on older Rubies, e.g. sinatra does this.

Updated by duerst (Martin Dürst) about 1 month ago

Please don't allow symbols. It may look cute (in some cases), but the options are essentially a set of single letters, and that's not what Symbols are about.

If you really want symbols, please make it something like

Regexp.new('foo', :ignorecase, :multiline, :extend) # => /foo/imx

Even something a bit shorter, such as the following, would be okay with me:

Regexp.new('foo', :ignore, :multi, :ext) # => /foo/imx

Updated by janosch-x (Janosch Müller) about 1 month ago

Please don't allow symbols. It may look cute (in some cases), but the options are essentially a set of single letters, and that's not what Symbols are about.

My reasoning for supporting them was not cuteness but avoiding confusion.

If we change only the processing of Strings, we will have this behavior:

Regexp.new('foo', 'i') # => /foo/i
Regexp.new('foo', 'm') # => /foo/m
Regexp.new('foo', :i)  # => /foo/i # looks like it also works
Regexp.new('foo', :m)  # => /foo/i # slightly surprising

I'm happy to support only Strings, though. In this case we might want to consider raising an ArgumentError or emitting a deprecation warning when a Symbol is passed, or even for anything that is not nil/true/false/Int/String?

please make it something like Regexp.new('foo', :ignorecase, :multiline, :extend) # => /foo/imx

I don't think this is a viable option. Regexp.new already accepts up to 3 arguments. The third one is undocumented as far as I can tell, but it is used in the wild. If a String starting with "n" or "N" is passed as third argument, the Regexp encoding is set to ASCII. Arguably that makes consistency another reason for my proposal.

Updated by nobu (Nobuyoshi Nakada) about 1 month ago

I agree that symbols should be disallowed.
The examples uses arbitrary combinations of flags and the order of the flags doesn't have meanings.
In other words, it is a set of chars and OK for a string.
But a symbol is not such thing.

For the compatibility sake, we'll need migration period to warn anything other than integer and valid string, and then error.

Updated by nobu (Nobuyoshi Nakada) about 1 month ago

Regexp.new("(?#{options}:#{code})")
Regexp.new(code, eval("//#{options}").options)

Updated by janosch-x (Janosch Müller) about 1 month ago

@nobu (Nobuyoshi Nakada)

Thank you for the explanation regarding symbols!

we'll need migration period to warn anything other than inter and valid string

I think the optional third argument should be also deprecated as described in #18797. Otherwise we might still want to allow nil as "stopgap" value for the second argument.

Regexp.new(code, eval("//#{options}").options)

This feels too unsafe for some of the cases mentioned above.

Regexp.new("(?#{options}:#{code})")

This is clever and will work for most of the use cases mentioned above. Just a few minor downsides:

  1. The group options syntax isn't so well-known and universally understood.
  2. A few use cases might need to preserve notation (e.g. codemod).
  3. The result may become verbose in case of recursion or nesting of Regexps.
Actions #9

Updated by janosch-x (Janosch Müller) about 1 month ago

  • Description updated (diff)

Updated by matz (Yukihiro Matsumoto) 11 days ago

Accepted. Unknown flags should raise errors.

Matz.

Actions #11

Updated by nobu (Nobuyoshi Nakada) 7 days ago

  • Status changed from Open to Closed

Applied in changeset git|1e9939dae24db232d6f3693630fa37a382e1a6d7.


[Feature #18788] Support options as String to Regexp.new

Regexp.new now supports passing the regexp flags not only as an
Integer, but also as a `String. Unknown flags raise errors.

Actions

Also available in: Atom PDF