Feature #19720
closedWarning for non-linear Regexps
Description
I believe the best way to solve ReDoS is to ensure all Regexps used in the process are linear.
Using Regexp.timeout = 5.0
or so does not really prevent ReDoS, given enough requests causing that timeout the servers will still be very unresponsive.
To this purpose, we should make it easy to identify non-linear Regexps and fix them.
I suggest we either use
- a performance warning (enabled with
Warning[:performance] = true
, #19538) or - a new regexp warning category (enabled with
Warning[:regexp] = true
).
I think we should warn only once per non-linear Regexp, to avoid too many such warnings.
We could warn as soon as the Regexp is created, or on first match.
On first match might makes more sense for Ruby implementations which compile the Regexp lazily (since that is costly during startup), and also avoids warning for Regexps which are never used (which can be good or bad).
OTOH, if the warning is enabled, we could always compile the Regexp eagerly (or at least checks whether it's linear), and that would then provide a better way to guarantee that all Regexps created so far are linear.
Because warnings are easily customizable, it is also possible to e.g. raise/abort
on such a warning, if one wants to ensure their application does not use a non-linear Regexp and so cannot be vulnerable to ReDoS:
Warning.extend Module.new {
def warn(message, category: nil, **)
raise message if category == :regexp
super
end
}
A regexp warning category seems better for that as it makes it easy to filter by category, if a performance warning one would need to match the message which is less clean.
As a note, TruffleRuby already has a similar warning, as a command-line option:
$ truffleruby --experimental-options --warn-truffle-regex-compile-fallback -e 'Gem'
truffleruby-dev/lib/mri/rubygems/version.rb:176: warning: Regexp /\A\s*([0-9]+(?>\.[0-9a-zA-Z]+)*(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?)?\s*\z/ at_start=false encoding=US-ASCII requires backtracking and will not match in linear time
truffleruby-dev/lib/mri/rubygems/requirement.rb:105: warning: Regexp /\A\s*(=|!=|>|<|>=|<=|~>)?\s*([0-9]+(?>\.[0-9a-zA-Z]+)*(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?)\s*\z/ at_start=false encoding=US-ASCII requires backtracking and will not match in linear time
So the warning message could be like
FILE:LINE: warning: Regexp /REGEXP/ requires backtracking and might not match in linear time and might cause ReDoS
or more concise:
FILE:LINE: warning: Regexp /REGEXP/ requires backtracking and might cause ReDoS
Updated by Eregon (Benoit Daloze) over 1 year ago
- Related to Feature #19694: Add Regexp#timeout= setter added
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
We introduced Regexp.linear_time?
in order to check the linear-ness by external utilities.
You can also check dynamically created regexps.
Regexp.prepend Module.new {
def initialize(...)
re = super
raise "nonlinear time regexp: #{re}" unless Regexp.linear_time?(re)
re
end
}
Regexp.new("(a)?\\1") #=> nonlear regexp: (?-mix:(a)?\1) (RuntimeError)
Updated by Eregon (Benoit Daloze) over 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-2:
We introduced
Regexp.linear_time?
in order to check the linear-ness by external utilities.You can also check dynamically created regexps.
Interesting.
However that initialize monkey-patch does not get called for literal Regexps (ruby -r./reghook.rb -e 'p /(a)?\1/' #=> /(a)?\1/
).
And even if it did, how could the external utility ensure it is loaded early enough, e.g., before RubyGems? It seems very difficult.
So while Regexp.linear_time?
is useful, I think it is far better to have a warning category in core for this.
I think that is the only way to guarantee an application does not use any non-linear Regexp.
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
Eregon (Benoit Daloze) wrote in #note-3:
nobu (Nobuyoshi Nakada) wrote in #note-2:
We introduced
Regexp.linear_time?
in order to check the linear-ness by external utilities.You can also check dynamically created regexps.
Interesting.
However that initialize monkey-patch does not get called for literal Regexps (ruby -r./reghook.rb -e 'p /(a)?\1/' #=> /(a)?\1/
).
Literals can be statically checkable.
And even if it did, how could the external utility ensure it is loaded early enough, e.g., before RubyGems? It seems very difficult.
We are thinking about the kind of static linter, such as rubocop.
Of course external utilities can’t check dynamically created regexp, the monkey-patch is for filling this gap.
Updated by duerst (Martin Dürst) over 1 year ago
Introducing such a warning might be a good idea. But there are several issues:
- The warning should only be used when asked for with an option (i.e. default off).
- To a very large extent, whether a regular expression is linear or not depends on the implementation. (The recent improvements to the CRuby implementation show that very clearly). Does that mean that different implementations would warn for different regular expressions?
- In some cases, it may not be possible to conclusively say whether a regular expression will run in linear time or not. The proposed warning text makes this clear with the word "might".
- Non-linear can be quadratic, cubic, or exponential, and so on. A quadratic case on data with limited length (e.g. out of a database with fixed field lengths) might be absolutely harmless. Even an exponential case on very short data can be harmless.
- In many cases, the only person who would do a DoS attack would be the programmer him/herself.
- Overall, careful design and implementation is needed to make sure that this doesn't become a "crying wolf" warning that quickly gets deactivated and then no longer helps anybody.
Updated by Eregon (Benoit Daloze) over 1 year ago
duerst (Martin Dürst) wrote in #note-5:
Introducing such a warning might be a good idea. But there are several issues:
Thanks for the feedback.
- The warning should only be used when asked for with an option (i.e. default off).
Yes, as I mentioned in the description, it would be opt-in.
- To a very large extent, whether a regular expression is linear or not depends on the implementation. (The recent improvements to the CRuby implementation show that very clearly). Does that mean that different implementations would warn for different regular expressions?
Yes it can differ, although the restrictions for linear engines are fairly similar, see https://bugs.ruby-lang.org/issues/19104#note-3.
- In some cases, it may not be possible to conclusively say whether a regular expression will run in linear time or not. The proposed warning text makes this clear with the word "might".
- Non-linear can be quadratic, cubic, or exponential, and so on. A quadratic case on data with limited length (e.g. out of a database with fixed field lengths) might be absolutely harmless. Even an exponential case on very short data can be harmless.
In general, anything non-linear with user input is highly susceptible to DoS. Even quadratic is often enough to be problematic.
- In many cases, the only person who would do a DoS attack would be the programmer him/herself.
There were about a dozen ReDoS CVEs in the last 1-2 years. I don't think the reporters were the gem maintainers.
I would think in some cases it was reported by people who actually got a ReDoS in production (it seems the typical way these things get noticed).
- Overall, careful design and implementation is needed to make sure that this doesn't become a "crying wolf" warning that quickly gets deactivated and then no longer helps anybody.
I think any company or individual which cares about security/availability should take these warnings seriously.
We need to make them actionable though so they get resolved.
At least we should address the non-linear Regexps in default and bundled gems, as well as in RubyGems.
For the RubyGems Regexps above, the issue is the atomic groups.
One idea would be to check if the Regexp without atomic groups is Regexp.linear_time?
and if so use it, that seems easy.
Another idea would be a way to specify atomic groups for some Regexp(s) can be ignored for semantics and are only meant for performance (seems almost always the case). That would then use the linear engine if the Regexp with atomic groups changed to non-capturing groups can run on that linear engine, otherwise use the original regexp with atomic groups and warn.
Updated by Eregon (Benoit Daloze) over 1 year ago
nobu (Nobuyoshi Nakada) wrote in #note-4:
Literals can be statically checkable.
And even if it did, how could the external utility ensure it is loaded early enough, e.g., before RubyGems? It seems very difficult.
We are thinking about the kind of static linter, such as rubocop.
Of course external utilities can’t check dynamically created regexp, the monkey-patch is for filling this gap.
Right, but the monkey-patch has the problem it does not check any dynamically-created regexp created before the monkey-patch was loaded.
So if e.g. RubyGems or a default gem creates dynamically-created regexps, then it would be very difficult to catch those.
Dynamically-created regexps seem fairly common, because it's both Regexp.new
but also interpolated Regexps (/...#{...}.../
).
So a static linter like RuboCop seems of limited usefulness to ensure no ReDoS.
Especially since whether a Regexp is linear depends on the Ruby implementation's Regexp engine: https://bugs.ruby-lang.org/issues/19104#note-3
Updated by matz (Yukihiro Matsumoto) over 1 year ago
I object. The warning would strongly discourage the usage of back references (or other extended regexp), which are very convenient sometimes.
Even though they could cause ReDOS, that does not mean they should not be used all the time.
Matz.
Updated by Eregon (Benoit Daloze) over 1 year ago
@matz (Yukihiro Matsumoto) Given ReDoS are AFAIK the most common (by far) security vulnerability in Ruby, it warrants a reliable way to detect it, fix it and avoid it.
This way seems the best to me.
In fact I think only this approach or something very similar would be able to truly address ReDoS in Ruby.
For instance https://www.ruby-lang.org/en/news/ shows there were 3 ReDoS in stdlib in just 3 months (28 March - 29 June).
The warning would be opt-in, like performance warnings, so it would not be disruptive.
For gem (and app) maintainers which care about security, I think it would make sense to enable the warning and potentially customize Warning.warn
for the :regexp category to raise
(as shown in description).
That would discourage gems to use these often unsafe and non-linear regexp features.
I think this is exactly what we want for the Ruby community to solve this very frequent and serious security problem.
People can still use back references (or other extended regexp) features in their local scripts/programs if they like, and it won't even warn them unless they choose to enable regexp warnings.
BTW the exact list of features not supported is at https://bugs.ruby-lang.org/issues/19104#note-3
I believe these Regexp features are not necessary for 99+% gems, and not using them is by far the easiest way to guarantee no ReDoS is possible.
As we know, it's very very hard to estimate if a Regexp is susceptible to ReDoS if it's not detected as linear by the regexp engine (IOW it's very easy for programmers to accidentally write a non-linear regexp without the help of this warning).
@matz (Yukihiro Matsumoto) Could you reconsider? Maybe I should attend a dev meeting to understand what are the concerns about this feature?
Updated by matz (Yukihiro Matsumoto) over 1 year ago
If "non-linear" means real non-linear, I would not object. But in this case, “non-linear” means the recent memoize optimization cannot be applied. So if we add this warning, we will get a lot of false positives from safe regular expressions. That's my biggest concern.
If we provide the dedicated warning option for this “non-linear” regular expressions, it's barely acceptable only if false positives are clearly described in the document.
Matz.
Updated by Dan0042 (Daniel DeLorme) over 1 year ago
I agree with Matz, false positives are worse than no warnings at all. I use atomic groups and positive/negative lookahead/lookbehind all the time. In particular, atomic groups are a great tool to prevent ReDoS-susceptible regexps, so to warn against them would be really counterproductive.
But I think this "non-linear" flag could serve as a good first line of defense, by enabling a timeout by default. And "linear" regexps don't need the overhead of a timeout.
Updated by mame (Yusuke Endoh) about 1 year ago
- Status changed from Open to Rejected