Feature #20031
closedRegexp using greedy quantifier and unions on a big string uses a lot of memory
Description
Trying to match on some regexp using a greedy quantifier on any character (.*
) and unions (foo|bar
) uses a lot of memory if the string we want to match on is big.
Reproduction code¶
puts RUBY_DESCRIPTION
unless File.exist?("/tmp/fake_email.eml")
content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}"
File.write("/tmp/fake_email.eml", content)
end
content = File.read("/tmp/fake_email.eml")
def print_size(context, sike_kb)
puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB"
end
print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
print_size("content", content.bytesize / 1024)
REGEXP_UNION = Regexp.union(
"Accept-Language:",
"Thread-Topic:",
).freeze
reg = /.*#{REGEXP_UNION}/
p reg
GC.start
print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
allocated = GC.stat(:total_allocated_objects)
p content.match(reg)
p GC.stat(:total_allocated_objects) - allocated
GC.start
print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
p "---"
Output:
$ /usr/bin/time -v ruby full_repro.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
baseline: 71.3 MB
content: 47.7 MB
/.*(?-mix:Accept\-Language:|Thread\-Topic:)/
before match?: 71.3 MB
#<MatchData "Accept-Language:">
14
after match?: 71.6 MB
"---"
Command being timed: "ruby full_repro.rb"
User time (seconds): 1.15
System time (seconds): 0.74
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2423108
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 609575
Voluntary context switches: 52
Involuntary context switches: 12
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
strace:
$ sudo strace -p 588834
strace: Process 588834 attached
[...]
mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000
mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000
mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000
mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000
mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000
mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000
mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000
mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000
mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000
mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000
mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000
mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000
mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000
mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000
mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000
mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000
munmap(0x7f8cae4a8000, 4026535936) = 0
munmap(0x7f8e92224000, 12500992) = 0
writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32
[...]
We can see that the memory retained by Ruby is low (71.6 MB), but content.match(reg)
used 2_423_108 kbytes (~2.4 GB) in the regex engine.
Findings¶
To reproduce we need:
- a greedy quantifier on any character (
.*
) - a union with at least two members (
foo|bar
). More members won't increase the memory consumption further - to perform the match on a long string
The memory seems to increase linearly against the string size:
String size | Memory consumed |
---|---|
11.9 MB | 629_988 kbytes |
23.8 MB | 1_235_976 kbytes |
47.7 MB | 2_423_076 kbytes |
I was able to reproduce it on old Ruby versions as well.
@byroot (Jean Boussier) helped me to put all this together and according to his investigation the memory allocation would be coming from regexec.c:4100
(https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/regexec.c#L4100):
size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1;
uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t));
It seems surprising to use that much memory compared to the string size. Is it expected?