Project

General

Profile

Actions

Feature #20031

closed

Regexp using greedy quantifier and unions on a big string uses a lot of memory

Added by FabienChaynes (Fabien Chaynes) 7 months ago. Updated 6 months ago.

Status:
Feedback
Assignee:
-
Target version:
-
[ruby-core:115550]

Description

Trying to match on some regexp using a greedy quantifier on any character (.*) and unions (foo|bar) uses a lot of memory if the string we want to match on is big.

Reproduction code

puts RUBY_DESCRIPTION

unless File.exist?("/tmp/fake_email.eml")
  content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}"
  File.write("/tmp/fake_email.eml", content)
end

content = File.read("/tmp/fake_email.eml")

def print_size(context, sike_kb)
  puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB"
end

print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)

print_size("content", content.bytesize / 1024)

REGEXP_UNION = Regexp.union(
  "Accept-Language:",
  "Thread-Topic:",
).freeze
reg = /.*#{REGEXP_UNION}/
p reg

GC.start
print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
allocated = GC.stat(:total_allocated_objects)

p content.match(reg)

p GC.stat(:total_allocated_objects) - allocated
GC.start
print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
p "---"

Output:

$ /usr/bin/time -v ruby full_repro.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
baseline: 71.3 MB
content: 47.7 MB
/.*(?-mix:Accept\-Language:|Thread\-Topic:)/
before match?: 71.3 MB
#<MatchData "Accept-Language:">
14
after match?: 71.6 MB
"---"
  Command being timed: "ruby full_repro.rb"
  User time (seconds): 1.15
  System time (seconds): 0.74
  Percent of CPU this job got: 100%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 2423108
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 609575
  Voluntary context switches: 52
  Involuntary context switches: 12
  Swaps: 0
  File system inputs: 0
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

strace:

$ sudo strace -p 588834
strace: Process 588834 attached
[...]
mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000
mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000
mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000
mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000
mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000
mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000
mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000
mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000
mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000
mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000
mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000
mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000
mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000
mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000
mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000
mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000
munmap(0x7f8cae4a8000, 4026535936)      = 0
munmap(0x7f8e92224000, 12500992)        = 0
writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32
[...]

We can see that the memory retained by Ruby is low (71.6 MB), but content.match(reg) used 2_423_108 kbytes (~2.4 GB) in the regex engine.

Findings

To reproduce we need:

  • a greedy quantifier on any character (.*)
  • a union with at least two members (foo|bar). More members won't increase the memory consumption further
  • to perform the match on a long string

The memory seems to increase linearly against the string size:

String size Memory consumed
11.9 MB 629_988 kbytes
23.8 MB 1_235_976 kbytes
47.7 MB 2_423_076 kbytes

I was able to reproduce it on old Ruby versions as well.

@byroot (Jean Boussier) helped me to put all this together and according to his investigation the memory allocation would be coming from regexec.c:4100 (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/regexec.c#L4100):

size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1;
uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t));

It seems surprising to use that much memory compared to the string size. Is it expected?

Actions

Also available in: Atom PDF

Like0
Like0Like1Like1