Feature #21943
Status: open
Add StringScanner#get_int to extract capture group as Integer without intermediate String
Description
Motivation
The date library is being rewritten from C to pure Ruby. During this effort, Date._strptime was identified as a major performance bottleneck. Profiling revealed that the root cause is the overhead of extracting capture groups as Strings and then converting them to Integers:
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc[1].to_i # allocates String "2024", converts to Integer, discards String
mon = sc[2].to_i # allocates String "06", converts to Integer, discards String
mday = sc[3].to_i # allocates String "15", converts to Integer, discards String
Each sc[n].to_i call allocates a temporary String object that is immediately discarded. When parsing dates, only the integer values are needed — the intermediate Strings serve no purpose.
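The allocation cost described above can be observed with a small script. This is a minimal sketch, assuming CRuby and using `GC.stat(:total_allocated_objects)` as the counter; `allocated_objects` is a hypothetical helper introduced here for illustration, not part of strscan or date.

```ruby
require "strscan"

# Hypothetical helper: number of objects allocated while the block runs.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

sc = StringScanner.new("2024-06-15")
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)

year = mon = mday = nil
allocated = allocated_objects do
  year = sc[1].to_i # sc[1] builds a temporary String before to_i runs
  mon  = sc[2].to_i
  mday = sc[3].to_i
end

puts "year=#{year} mon=#{mon} mday=#{mday}"
puts "objects allocated: #{allocated}"
```

The exact count may vary between Ruby versions, but it should be at least one object per capture group read.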
In the C implementation of date, matched byte ranges are converted directly to integers without any String allocation. The pure Ruby version cannot do this with the current StringScanner API.
Proposal
Add StringScanner#get_int(index) that returns the captured substring at the given index as an Integer, converting directly from the matched byte range at the C level without allocating an intermediate String object.
scanner = StringScanner.new("2024-06-15")
scanner.scan(/(\d{4})-(\d{2})-(\d{2})/)
scanner.get_int(1) # => 2024
scanner.get_int(2) # => 6
scanner.get_int(3) # => 15
It returns nil in the same cases where scanner[index] would return nil (no match, index out of range, optional group did not participate).
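The nil cases can be illustrated with a pure-Ruby stand-in. This is only a sketch of the intended return values: `get_int_emulated` is a hypothetical helper that still allocates the intermediate String, so it models the semantics of the proposal, not its performance.

```ruby
require "strscan"

# Hypothetical stand-in for the proposed get_int: same return values as
# described above, but it still goes through scanner[index] and a String.
def get_int_emulated(scanner, index)
  captured = scanner[index]
  captured&.to_i
end

scanner = StringScanner.new("2024-06")
scanner.scan(/(\d{4})-(\d{2})(?:-(\d{2}))?/)

p get_int_emulated(scanner, 1) # => 2024
p get_int_emulated(scanner, 2) # => 6
p get_int_emulated(scanner, 3) # => nil (optional group did not participate)
p get_int_emulated(scanner, 9) # => nil (index out of range)

scanner.scan(/no match here/)
p get_int_emulated(scanner, 1) # => nil (last match failed)
```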
Use case
The primary use case is Date._strptime in the pure Ruby date library. The fast path for %Y-%m-%d format currently does:
# Current: 3 temporary String allocations
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc[1].to_i
mon = sc[2].to_i
mday = sc[3].to_i
With get_int:
# Proposed: 0 temporary String allocations
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc.get_int(1)
mon = sc.get_int(2)
mday = sc.get_int(3)
This pattern appears throughout _strptime for every date/time component (%H, %M, %S, %m, %d, etc.), so the cumulative impact is significant.
Benchmark
Environment: Ruby 4.0.1, x86_64-linux
| Operation | i/s | per iteration | Comparison |
|---|---|---|---|
| sc.get_int(n) | 1,029,041.7 | 971.78 ns/i | (Reference) |
| sc[n].to_i | 791,945.6 | 1.26 μs/i | 1.30x slower |
get_int is 1.30x faster than sc[n].to_i for a typical date parsing scenario (3 capture groups). The improvement comes from eliminating the 3 temporary String allocations that each iteration would otherwise make, one per capture group.
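A comparison of this kind could be reproduced with a script along the following lines. This is a rough sketch, not the benchmark used for the table above: the iteration count, the `ns_per_iteration` helper, and the choice to include `reset` and `scan` inside the timed loop are all assumptions. Because `get_int` is only proposed, that branch runs only on a strscan build that defines it.

```ruby
require "strscan"

ITERATIONS = 100_000
PATTERN = /(\d{4})-(\d{2})-(\d{2})/

# Hypothetical helper: average nanoseconds per iteration of the block.
def ns_per_iteration(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  iterations.times { yield }
  finish = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  (finish - start).fdiv(iterations)
end

sc = StringScanner.new("2024-06-15")
results = {}

results["sc[n].to_i"] = ns_per_iteration(ITERATIONS) do
  sc.reset
  sc.scan(PATTERN)
  sc[1].to_i
  sc[2].to_i
  sc[3].to_i
end

# get_int is the proposed method; skip it when the scanner lacks it.
if sc.respond_to?(:get_int)
  results["sc.get_int(n)"] = ns_per_iteration(ITERATIONS) do
    sc.reset
    sc.scan(PATTERN)
    sc.get_int(1)
    sc.get_int(2)
    sc.get_int(3)
  end
end

results.each { |label, ns| puts format("%-14s %8.1f ns/i", label, ns) }
```

Absolute numbers depend on the machine and Ruby build, so only the ratio between the two rows is meaningful.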
In the context of Date._strptime("%Y-%m-%d"), this overhead is a significant portion of the total parse time, as shown in earlier profiling:
| Operation | Time |
|---|---|
| C ext _strptime (reference) | 408 ns |
| SC.new + scan + captures + .to_i x3 | 1,210 ns |
| Pure Ruby _strptime_ymd total | 1,290 ns |
The capture extraction + .to_i conversion accounts for roughly 40% of the total parse time. get_int directly reduces this portion.
Implementation
A working implementation is available. It reuses the same index resolution logic as StringScanner#[] (including negative indices) but calls rb_cstr2inum on the matched byte range instead of extract_range, avoiding String object allocation entirely.
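The idea of converting a byte range directly, without building a substring, can be sketched in Ruby as follows. This is an illustration only: `int_from_byte_range` is a hypothetical helper, it handles plain ASCII decimal digits and nothing else, and it does not reproduce the parsing rules of `rb_cstr2inum` or the actual patch.

```ruby
# Hypothetical helper: read ASCII decimal digits from the bytes
# str[start, length] one at a time, without creating a substring.
# Returns nil if the range is empty or contains a non-digit byte.
def int_from_byte_range(str, start, length)
  return nil if length <= 0

  value = 0
  length.times do |offset|
    byte = str.getbyte(start + offset)
    return nil if byte.nil? || byte < 48 || byte > 57 # outside "0".."9"
    value = value * 10 + (byte - 48)
  end
  value
end

date = "2024-06-15"
p int_from_byte_range(date, 0, 4) # => 2024
p int_from_byte_range(date, 5, 2) # => 6
p int_from_byte_range(date, 8, 2) # => 15
```

StringScanner does not expose the byte offsets of capture groups to Ruby code, which is why this approach needs support at the C level.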