Feature #20576
Closed
Add MatchData#bytebegin and MatchData#byteend
Description
I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.
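To make the character/byte distinction concrete, here is a minimal sketch. The helper name whole_match_byte_span is hypothetical and not part of the proposal. It assumes that MatchData#bytebegin and MatchData#byteend exist only on Ruby builds that include this feature, and falls back to MatchData#byteoffset (Ruby 3.2+) or a manual computation otherwise.

```ruby
# Hypothetical helper (not part of the proposal): returns
# [begin, end] of the whole match, measured in bytes.
def whole_match_byte_span(m)
  if m.respond_to?(:bytebegin)
    # The methods proposed in this ticket.
    [m.bytebegin(0), m.byteend(0)]
  elsif m.respond_to?(:byteoffset)
    # Existing API; returns both offsets as a two-element Array.
    m.byteoffset(0)
  else
    # Portable fallback for older Rubies.
    start = m.pre_match.bytesize
    [start, start + m[0].bytesize]
  end
end

m = /abc/.match("あいう abc")   # each kana is 3 bytes in UTF-8

p [m.begin(0), m.end(0)]        # character offsets => [4, 7]
p whole_match_byte_span(m)      # byte offsets      => [10, 13]
```

The two results differ because the three kana before the match occupy three characters but nine bytes.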
Pull request: https://github.com/ruby/ruby/pull/10973
One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
Here's a benchmark result:
voyager:ruby$ cat b.rb
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end
  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
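The loop in the benchmark generalizes to a small tokenizer, which is closer to the scanning use case mentioned above. This is a sketch only: scan_tokens and its default pattern are hypothetical names chosen for illustration, String#byteindex requires Ruby 3.2 or later, and byteend is called only when the running Ruby provides it.

```ruby
# Illustrative sketch; scan_tokens and the default pattern are hypothetical.
# Requires String#byteindex (Ruby 3.2+).
def scan_tokens(text, pattern = /\G(?:\d+|\s+|[^\d\s]+)/)
  tokens = []
  pos = 0 # current scan position, in bytes
  # \G anchors each match at pos, so no input is skipped. The pattern
  # never matches the empty string, so pos advances on every iteration.
  while text.byteindex(pattern, pos)
    m = Regexp.last_match
    tokens << m[0]
    pos = if m.respond_to?(:byteend)
            m.byteend(0)        # end offset only, no intermediate Array
          else
            m.byteoffset(0)[1]  # builds [begin, end], then takes the end
          end
  end
  tokens
end
```

Calling scan_tokens("あい 12 x") splits the input into runs of digits, runs of whitespace, and runs of everything else. Because pos is always set to the end of a match, it stays on a character boundary, which byteindex requires.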
Updated by Eregon (Benoit Daloze) 10 months ago
Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?
Regarding naming, byteend seems hard to read, I think byte_begin/byte_end is much clearer.
Updated by shugo (Shugo Maeda) 10 months ago
Eregon (Benoit Daloze) wrote in #note-1:
Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?
I guess the difference doesn't matter so much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.
Regarding naming, byteend seems hard to read, I think byte_begin/byte_end is much clearer.
I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.
Updated by matz (Yukihiro Matsumoto) 10 months ago
I understand the use-case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?
Matz.
Updated by shugo (Shugo Maeda) 10 months ago
matz (Yukihiro Matsumoto) wrote in #note-3:
I understand the use-case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?
I came up with the names begin_in_bytes and end_in_bytes, but byte_begin/byte_end suggested by Eregon may be better.
Updated by matz (Yukihiro Matsumoto) 9 months ago
OK. I didn't like the names (especially byteend), but after looking at them for a while I got used to them and am ready to compromise.
Matz.
Updated by shugo (Shugo Maeda) 9 months ago
- Status changed from Open to Closed
Applied in changeset git|e048a073a3cba04576b8f6a1673c283e4e20cd90.
Add MatchData#bytebegin and MatchData#byteend
These methods return the byte-based offset of the beginning or end of the specified match.
[Feature #20576]