Feature #19102
closedOptimize ERB::Util.html_escape more than CGI.escapeHTML for template engines
Description
Proposal¶
Change the behavior of ERB::Util.html_escape
in the following two parts:
- Skip converting an argument with
#to_s
if the argument is already aT_STRING
. - Do not allocate and return a new String when nothing needs to be escaped.
Background¶
The current ERB::Util.html_escape
is implemented as CGI.escapeHTML(s.to_s)
. So the performance is almost equal to CGI.escapeHTML
except for the to_s
call. Because it's common to embed non-String expressions in template engines, a template engine typically calls to_s
to convert non-String expressions to a String before escaping it, which is why the difference exists. Proposal (1) is useful for optimizing the case that something that's already a String
is embedded. We ignore the extreme case that String#to_s
is weirdly monkey-patched.
As to proposal (2), my original implementation of CGI.escapeHTML
https://github.com/ruby/ruby/pull/1164 was not calling rb_str_dup
for that case. However, because [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility with the old gsub
-based implementation, we added the unneeded rb_str_dup
call and the performance for that case has been compromised. This behavior is completely unnecessary for template engines. On the other hand, because ERB::Util.html_escape
is a helper for ERB, we should not need to consider any backward compatibility that is not relevant to ERB or any template engines. So proposal (2) should be possible in ERB::Util.html_escape
unlike CGI.escapeHTML
.
Benchmark¶
Implementation: https://github.com/ruby/erb/pull/27
require 'benchmark/ips'
require 'erb'
class << ERB::Util
def html_escape_old(s)
CGI.escapeHTML(s.to_s)
end
end
Benchmark.ips do |x|
s = 'hello world'
x.report('before') { ERB::Util.html_escape_old(s) }
x.report('after') { ERB::Util.html_escape(s) }
x.compare!
end
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
Warming up --------------------------------------
before 1.066M i/100ms
after 1.879M i/100ms
Calculating -------------------------------------
before 10.615M (± 0.3%) i/s - 53.320M in 5.023083s
after 18.742M (± 0.4%) i/s - 93.929M in 5.011847s
Comparison:
after: 18741747.6 i/s
before: 10615137.1 i/s - 1.77x (± 0.00) slower
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Description updated (diff)
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Status changed from Open to Closed
Applied in changeset git|20efeaddbe246f3b2eaee4f17f54a814777176a8.
[ruby/erb] Optimize away to_s if it's already T_STRING
[Feature #19102]https://github.com/ruby/erb/commit/38c6e182fb
Updated by Eregon (Benoit Daloze) about 2 years ago
- Related to Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML added
Updated by Eregon (Benoit Daloze) about 2 years ago
I think it is unfortunate to add a C extension for ERB for that, ERB was always pure-Ruby and that was nice.
Also the C extension is slower on TruffleRuby, the Regexp is actually JIT-compiled and can use vectorization, unlike that C code. Also part of it is RSTRING_PTR() basically forces a copy from managed memory (byte[]) to native memory (char*) on TruffleRuby.
truffleruby 23.0.0-dev-57e53f8a, like ruby 3.1.2, GraalVM CE Native [x86_64-linux]
CGI.escapeHTML 31.985M (± 1.2%) i/s - 160.093M in 5.006001s
ERB::Util.html_escape 7.427M (± 3.3%) i/s - 37.162M in 5.009721s
and CRuby 3.1 is:
ERB::Util.html_escape 14.551M (± 0.8%) i/s - 73.308M in 5.038335s
CGI.escapeHTML 10.065M (± 0.6%) i/s - 51.054M in 5.072629s
Given those results, could you build the C extension only for CRuby?
I think it would also be much nicer to keep the optimized HTML escape in CGI which is in stdlib, so it can be used by all templates engines.
In #19090 I did not expect rb_str_dup()
is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
A new method in CGI sounds best, or probably nicer an optional argument whether to always return a copy.
It seems tricky to change the existing method to return a mutable string as-is, I guess the person who found #11858 actually found it due to that incompatibility.
Ruby users seem to assume in general if a core/stdlib method returns a string either it's frozen or they are free to modify it, but here it would actually also mutate the original String passed to CGI.escapeHTML
.
String#to_s
really sounds like something every Ruby JIT/VM should be able to trivially optimize.
So that I think should be an insignificant cost, even on CRuby (and if it isn't it should be easy to fix).
Updated by k0kubun (Takashi Kokubun) about 2 years ago
Would you mind reviewing https://github.com/ruby/erb/pull/39? I skipped this ticket's change for truffleruby in the PR.