Feature #19090
closedDo not duplicate an unescaped string in CGI.escapeHTML
Description
Proposal¶
Stop guaranteeing that GGI.escapeHTML
returns a new string even if there's nothing to be escaped.
More specifically, stop calling this rb_str_dup
https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.
Background¶
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why rb_str_dup
was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original gsub
-based implementation always returns a new object. As a result, even while many people use CGI.escapeHTML
as an optimized implementation for escaping HTML today, it ended up having a compromised performance.
Motivation¶
The motivation is to improve performance. By just doing so, escaping a pre-allocated "string"
becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.
The most major use case of CGP.escapeHTML
is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of '"&<>
characters. So we should stop wasting that to optimize that case.
[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with gem-codesearch
today.
The only reason to maintain the current behavior would be to allow using a return value of CGI.escapeHTML
as a buffer for creating another longer string starting with the escaped value, but using CGI.escapeHTML
to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.
Why not an optional flag like CGI.escapeHTML(str, dup: false)
?¶
Two reasons:
- The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using
CGI.escapeHTML
less readable just for maintaining a use case that doesn't exist. - Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use
Primitive
to address that, but this is a default gem and we can't use that.- We could workaround that if we choose
CGI.escapeHTML(str, false)
, but again it'd spoil the readability for maintaining an invalid use case.
- We could workaround that if we choose
Why not a new method?¶
It's a good idea actually, but with escapeHTML
, escape_html
, and h
aliased to it already, I can't think of a good name for it. And again, not calling it escapeHTML
or escape_html
would spoil the readability for no valid reason.
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Description updated (diff)
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Related to Bug #11858: CGI.escapeHTML should NOT return frozen string added
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) about 2 years ago
Isn't rb_str_dup
copy-on-write and so should be fairly cheap?
Updated by Dan0042 (Daniel DeLorme) about 2 years ago
I agree the dup is unnecessary since 99.99% of the time you're just doing buf << CGI.escapeHTML(str)
. Not sure the performance gain would be measurable. A new method like CGI.escapeHTML!
would make sense to me, as it indicates the danger of mutating the return value.
I did a search for CGI.escapeHTML
in gems: https://pastebin.com/7HYUTASZ
There's a lot of safe usage where the result is directly used in interpolation or concatenation.
There's also some "potentially unsafe" usage like
def escape_html(string)
CGI.escapeHTML(string)
end
which depends on how the escape_html
method is called, but in general the only valid use case for these methods is when appending to a buffer. So I'd say it's a low-risk change.
Updated by k0kubun (Takashi Kokubun) about 2 years ago
Isn't rb_str_dup copy-on-write and so should be fairly cheap?
As I wrote in the description, calling rb_str_dup
is making the whole method 1.34x slower no matter how efficient CoW is. Whether the 34% slowdown is cheap or not depends on how often the method is used and hits the non-escaped case. What I'm saying is that this method is used for almost all embedded expressions in templates and it hits the non-escaped case most of the time. Let's say you have a page that lists 100 resources with 5 fields, it would call CGI.escapeHTML
at least 500 times and many of them could have no '"&<>
characters.
Generally, escaping an HTML is known to be one of the largest bottlenecks in template engine benchmarks that enable HTML escaping, which is why I've literally spent years optimizing this method. I would never say a 34% slowdown in CGI.escapeHTML
is cheap. When I modified the benchmark created by the Slim team to escape embedded expressions, the benchmark became 1.1x faster by just removing this rb_str_dup
call. So the 34% slowdown in the microbenchmark of this method translates to a 10% slowdown in the template rendering.
A new method like CGI.escapeHTML! would make sense to me, as it indicates the danger of mutating the return value.
It's not doing any mutation by itself, and I don't know of any existing method named like that. Thus Matz would not like the name, and I'm not sure if the library maintainer (@nobu (Nobuyoshi Nakada)) likes it either. But thanks for proposing an alternative name.
Updated by Dan0042 (Daniel DeLorme) about 2 years ago
You know, if speed is that much a concern, I think optimized_escape_html2
should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).
Updated by Dan0042 (Daniel DeLorme) about 2 years ago
Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all. As such I would like to propose:
CGI.escapeHTML(str, "")
append to new string (current behavior)
CGI.escapeHTML(str, nil)
append to new string or return unmodified str
(new default?)
CGI.escapeHTML(str, buffer)
append to buffer (most efficient when rendering mutliple strings in template)
Updated by k0kubun (Takashi Kokubun) about 2 years ago
You know, if speed is that much a concern, I think optimized_escape_html2 should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).
ALLOCV_N
is essentially alloca
. I'd say this one is indeed "fairly cheap". Feel free to prove it otherwise by submitting a patch and benchmarking it though.
As to memcpy
, while technically memcpy
isn't called for the case we're discussing, *dest++ = c;
feels like an unneeded overhead and I have an idea to improve it. I'd discuss it in a different ticket though.
Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all.
You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails. I believe it could/should be fixed, but it's a different discussion.
Updated by k0kubun (Takashi Kokubun) about 2 years ago
We might be able to make it efficient for a non-String buffer too, so I'll give it a try first. Thanks for the idea.
Updated by k0kubun (Takashi Kokubun) about 2 years ago
- Status changed from Open to Closed
On second thought, it might not be necessarily more efficient even with a bare String buffer because the point of the current CGI.escapeHTML
implementation is to pre-allocate a 6x-size buffer on the stack to avoid multiple reallocations for buffer extension. That optimization is hard if you need to directly write to the argument String from the beginning; the best thing you could do would be to realloc
it to a 6x size first and then realloc
it again to shrink it. Doing object reallocation twice seems expensive compared to just manipulating a stack pointer, even combined with extra memcpy
.
Plus, as a maintainer of many template engines, this interface seems hard to use in some cases. It seems to help only limited places with a lot of complexity.
I came up with a completely different solution. I'll file it as a different ticket after verifying the idea. So closing this. Thank you for the discussion.
Updated by Dan0042 (Daniel DeLorme) about 2 years ago
It's true ALLOCV_N is almost free but I was thinking more about the cost of copying between buffers
a) when nothing needs escaping: copy bytes to tmp buffer then discard
b) when something needs escaping: copy bytes to tmp buffer then copy again to final string
You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails.
If buffer is bare String we can use CGI.escapeHTML(str, buffer)
, and if something else (Array?) we can use CGI.escapeHTML(str)
. I think it's nice to support both cases.
6x allocation is optimal for worst cast like '"' * 1000
but personally I would optimize for common case. So grow the string by an amount sufficient for 99.9% of cases (maybe something like (N+30)*1.2) and accept a realloc for degenerate cases. It's a different kind of choice/trade-off.
Thank you for the discussion, it's a fun topic.
Updated by Eregon (Benoit Daloze) about 2 years ago
- Related to Feature #19102: Optimize ERB::Util.html_escape more than CGI.escapeHTML for template engines added
Updated by Eregon (Benoit Daloze) about 2 years ago
I did not expect rb_str_dup()
is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI), as the fastest way is implementation-dependent.
See https://bugs.ruby-lang.org/issues/19102#note-4
I think we should add an optional argument to avoid copying the string when unchanged, that's easy and avoids yet another place/method with that escaping logic.
Updated by k0kubun (Takashi Kokubun) about 2 years ago
I filed a PR for truffleruby https://github.com/ruby/erb/pull/39.
I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI)
I didn't get what you mean in this part. CGI and ERB are both default gems. There's no difference in its "coreness" between them.
Updated by Eregon (Benoit Daloze) about 2 years ago
Right, I forgot CGI is a default gem too.
I think it seems cleaner for other template engines (e.g. haml, slim, etc) to depend (as in require "cgi") on CGI vs depending on ERB, i.e. CGI feels smaller/like a subset of ERB. In other words, it doesn't seem ideal for other template engines to depend on the ERB template engine just for escaping (although practically speaking there is no issue, just from a design perspective).
Updated by k0kubun (Takashi Kokubun) about 2 years ago
I get what you're saying. My position on this issue is:
-
CGI
is not a good place either unless you're writing a CGI application. ERB also hasERB::Escape
now, and I'd say "embedded Ruby escape" is a better module name than "CGI" for the purpose. - ERB 4.0.0 was shipped with a new file
erb/util.rb
, which allows you to load only a couple of escape methods, not loading an extra template engine. - The way we defined the method was designed by @jeremyevans0 (Jeremy Evans) for Erubi. Loading the ERB template engine would be the last thing the Erubi maintainer would like to do. So, thanks to his idea, it's possible to load only escape methods.
-
ERB::Util.html_escape
is monkey-patched by Rails and it's been the only officialhtml_safe
-compatible HTML escape method for years (while it's been using Erubis or Erubi). Since it's the only Rails-official way to do it, moving it to another place would be unnecessarily disruptive.
Updated by Eregon (Benoit Daloze) about 2 years ago
Thank you for the explanation, that's very good arguments and they address my concerns.