Project

General

Profile

Actions

Feature #17992

open

Upstreaming the htmlentities gem into CGI#.(un)escape_html

Added by AMomchilov (Alexander Momchilov) 4 months ago. Updated 4 months ago.

Status:
Feedback
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:104288]

Description

Hi there,

I was looking to unescape some HTML entities in a String, and I discovered that CGI#.(un)escape_html is really limited. Many StackOverflow questions share a similar disappointment, and point users to using the htmlentities gem:

  1. https://stackoverflow.com/a/383561/3141234
  2. https://stackoverflow.com/a/22926384/3141234

This solved my problem, but I feel like something this standard/universal should be built-in. To that end, I'm interested in working on merging the htmlentities gem into CGI's repo. Would this be a welcome change?

  • I've e-mailed the author (Paul Battley) privately, and got his blessing to do so.
  • It's MIT licensed, so that should be OK.

Updated by k0kubun (Takashi Kokubun) 4 months ago

  • Status changed from Open to Feedback

Could you clarify a bit more context about why you'd like to escape these characters not supported in CGI.escapeHTML?

I believe CGI.escapeHTML has been primarily used to avoid breaking the DOM structure by escaping dynamic contents in HTML templates with optimal performance. For this particular use case, escaping only <>'" (and &) is a very understandable behavior to me, and I would prefer rather not escaping any other character for the best performance as long as it's not considered as a security vulnerability.

require 'benchmark/ips'
require 'htmlentities'
require 'cgi/escape'

str = <<~HTML
  <body>
  <div>
      <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples in documents. You may use this
      domain in examples without prior coordination or asking for permission.</p>
      <p><a href="http://www.iana.org/domains/example">More information...</a></p>
  </div>
  </body>
HTML
coder = HTMLEntities.new

Benchmark.ips do |x|
  x.report("CGI.escapeHTML") { CGI.escapeHTML(str) }
  x.report("HTMLEntities #{HTMLEntities::VERSION::STRING}") { coder.encode(str) }
  x.compare!
end
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
Warming up --------------------------------------
      CGI.escapeHTML   112.937k i/100ms
  HTMLEntities 4.3.4     1.029k i/100ms
Calculating -------------------------------------
      CGI.escapeHTML      1.131M (± 2.3%) i/s -      5.760M in   5.095252s
  HTMLEntities 4.3.4     10.281k (± 2.1%) i/s -     51.450k in   5.006333s

Comparison:
      CGI.escapeHTML:  1131036.5 i/s
  HTMLEntities 4.3.4:    10281.4 i/s - 110.01x  (± 0.00) slower

Note that CGI.escapeHTML is the default HTML escape method of Rails. You'll make every embedded Ruby expression in Rails 110x slower if you suddenly replace CGP.escapeHTML with that gem.

We may want to support escaping these characters for some other usages, but for backward compatibility and the performance in existing places, I'd like the feature to be enabled only by a new option or another method.

Actions #2

Updated by jeremyevans0 (Jeremy Evans) 4 months ago

  • Backport deleted (2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN)
  • Tracker changed from Bug to Feature

Updated by AMomchilov (Alexander Momchilov) 4 months ago

k0kubun (Takashi Kokubun) wrote in #note-1:

Could you clarify a bit more context about why you'd like to escape these characters not supported in CGI.escapeHTML?

We have a dataset of user content that contains HTML entities (whether intended or not), which we would like rendered by a non-browser presentation layer. We want our server backend to decode these into their proper characters.

You make a good point about the performance. Definitely don't want to touch escapeHTML.

Granted, the name, "escapeHTML" is about escaping HTML, and not about necessarily decoding HTML entites, so the current behaviour makes sense.

What if the more-general (but slower) entity decoding behaviour was added under, say, (de|en)code_html_entities?

Updated by sawa (Tsuyoshi Sawada) 4 months ago

Note that escaping and unescaping are not symmetric. There may be a point in keeping escaping to the minimum for performance reasons. However, when it comes to unescaping, it would be useless unless all escaped characters are completely unescaped, including the optionally escaped characters. And I think that is the main focus of the linked questions.

So, while it may be better to execute optional escaping only by calling a method under a different method name or by using a gem, the unescaping feature of CGI should be supplemented and be made complete by default regardless of the impact on performance.

Updated by AMomchilov (Alexander Momchilov) 4 months ago

I don't know anything about anything, but having a basic-but-fast operation that only does the bare minimum for security, makes sense to me.

I just want to push for a more complete alternative to exist in some form or another.

Updated by AMomchilov (Alexander Momchilov) 4 months ago

I took the first steps in incorporating this gem, starting with:

git subtree add \
     --prefix=lib/cgi/htmlentities \
     https://github.com/threedaymonk/htmlentities \
     master

Which puts the entire htmlentities repo under lib/cgi/htmlentities`. This is really nice because it preserves all the git history of the files, original authorship, etc.

  • From there, I started pulling the subfolders apart, such as moving its tests from
    lib/cgi/htmlentities/tests/*_test.rb to test/cgi/htmlentities/test_*.rb.

    Here's what that process looked like:

    https://github.com/amomchilov/cgi/compare/embed-htmlentities-gem-subtree...amomchilov:integrate-htmlentities-gem?expand=1

  • I also name-spaced all of its files from HTMLEntities::* to CGI::HTMLEntities::*, so that this doesn't conflict with any existing code that uses both cgi and htmlentities.

  • Next I plan to add require 'cgi/htmlentities/lib/htmlentities'" tolib/cgi/util.rb, and the add#escape_html_entitesand#unescape_html_entitiesmethods, which forward ontoHTMLEntities`'s methods.

How does this look so far?

FYI Here's the "full diff" between my work and CGI's master branch:

https://github.com/ruby/cgi/compare/master...amomchilov:integrate-htmlentities-gem?expand=1

Updated by duerst (Martin Dürst) 4 months ago

It is difficult to make this complete without overdoing it because there are many different kinds of entity sets. The htmlentities gem distinguishes three (html4, xhtml1, and expanded). Both html4 and xhtml1 come in at around 250 entities; expanded is around 1000. But it doesn't include the html5 set, which seems to be even bigger (https://html.spec.whatwg.org/entities.json contains 2231 entries, although some of them are duplicates (with and without semicolon)). It is highly unclear what the correct set would be if we expand the coverage of CGI.escapeHTML and friends.

Updated by duerst (Martin Dürst) 4 months ago

Sorry, forgot to give a link to a human-readable table of html5 entities: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references.

Updated by AMomchilov (Alexander Momchilov) 4 months ago

I'm reasonably convinced that augmenting the existing unescapeHTML API is off the table, for performance reasons.

I plan to add this under a new (de|en)code_html_entities name.

It could be parameterized just like HTMLEntities, to let you pick one of the entity sets" (it calls them "flavors")

Actions

Also available in: Atom PDF