Bug #2130

closed

incorrect UTF8 encoding in CGI.unescapeHTML

Added by coldnebo (Larry Kyrala) over 14 years ago. Updated almost 13 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.8.6 (2009-06-08 patchlevel 369) [x86_64-linux]
[ruby-core:25702]

Description

=begin
In CGI.unescapeHTML() in cgi.rb, hexadecimal character references are decoded as follows:
(from http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105)

   when /\A#x([0-9a-f]+)\z/ni then
     if $1.hex < 256
       $1.hex.chr
     else
       if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
         [$1.hex].pack("U")

The second line should be:
if $1.hex < 128

so that values 80-FF are also emitted as valid UTF-8, in conformance with the standards.
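
As a rough illustration of the proposed threshold change, the branch could be isolated as below. The method name decode_hex_reference is hypothetical, and the $KCODE check and the 65536 upper bound from cgi.rb are left out for brevity, so this is a sketch under the assumption of UTF-8 output, not the actual patch.

```ruby
# Hypothetical helper, not part of cgi.rb: decodes the hex digits of a
# numeric character reference (the "e9" in "&#xe9;"), assuming UTF-8 output.
def decode_hex_reference(hex_digits)
  code = hex_digits.hex
  if code < 128
    code.chr            # 7-bit ASCII: a single byte is already valid UTF-8
  else
    [code].pack("U")    # everything else goes through the UTF-8 packer
  end
end

decode_hex_reference("41").unpack("C*")  # => [65]
decode_hex_reference("e9").unpack("C*")  # => [195, 169]
```

With the original threshold of 256, the second call would return the single byte 233 instead.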

Explanation:
The input to the unescapeHTML() method is assumed to be valid HTML. The output is apparently intended to be a valid UTF-8 Ruby string (see Array.pack("U")). However, for hex values 80-FF, pack is bypassed ($1.hex < 256 above) and a single raw byte is emitted via chr, so these characters are unescaped incorrectly.

According to the HTML 4.01 spec, hexadecimal character references from 80-FF are valid HTML since they conform to the "ISO 10646 hexadecimal character number H". While such a reference is valid HTML, it is important to note that a single byte above 7F is not a valid UTF-8 encoding; these values must be converted to their two-byte equivalents as per the UTF-8 specification (U+H). (Note that one-byte encodings from 80-FF are also not valid XML, since the XML spec requires entity encodings to be valid UTF-8 sequences.)
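
To make the byte-level difference concrete, here is a small sketch comparing the two code paths for U+00E9. The variable names are only illustrative; Integer#chr, Array#pack("U") and String#unpack("C*") are the standard Ruby calls.

```ruby
code = 0xE9                        # U+00E9, as written in "&#xE9;"

single_byte = code.chr             # what the `< 256` branch produces
two_bytes   = [code].pack("U")     # what the pack("U") branch produces

single_byte.unpack("C*")           # => [233]       (0xE9 on its own)
two_bytes.unpack("C*")             # => [195, 169]  (0xC3 0xA9)
```

Only the second result is a well-formed UTF-8 sequence for this code point.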

Background:
I found this error while debugging a Java-based webservice that returns HTML-escaped entities. The bug is partly in the webservice (since the webservice is XML-based, not HTML-based), but it led me to find the CGI.unescapeHTML bug while trying to implement a workaround. This is a borderline pedantic issue, but I figured it might help other people having this problem. Also, I might have made a mistake somewhere in the interpretation or the intent of the code, so feel free to comment. Thanks!

References:
http://www.w3.org/TR/html401/charset.html#h-5.3.1
http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/ISO_10646
http://corelib.rubyonrails.org/classes/Array.html#M000460
http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105
=end

Actions #1

Updated by coldnebo (Larry Kyrala) over 14 years ago

=begin
A friend pointed me to the HTMLEntities gem as a workaround. Notice that the HTMLEntities.decode method works because it essentially runs all entities through Array.pack("U"):

File lib/htmlentities.rb, line 45

   def decode(source)
     return source.to_s.gsub(named_entity_regexp) {
       (cp = map[$1]) ? [cp].pack('U') : $&
     }.gsub(/&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i) {
       $1 ? [$1.to_i].pack('U') : [$2.to_i(16)].pack('U')
     }
   end
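
For anyone who wants to try the idea without installing the gem, the numeric half of that method can be approximated as below. decode_numeric_references is a hypothetical name, named entities such as &amp; are not handled, and out-of-range code points are not checked, so treat it as a sketch rather than a replacement for HTMLEntities.

```ruby
# Hypothetical standalone helper modeled on the numeric-reference half of
# the decode method quoted above. Every match goes through pack("U").
def decode_numeric_references(source)
  numeric_reference = /&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i
  source.to_s.gsub(numeric_reference) do
    # $1 is set for decimal references, $2 for hexadecimal ones.
    code = $1 ? $1.to_i : $2.to_i(16)
    [code].pack("U")
  end
end

decode_numeric_references("caf&#xE9;").unpack("C*")   # => [99, 97, 102, 195, 169]
decode_numeric_references("caf&#233;").unpack("C*")   # => [99, 97, 102, 195, 169]
```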

FYI. Thanks!

References:
http://htmlentities.rubyforge.org/
http://htmlentities.rubyforge.org/doc/classes/HTMLEntities.html#M000004
=end

Actions #2

Updated by coldnebo (Larry Kyrala) over 14 years ago

=begin
More context about how I discovered this: I was passing the output of CGI.unescapeHTML() to ActiveSupport::Multibyte::Char.g_unpack() and received the following exception:
(ActiveSupport::Multibyte::EncodingError) "malformed UTF-8 character"

Investigating this problem led to finding the bug above.
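
A similar failure can be reproduced without ActiveSupport. The sketch below uses String#unpack("U*") as a stand-in for g_unpack (an assumption on my part that both reject the same input), and utf8_decode_outcome is a hypothetical helper name.

```ruby
# Hypothetical helper: tries to read a string as UTF-8 code points and
# reports either the code points or the error message.
def utf8_decode_outcome(bytes)
  bytes.unpack("U*")
rescue ArgumentError => e
  e.message
end

utf8_decode_outcome(0xE9.chr)            # lone byte 0xE9: returns an error message
utf8_decode_outcome([0xE9].pack("U"))    # => [233]
```

The first call returns an error message that mentions a malformed UTF-8 character, while the two-byte form decodes cleanly.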
=end

Actions #3

Updated by xibbar (Takeyuki FUJIOKA) over 14 years ago

  • Assignee set to xibbar (Takeyuki FUJIOKA)


Actions #4

Updated by xibbar (Takeyuki FUJIOKA) over 14 years ago

  • Status changed from Open to Closed

=begin
fixed in r25232
=end
