Bug #2130
closed
incorrect UTF8 encoding in CGI.unescapeHTML
Description
=begin
In CGI.unescapeHTML() in cgi.rb note that the html literal encoding is translated thus:
(from http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105)
when /\A#x([0-9a-f]+)\z/ni then
  if $1.hex < 256
    $1.hex.chr
  else
    if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
      [$1.hex].pack("U")
The second line should be:
  if $1.hex < 128
in order to conform with standards.
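The proposed change can be sketched as a stand-alone method. This is only an illustration of the hex branch quoted above, not the actual cgi.rb code: the method name unescape_hex_ref is hypothetical, and it assumes UTF-8 output is always wanted (the case where $KCODE is "u" in the original).

```ruby
# Hypothetical helper mirroring only the hex branch discussed above.
# Assumption: the caller always wants UTF-8 output.
def unescape_hex_ref(hex_digits)
  code_point = hex_digits.hex
  if code_point < 128
    # ASCII range: the single byte is identical in ASCII and UTF-8.
    code_point.chr
  elsif code_point < 65536
    # 0x80 and above must go through pack("U") to get a UTF-8 sequence.
    [code_point].pack("U")
  else
    # Outside the range handled by the quoted branch: keep the reference.
    "&#x#{hex_digits};"
  end
end
```

With this threshold, unescape_hex_ref("e9") goes through pack("U") and comes back as a two-byte UTF-8 sequence instead of a lone E9 byte.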
Explanation:
The input to unescapeHTML() is assumed to be valid HTML, and the output is apparently intended to be a valid UTF-8 Ruby string (see Array.pack("U")). However, for hex values 80-FF, pack is bypassed (the $1.hex < 256 test above) and chr emits a single raw byte, which is not a valid UTF-8 sequence, so these characters are unescaped incorrectly.
According to the HTML 4.01 spec, hexadecimal character references in the range 80-FF are valid HTML, since they conform to the "ISO 10646 hexadecimal character number H". While such a reference is valid HTML, it is important to note that a single byte above 7F is not valid UTF-8 on its own: code points U+0080 through U+00FF must be converted to their two-byte equivalents as per the UTF-8 specification. (Note that raw single bytes from 80-FF are also not valid in XML that is read as UTF-8, which is what the XML spec assumes for an entity that carries no encoding declaration or byte order mark.)
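The difference between the two code paths can be checked directly. The following is a small sketch for Ruby 1.9 or later (where strings carry an encoding); the constant names are made up for the example, and U+00E9 is chosen arbitrarily from the 80-FF range.

```ruby
CODE_POINT = 0xE9 # U+00E9, picked as a sample from the 80-FF range

# What the "< 256" branch produces: one raw byte.
RAW_BYTE = CODE_POINT.chr
# What pack("U") produces: the UTF-8 encoding of the code point.
UTF8_STRING = [CODE_POINT].pack("U")

# Hex dump of each result.
RAW_BYTE_HEX = RAW_BYTE.unpack("H*").first
UTF8_HEX = UTF8_STRING.unpack("H*").first

# Reinterpret the raw byte as UTF-8 and ask Ruby whether it is well formed.
RAW_IS_VALID_UTF8 = RAW_BYTE.dup.force_encoding("UTF-8").valid_encoding?
PACKED_IS_VALID_UTF8 = UTF8_STRING.valid_encoding?

puts "chr:       #{RAW_BYTE_HEX} valid UTF-8? #{RAW_IS_VALID_UTF8}"
puts "pack('U'): #{UTF8_HEX} valid UTF-8? #{PACKED_IS_VALID_UTF8}"
```

The chr result is the single byte e9, which is rejected as UTF-8, while the pack("U") result is the two-byte sequence c3a9.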
Background:
I found this error while debugging a Java-based webservice that returns HTML-escaped entities. The bug is partly in the webservice (since the webservice is XML-based, not HTML-based), but it led me to find the CGI.unescapeHTML bug while trying to implement a workaround. This is a borderline pedantic issue, but I figured it might help other people having this problem. Also, I might have made a mistake somewhere in the interpretation or the intent of the code, so feel free to comment. Thanks!
References:
http://www.w3.org/TR/html401/charset.html#h-5.3.1
http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/ISO_10646
http://corelib.rubyonrails.org/classes/Array.html#M000460
http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105
=end
Updated by coldnebo (Larry Kyrala) about 15 years ago
=begin
A friend pointed me to the HTMLEntities gem as a workaround. Notice that the HTMLEntities.decode method works because it essentially runs all entities through Array.pack("U"):
File lib/htmlentities.rb, line 45
def decode(source)
  return source.to_s.gsub(named_entity_regexp) {
    (cp = map[$1]) ? [cp].pack('U') : $&
  }.gsub(/&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i) {
    $1 ? [$1.to_i].pack('U') : [$2.to_i(16)].pack('U')
  }
end
FYI. Thanks!
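For anyone who only needs the numeric references, the same idea works without the gem. The sketch below reuses the numeric regexp from the decode method quoted above; the method name decode_numeric_refs is hypothetical and is not part of HTMLEntities, and named entities such as &amp;amp; are deliberately left untouched.

```ruby
# Hypothetical stand-alone helper; not part of the HTMLEntities gem.
# Decodes decimal (&#233;) and hexadecimal (&#xe9;) references only.
def decode_numeric_refs(source)
  source.to_s.gsub(/&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i) do
    # $1 is set for the decimal form, $2 for the hexadecimal form.
    code_point = $1 ? $1.to_i : $2.to_i(16)
    # Every code point goes through pack("U"), including 80-FF.
    [code_point].pack("U")
  end
end

puts decode_numeric_refs("caf&#xe9; / caf&#233;")
```

Because there is no separate single-byte shortcut, references in the 80-FF range come out as two-byte UTF-8 sequences like any other non-ASCII code point.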
References:
http://htmlentities.rubyforge.org/
http://htmlentities.rubyforge.org/doc/classes/HTMLEntities.html#M000004
=end
Updated by coldnebo (Larry Kyrala) about 15 years ago
=begin
More context about how I discovered this: I was passing the output of CGI.unescapeHTML() to ActiveSupport::Multibyte::Char.g_unpack() and received the following exception:
(ActiveSupport::Multibyte::EncodingError) "malformed UTF-8 character"
Investigating this problem led to finding the bug above.
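A similar failure can be reproduced without ActiveSupport, since String#unpack("U*") also rejects malformed UTF-8. This is a sketch only: the method name safe_code_points is hypothetical, and it stands in for, rather than reproduces, whatever g_unpack does internally.

```ruby
# Hypothetical helper: returns the code points of a UTF-8 string,
# or nil if the bytes are not well-formed UTF-8.
def safe_code_points(string)
  string.unpack("U*")
rescue ArgumentError
  # unpack("U*") raises ArgumentError on malformed UTF-8 input.
  nil
end

bad  = 0xE9.chr           # lone byte, as produced by the "< 256" branch
good = [0xE9].pack("U")   # proper UTF-8 encoding of U+00E9

puts safe_code_points(bad).inspect
puts safe_code_points(good).inspect
```

A check like this before handing the string to a multibyte-aware API makes the bad byte visible at the point where it was produced.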
=end
Updated by xibbar (Takeyuki FUJIOKA) about 15 years ago
- Assignee set to xibbar (Takeyuki FUJIOKA)
=begin
=end
Updated by xibbar (Takeyuki FUJIOKA) almost 15 years ago
- Status changed from Open to Closed
=begin
fixed in r25232
=end