Bug #2130


incorrect UTF8 encoding in CGI.unescapeHTML

Added by coldnebo (Larry Kyrala) about 15 years ago. Updated over 13 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.8.6 (2009-06-08 patchlevel 369) [x86_64-linux]
[ruby-core:25702]

Description

=begin
In CGI.unescapeHTML() in cgi.rb note that the html literal encoding is translated thus:
(from http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105)

   when /\A#x([0-9a-f]+)\z/ni then
     if $1.hex < 256
       $1.hex.chr
     else
       if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
         [$1.hex].pack("U")

The second line should be:
if $1.hex < 128

in order to conform with standards.
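The proposed threshold can be sketched as a standalone helper rather than a patch to cgi.rb. The name decode_hex_entity is hypothetical, and the $KCODE and 65536 checks from the original code are left out for brevity, so this only illustrates the change from 256 to 128:

```ruby
# Hypothetical helper, not part of cgi.rb. Takes the hex digits of a
# numeric character reference, e.g. "e9" from "&#xe9;".
def decode_hex_entity(hex_digits)
  code_point = hex_digits.hex
  if code_point < 128
    # ASCII range: a single byte is already valid UTF-8.
    code_point.chr
  else
    # 0x80 and above: let pack("U") build the multi-byte UTF-8 sequence.
    [code_point].pack("U")
  end
end
```

With this threshold, "41" still decodes to the single byte for "A", while "e9" decodes to the two bytes C3 A9 instead of the lone byte E9.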

Explanation:
The inputs of the unescapeHTML() method are assumed to be valid HTML. The outputs are apparently intended to be valid UTF-8 Ruby strings (see Array#pack("U")). However, for hex values 80-FF, pack is bypassed by the $1.hex < 256 check above, so these characters are returned as a single raw byte instead of being encoded as UTF-8.

According to the HTML 4.01 spec, single-byte hex entity encodings from 80-FF are valid HTML since they conform to the "ISO 10646 hexadecimal character number H". Although such an entity is valid HTML, a lone byte above 7F is not valid UTF-8; the character has to be converted to its two-byte equivalent as per the UTF-8 specification (U+H). (Note that one-byte encodings from 80-FF are also not valid XML, since the XML spec requires entity encodings to be valid UTF-8 sequences.)
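To make the byte-level difference concrete, here is a small illustration using U+00E9 as an arbitrary example value. It compares what Integer#chr and Array#pack("U") return for the same code point; the variable names are only for illustration:

```ruby
code_point = 0xE9  # U+00E9, as in the entity "&#xE9;"

raw_byte = code_point.chr          # what the `< 256` branch returns
utf8     = [code_point].pack("U")  # what the pack("U") branch returns

raw_bytes  = raw_byte.unpack("C*")  # [0xE9]: a lone byte, not valid UTF-8
utf8_bytes = utf8.unpack("C*")      # [0xC3, 0xA9]: the two-byte UTF-8 form
```

The lone byte E9 is a lead byte with no continuation byte in UTF-8, which is why a consumer expecting UTF-8 rejects or misreads the output of the first branch.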

Background:
I found this error while debugging a Java-based web service that returns HTML-escaped entities. The bug is partly in the web service (since it is XML-based, not HTML-based), but it led me to find the CGI.unescapeHTML bug while trying to implement a workaround. This is a borderline pedantic issue, but I figured it might help other people having this problem. Also, I might have made a mistake somewhere in the interpretation or the intent of the code, so feel free to comment. Thanks!

References:
http://www.w3.org/TR/html401/charset.html#h-5.3.1
http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/ISO_10646
http://corelib.rubyonrails.org/classes/Array.html#M000460
http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105
=end
