Bug #3686

Error in parsing musicbrainz.org with rexml

Added by vinc-mai (Vincent Carmona) almost 2 years ago. Updated about 1 year ago.

[ruby-core:31693]
Status: Closed
Start date: 08/13/2010
Priority: Normal
Due date:
Assignee: kou (Kouhei Sutou)
% Done: 0%
Category: lib
Target version: 2.0.0
ruby -v: ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

Description

REXML in ruby 1.9.1 fails to parse the page at this URL: http://musicbrainz.org/show/puid/?puid=c6a6717f-6d88-4d0e-4c57-d6b949118072

require 'net/http'
require 'rexml/document'

url = 'http://musicbrainz.org/show/puid/?puid=c6a6717f-6d88-4d0e-4c57-d6b949118072'
res = Net::HTTP.get_response(URI.parse(url))
# Raises REXML::ParseException on ruby 1.9.1 (see the backtrace below).
doc = REXML::Document.new(res.body)

/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<RuntimeError: Undeclared entity '&raquo;' in raw string "Skip to main content &raquo;"> (REXML::ParseException)
/usr/lib/ruby/1.9.1/rexml/text.rb:165:in `block in check'
/usr/lib/ruby/1.9.1/rexml/text.rb:153:in `scan'
/usr/lib/ruby/1.9.1/rexml/text.rb:153:in `check'
/usr/lib/ruby/1.9.1/rexml/text.rb:125:in `parent='
/usr/lib/ruby/1.9.1/rexml/parent.rb:19:in `add'
/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:45:in `parse'
/usr/lib/ruby/1.9.1/rexml/document.rb:228:in `build'
/usr/lib/ruby/1.9.1/rexml/document.rb:43:in `initialize'

$ ruby1.9.1 --version
ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

ruby 1.8.7 parses the same data without error.

History

Updated by meta (mathew murphy) over 1 year ago

FWIW, I checked: raquo is a valid entity in XHTML 1.1, the page passes validation, and it correctly declares the XHTML 1.1 DTD.

The entity DTD also seems to be included in the modular framework module, a required part of XHTML as per http://www.w3.org/TR/xhtml-modularization/schema_module_defs.html#a_character_entities

i.e. the XHTML DTD includes xhtml-framework-1.mod, which includes xhtml-charent-1.mod, which includes xhtml-lat1.ent, which defines the raquo entity.
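
For anyone stuck on an affected version, one possible workaround (my own sketch, not something proposed in this ticket) is to rewrite the named entities as numeric character references before handing the body to REXML, since numeric references need no DTD. The constant LATIN1_ENTITIES and the helper expand_named_entities are hypothetical names, and the table covers only a few of the entities that xhtml-lat1.ent defines.

```ruby
# Hypothetical helper, not part of REXML: a small subset of the
# Latin-1 entities from xhtml-lat1.ent, mapped to their code points.
LATIN1_ENTITIES = {
  'nbsp'  => 160,
  'copy'  => 169,
  'laquo' => 171,
  'raquo' => 187
}.freeze

# Replace known named entities with numeric character references.
# Names missing from the table (including the five predefined XML
# entities such as &amp;) are left untouched.
def expand_named_entities(xml)
  xml.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    code = LATIN1_ENTITIES[Regexp.last_match(1)]
    code ? format('&#%d;', code) : match
  end
end

puts expand_named_entities('Skip to main content &raquo;')
# prints: Skip to main content &#187;
```

The result could then be passed to REXML::Document.new in place of res.body. This only sidesteps the entity lookup; it does not make REXML read the external DTD.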

Updated by shyouhei (Shyouhei Urabe) over 1 year ago

  • Status changed from Open to Assigned
  • Assignee set to kou (Kouhei Sutou)

Updated by vinc-mai (Vincent Carmona) over 1 year ago

It seems that this bug has been fixed. I can parse the page with ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux] under Ubuntu 10.10.

Updated by kou (Kouhei Sutou) over 1 year ago

  • Category set to lib
  • Status changed from Assigned to Closed
  • Target version set to 2.0.0

It has been fixed in trunk.
