Bug #19196

closed

The string saved to Tempfile from URI.open escapes "&" character

Added by westoque (William Estoque) over 1 year ago. Updated over 1 year ago.

Status: Rejected
Assignee: -
Target version: -
[ruby-core:111263]

Description

When I read the string response from URI.open, it is not equivalent to the response body.

How to reproduce:

require "open-uri"

url = "https://www.podcastone.com/podcast?categoryID2=1237"

handle = URI.open(url)
=> #<Tempfile:/path/to/tempfile>

puts handle.read
.... https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309...

In the browser, the actual string reads:

https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&adwNewID3=true&awNetwork=309

Notice the characters #38;

My initial research suggests this happens because the Tempfile that gets created is in ASCII-8BIT, and in ASCII-8BIT the ampersand is a "38".

I propose that we should have a way to force the encoding of the Tempfile to UTF-8 so that this character is not escaped and the string encoding is preserved.

Actions #1

Updated by westoque (William Estoque) over 1 year ago

  • Subject changed from The string saved to Tempfile from URI.open escapes "&" characters to The string saved to Tempfile from URI.open escapes "&" character
Actions #2

Updated by westoque (William Estoque) over 1 year ago

  • Description updated (diff)
Actions #3

Updated by westoque (William Estoque) over 1 year ago

  • Description updated (diff)

Actions #4

Updated by ufuk (Ufuk Kayserilioglu) over 1 year ago

The content you are reading is XML, and the &#38; sequences are there because of XML escaping: &#38; is the numeric character reference for "&" (code point 38). They are not related to any kind of file encoding, ASCII-8BIT or UTF-8.
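
To illustrate the point about encodings, here is a minimal sketch. It assumes a shortened, made-up fragment (the example.com URL is a placeholder, not the real feed) standing in for the response body. Relabelling the string from ASCII-8BIT to UTF-8 leaves the bytes untouched, so the &#38; sequence is still there afterwards:

```ruby
# Hypothetical stand-in for part of the response body; String#b returns
# a copy tagged as ASCII-8BIT, similar to what the report describes.
body = 'url="https://example.com/ep.mp3?awCollectionId=1237&#38;adwNewID3=true"'.b

# Relabel the same bytes as UTF-8. force_encoding changes only the
# encoding tag; it does not transcode or unescape anything.
utf8 = body.dup.force_encoding("UTF-8")

puts body.encoding == Encoding::ASCII_8BIT  # the original is binary
puts utf8.encoding == Encoding::UTF_8       # the copy is tagged UTF-8
puts utf8.bytes == body.bytes               # same bytes either way
puts utf8.include?("&#38;")                 # the escape sequence survives
```

In other words, forcing the Tempfile contents to UTF-8 would not remove the escaping, because the escaping is part of the document text itself.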

Moreover, they are there in the response from the server, which you can see by looking at the output of curl for the same resource:

$ curl -s "https://www.podcastone.com/podcast?categoryID2=1237" | grep "aw.noxsolutions.com/launchpod/adswizz/1237/762-"
...
<enclosure length="74614442" type="audio/mpeg" url="https://dts.podtrac.com/redirect.mp3/pdst.fm/e/chrt.fm/track/E2G895/aw.noxsolutions.com/launchpod/adswizz/1237/762-FeedbackFriday-249-V2_mzwq_b1dc1677.mp3?awCollectionId=1237&#38;awEpisodeId=ee01b21a-878d-4be4-974c-e504b1dc1677&#38;adwNewID3=true&#38;awNetwork=309"></enclosure>
...

So, this is not a Ruby problem at all. On the contrary, Ruby can help you unescape these characters:

require "cgi"
CGI.unescapeHTML("foo&#38;bar") # => "foo&bar"
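
Applied to an enclosure element like the one in the curl output, that could look roughly like the sketch below. The XML fragment and its example.com URL are shortened placeholders, and the regular expression is only an illustration for this one attribute; a real feed would be better handled with an XML parser, which unescapes attribute values itself.

```ruby
require "cgi"

# Hypothetical, shortened version of the <enclosure> element from the feed.
xml = '<enclosure length="74614442" type="audio/mpeg" ' \
      'url="https://example.com/ep.mp3?awCollectionId=1237&#38;adwNewID3=true"></enclosure>'

# Grab the raw (still XML-escaped) value of each url="..." attribute,
# then undo the escaping with CGI.unescapeHTML.
urls = xml.scan(/url="([^"]*)"/).flatten.map { |raw| CGI.unescapeHTML(raw) }

puts urls.first
```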
Actions #5

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Status changed from Open to Rejected

Actions #6

Updated by westoque (William Estoque) over 1 year ago

@ufuk (Ufuk Kayserilioglu) thank you for that explanation. I may have jumped to conclusions when comparing the response in the browser (Chrome), which unescaped the characters, with the output of curl.
