Project

General

Profile

Actions

Feature #15517

open

Net::HTTP not recognizing valid UTF-8

Added by cohen (Cohen Carlisle) over 2 years ago. Updated 2 months ago.

Status:
Assigned
Priority:
Normal
Target version:
-
[ruby-core:90931]

Description

I created a case at https://github.com/Cohen-Carlisle/utf8app that shows Net::HTTP labeling a response body as ASCII-8BIT encoded because it contains a non-ascii character (specifically, the double prime symbol: ″), but recognizing ascii-only strings as UTF-8 encoded. The example is live on heroku but because it's a free dyno, it will go to sleep and take a while to start up the first time it is hit after a while.

As explained there, I would expect response body strings with the double prime symbol to still have an encoding of UTF-8 since they are valid UTF-8.

The README from the repo (which shows the behavior) is reproduced below:

The purpose of this app is to demonstrate unexpected behavior in Ruby's net/http library. Valid UTF-8 response bodies are encoded as ASCII-8BIT, which apparently means Ruby is treating them as pure binary data, even when Content-Type headers label the body as UTF-8.

In the example below, I would expect the response body to have UTF-8 encoding. Especially because when I copy and paste the body into a new string literal in my console, that string is UTF-8 encoded.

require 'net/http'
uri = URI('https://utf8app.herokuapp.com')
uri.path = '/utf8/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ″.
res.body.encoding
# => #<Encoding:ASCII-8BIT>
res.body.ascii_only?
# => false
'The symbol for the inch unit of measurement is ″.'.encoding
# => #<Encoding:UTF-8>

We can demonstrate that the encoding issue is due to the non-ascii inches symbol by replacing it with a double quote instead.

uri.path = '/ascii/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ".
res.body.encoding
# => #<Encoding:UTF-8>
res.body.ascii_only?
# => true

Finally, as an extra WTF, JSON.parse recognizes the non-ascii characters as valid UTF-8 in a JSON example.

require 'json'
uri.path = '/utf8/example_json'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "application/json; charset=utf-8"
puts res.body
# {"feet":"′","inches":"″"}
res.body.encoding
# => #<Encoding:ASCII-8BIT>
json = JSON.parse(res.body)
# => {"feet"=>"′", "inches"=>"″"}
json.values.map { |v| [v.encoding.to_s, v] }
# => [["UTF-8", "′"], ["UTF-8", "″"]]

Related issues

Is duplicate of Ruby master - Feature #2567: Net::HTTP does not handle encoding correctlyAssignednaruse (Yui NARUSE)Actions
Actions #1

Updated by naruse (Yui NARUSE) over 2 years ago

  • Is duplicate of Feature #2567: Net::HTTP does not handle encoding correctly added

Updated by cohen (Cohen Carlisle) over 2 years ago

I'm not sure I think this is exactly the same as https://bugs.ruby-lang.org/issues/2567, as that one has focused on using the HTTP headers to guess the content type. Here I'm pointing out that ASCII-only strings are recognized as UTF8, but valid, multi-byte UTF8 strings are not recognized as UTF8 encoded. I suppose the trouble is that checking if the string is a valid UTF8 encoded string is not trivial, but other core/stdlib functions, like File.read seem to perform this.

Updated by duerst (Martin Dürst) over 2 years ago

I think this issue is not a duplicate of issue #2567, but it is clearly related.

Checking whether the string is valid UTF-8 is rather easy;

string.force_encoding('UTF-8').valid_encoding?

should do.

There are some security issues with browsers accepting certain mime types in certain encodings, and servers might mislabel stuff, but for text/plain; charset=utf-8, the text/plain part isn't a security issue, and text/plain doesn't have any way of indicating the encoding inside the document, so there should be no problems declaring the encoding of the resulting string as UTF-8.

Updated by jeremyevans0 (Jeremy Evans) 2 months ago

  • Backport deleted (2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN)
  • ruby -v deleted (2.6.0)
  • Assignee set to naruse (Yui NARUSE)
  • Status changed from Open to Assigned
  • Tracker changed from Bug to Feature

I've submitted a pull request (https://github.com/ruby/net-http/pull/17) that takes the patch provided by naruse (Yui NARUSE) in #2567 and modifies it to be opt-in in a backwards compatible manner. It also fixes various issues with the patch and adds some basic tests. There are definitely cases that would not be handled correctly in terms of detecting content through meta tags, but since it is opt-in it should not break existing code.

The current behavior is not considered a bug, so I'm switching this to a feature request, the same as #2567.

Actions

Also available in: Atom PDF