Bug #10132
Closed
unpack() ignores default encoding when generating strings, always uses ASCII-8BIT
Description
New strings are generated in the default encoding:
irb> __ENCODING__.name
=> "UTF-8"
irb> "ünicode".encoding.name
=> "UTF-8"
...but not if they're generated by unpack:
irb> "ünicode".split.pack('M*').unpack('M*').first
=> "\xC3\xBCnicode"
irb> "ünicode".split.pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"
Workaround is to force the encoding on every string unpack generates:
irb> "ünicode".split.pack('M*').unpack('M*').first.force_encoding(__ENCODING__.name)
=> "ünicode"
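The workaround above can be wrapped in a small helper so the encoding is applied in one place. This is only a sketch: `unpack_as` is a hypothetical name, not part of Ruby, and it assumes the packed data really does contain text in the encoding you pass in, since `force_encoding` retags the bytes without converting or validating them.

```ruby
# Hypothetical helper (not part of Ruby's API): unpack, then retag every
# resulting String with the requested encoding. Non-String results
# (e.g. integers from 'C*') are passed through unchanged.
def unpack_as(data, format, encoding = Encoding::UTF_8)
  data.unpack(format).map do |item|
    item.is_a?(String) ? item.force_encoding(encoding) : item
  end
end

# "\u00FC" is U+00FC (u with diaeresis); the escape keeps this literal
# UTF-8 regardless of the source file's encoding.
word    = "\u00FCnicode"
packed  = [word].pack('M*')
decoded = unpack_as(packed, 'M*').first

decoded.encoding.name    # => "UTF-8"
decoded.valid_encoding?  # => true
decoded == word          # => true
```

Checking `valid_encoding?` after retagging is a cheap guard against input that was not actually encoded the way you assumed.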
Updated by meta (mathew murphy) over 10 years ago
In case there's confusion because of the strange splits in my examples:
["ünicode"].pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"
Updated by nobu (Nobuyoshi Nakada) over 10 years ago
- Status changed from Open to Rejected
pack("M*") (and pack("C*")) are for binary data primarily.
Updated by meta (mathew murphy) over 10 years ago
The Ruby documentation says:
M | String | quoted printable, MIME encoding (see RFC2045)
And RFC 2045 section 6.7 says:
The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set.
So the Ruby documentation itself says that it's a string not binary data, and it refers to an RFC that says the encoding is intended for textual (printable) characters.
Perhaps you were thinking of base64? I don't think I've ever seen quoted-printable used for binary data.
Updated by meta (mathew murphy) over 10 years ago
Now that I read the documentation on encodings more carefully, I think the real problem is more fundamental: __ENCODING__ doesn't determine the encoding of all created strings; it only affects strings created using string constants in the source code.
String.new.encoding => #<Encoding:ASCII-8BIT>
"".encoding => #<Encoding:UTF-8>
So:
> String.new == ""
=> true
> String.new.encoding == "".encoding
=> false
So Ruby is actually behaving as documented; it's just that I find the behavior surprising. Maybe I'm alone in that, though.
Any chance we could have a way to specify a default encoding for all created strings?
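For readers who hit the same String.new surprise: this is not a global default of the kind requested above, but on Ruby 2.3 and later String.new accepts an `encoding:` keyword, so the encoding can at least be stated where the string is created. A minimal sketch, assuming Ruby 2.3 or newer (variable names are illustrative only):

```ruby
# A string literal takes the source file's encoding (__ENCODING__).
literal = ""

# String.new with no arguments is tagged ASCII-8BIT (alias: Encoding::BINARY).
bare = String.new

# Ruby 2.3+ (assumed here): state the encoding at creation time.
tagged = String.new(encoding: Encoding::UTF_8)

literal.encoding == __ENCODING__      # => true
bare.encoding == Encoding::BINARY     # => true
tagged.encoding == Encoding::UTF_8    # => true
bare == literal                       # => true (both empty, so they compare equal)
```

Strings returned by unpack are unaffected by this keyword; they still need force_encoding as in the original workaround.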