Bug #10132


unpack() ignores default encoding when generating strings, always uses ASCII-8BIT

Added by meta (mathew murphy) about 7 years ago. Updated about 7 years ago.

ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]


New strings are generated in the default encoding:

=> "UTF-8"
irb> "ünicode".encoding.name
=> "UTF-8"

...but not if they're generated by unpack:

irb> "ünicode".split.pack('M*').unpack('M*').first
=> "\xC3\xBCnicode"
irb> "ünicode".split.pack('M*').unpack('M*')

Workaround is to force the encoding on every string unpack generates:

irb> "ünicode".split.pack('M*').unpack('M*').first.force_encoding(
=> "ünicode"
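
A minimal sketch of that workaround, assuming UTF-8 is the encoding you want. The helper name unpack_as is hypothetical and not part of Ruby. The sample string is written with a \u escape so the example does not depend on the source file's encoding.

```ruby
# Hypothetical helper (not part of Ruby): unpack, then retag every String
# result with the requested encoding. UTF-8 is an assumed default.
def unpack_as(data, format, encoding = Encoding::UTF_8)
  data.unpack(format).map do |value|
    # Non-String results (integers, floats) pass through untouched.
    value.is_a?(String) ? value.dup.force_encoding(encoding) : value
  end
end

original = "\u00FCnicode"              # u-umlaut followed by "nicode"
packed   = [original].pack('M*')
decoded  = unpack_as(packed, 'M*').first

p decoded.encoding == Encoding::UTF_8  # true
p decoded.valid_encoding?              # true
p decoded == original                  # true
```

Note that force_encoding only relabels the bytes; it neither transcodes nor validates them, so the valid_encoding? check is worth keeping if the input is untrusted.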

Updated by meta (mathew murphy) about 7 years ago

In case there's confusion because of the strange splits in my examples:


Updated by nobu (Nobuyoshi Nakada) about 7 years ago

  • Status changed from Open to Rejected

pack("M*") (and pack("C*")) are for binary data primarily.

Updated by meta (mathew murphy) about 7 years ago

The Ruby documentation says:

M | String | quoted printable, MIME encoding (see RFC2045)

And RFC 2045 section 6.7 says:

The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set.

So the Ruby documentation itself says that it's a string, not binary data, and it refers to an RFC that says the encoding is intended for textual (printable) characters.
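
To make that concrete, here is a small sketch of what the M directive does to the example string from this report. It is illustrative only; the variable names are mine.

```ruby
text   = "\u00FCnicode"          # u-umlaut followed by "nicode", as UTF-8
packed = [text].pack('M')

# ASCII bytes pass through unchanged; the two UTF-8 bytes of the u-umlaut
# become =C3=BC, and pack appends a soft line break ("=\n") to the
# unterminated last line.
p packed                          # "=C3=BCnicode=\n"

decoded = packed.unpack('M').first
p decoded.bytes == text.bytes     # true: the bytes survive the round trip
p decoded == text                 # false: same bytes, different encoding tag
```

The last comparison is the practical cost of the behavior reported here: the decoded text has the right bytes but does not compare equal to the original until it is retagged.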

Perhaps you were thinking of base64? I don't think I've ever seen quoted-printable used for binary data.

Updated by meta (mathew murphy) about 7 years ago

Now that I read the documentation on encodings more carefully, I think the real problem is more fundamental: __ENCODING__ doesn't determine the encoding of all created strings; it only affects strings created using string literals in the source code.

String.new.encoding => #<Encoding:ASCII-8BIT>
"".encoding         => #<Encoding:UTF-8>


> String.new == ""
=> true
> String.new.encoding == "".encoding
=> false

So Ruby is actually behaving as documented; it's just that I find the behavior surprising. Maybe I'm alone in that, though.
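
A short sketch of the difference between a literal and a string built at run time. The encoding: keyword on String.new is only available on Ruby 2.3 and later, so it would not have helped on the 2.1.1 build this was reported against.

```ruby
literal = ""           # a literal takes the source encoding
built   = String.new   # no literal involved

p literal.encoding == __ENCODING__         # true
p built.encoding == Encoding::ASCII_8BIT   # true
p built == literal                         # true: both are empty

# Ruby 2.3+ only: request the encoding explicitly at construction time.
tagged = String.new(encoding: __ENCODING__)
p tagged.encoding == literal.encoding      # true
```

This only covers strings you construct yourself; strings returned by methods such as unpack still carry whatever encoding those methods assign.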

Any chance we could have a way to specify a default encoding for all created strings?


