Bug #10132

closed

unpack() ignores default encoding when generating strings, always uses ASCII-8BIT

Added by meta (mathew murphy) about 7 years ago. Updated about 7 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
[ruby-core:64359]

Description

New strings are generated in the default encoding:

irb> __ENCODING__.name
=> "UTF-8"
irb> "ünicode".encoding.name
=> "UTF-8"

...but not if they're generated by unpack:

irb> "ünicode".split.pack('M*').unpack('M*').first
=> "\xC3\xBCnicode"
irb> "ünicode".split.pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"

The workaround is to force the encoding on every string that unpack generates:

irb> "ünicode".split.pack('M*').unpack('M*').first.force_encoding(__ENCODING__.name)
=> "ünicode"
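To avoid repeating that call everywhere, the workaround could be wrapped in a small helper. This is only a sketch: `unpack_as` is a hypothetical name, not part of Ruby, and the UTF-8 default is an assumption about what the caller wants. It relabels the bytes without transcoding them, so it is only appropriate when the unpacked bytes really are in that encoding.

```ruby
# Hypothetical helper (not part of Ruby): unpack, then tag every String
# result with the encoding the caller expects. Non-String results
# (e.g. Integers from 'C*') are passed through unchanged.
def unpack_as(data, format, encoding = Encoding::UTF_8)
  data.unpack(format).map do |item|
    item.is_a?(String) ? item.force_encoding(encoding) : item
  end
end

original = "\u00FCnicode"               # "ünicode", written with an escape
encoded  = [original].pack('M*')        # quoted-printable text
decoded  = unpack_as(encoded, 'M*').first

puts decoded.encoding.name              # UTF-8
puts decoded.valid_encoding?            # true
puts decoded == original                # true
```

Checking `valid_encoding?` afterwards is a cheap way to notice when the bytes were not actually in the assumed encoding.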

Updated by meta (mathew murphy) about 7 years ago

In case there's confusion because of the strange splits in my examples:

["ünicode"].pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"

Updated by nobu (Nobuyoshi Nakada) about 7 years ago

  • Status changed from Open to Rejected

pack("M*") (and pack("C*")) are for binary data primarily.

Updated by meta (mathew murphy) about 7 years ago

The Ruby documentation says:

M | String | quoted printable, MIME encoding (see RFC2045)

And RFC 2045 section 6.7 says:

The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set.

So the Ruby documentation itself says that it's a string not binary data, and it refers to an RFC that says the encoding is intended for textual (printable) characters.

Perhaps you were thinking of base64? I don't think I've ever seen quoted-printable used for binary data.
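As a rough illustration of the point about printable text, the sketch below shows what the M directive produces for the example string. The variable names are mine; the string is written with a `\u` escape only so the snippet does not depend on the source file's encoding.

```ruby
text    = "\u00FCnicode"            # "ünicode" in UTF-8
encoded = [text].pack('M')          # quoted-printable form

p encoded                           # "=C3=BCnicode=\n"
p encoded.ascii_only?               # true

# Decoding returns the same bytes; only the encoding label differs.
raw = encoded.unpack('M').first
p raw.bytes == text.bytes           # true
```

The encoded form is plain printable ASCII, and the decoded bytes match the original exactly, so nothing is lost apart from the encoding tag.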

Updated by meta (mathew murphy) about 7 years ago

Now that I read the documentation on encodings more carefully, I think the real problem is more fundamental: __ENCODING__ doesn't determine the encoding of all created strings; it only affects strings created from string literals in the source code.

String.new.encoding => #<Encoding:ASCII-8BIT>
"".encoding         => #<Encoding:UTF-8>

So:

> String.new == ""
=> true
> String.new.encoding == "".encoding
=> false

So Ruby is actually behaving as documented; I just find the behavior surprising. Maybe I'm alone in that, though.

Any chance we could have a way to specify a default encoding for all created strings?
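For what it's worth, an encoding can be chosen per string, though that is not a global default. A minimal sketch, assuming Ruby 2.3 or later for the `encoding:` keyword (it is not available on the 2.1.1 this report was filed against); the variable names are illustrative:

```ruby
bare     = String.new                             # no literal involved
explicit = String.new(encoding: Encoding::UTF_8)  # keyword form, assumes Ruby 2.3+
copied   = String.new("")                         # copies the literal's encoding

puts bare.encoding == Encoding::BINARY            # true (BINARY is an alias of ASCII-8BIT)
puts explicit.encoding == Encoding::UTF_8         # true
puts copied.encoding == "".encoding               # true
puts bare == explicit                             # true: empty strings compare equal
```

This only covers strings the program creates itself; strings returned by methods such as unpack still need force_encoding.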
