Bug #10132: unpack() ignores default encoding when generating strings, always uses ASCII-8BIT - Ruby - Ruby Issue Tracking System


Bug #10132

closed

unpack() ignores default encoding when generating strings, always uses ASCII-8BIT

Added by meta (mathew murphy) almost 11 years ago. Updated almost 11 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v: ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN

[ruby-core:64359]

Description

New strings are generated in the default encoding:

irb> __ENCODING__.name
=> "UTF-8"
irb> "ünicode".encoding.name
=> "UTF-8"

...but not if they're generated by unpack:

irb> "ünicode".split.pack('M*').unpack('M*').first
=> "\xC3\xBCnicode"
irb> "ünicode".split.pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"

Workaround is to force the encoding on every string unpack generates:

irb> "ünicode".split.pack('M*').unpack('M*').first.force_encoding(__ENCODING__.name)
=> "ünicode"
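
The workaround above can be wrapped in a small helper. This is only a minimal sketch, not an existing Ruby API: `unpack_qp` is a hypothetical name, and it assumes the decoded bytes are meant to be UTF-8 unless the caller says otherwise. Because `force_encoding` only relabels the bytes without converting them, the sketch checks `valid_encoding?` before returning.

```ruby
# Hypothetical helper (not part of Ruby): decode quoted-printable text and
# tag the result with the encoding the caller expects.
def unpack_qp(encoded, encoding = Encoding::UTF_8)
  decoded = encoded.unpack('M').first   # unpack tags this string as binary
  decoded.force_encoding(encoding)      # relabels the bytes, no transcoding
  unless decoded.valid_encoding?
    raise ArgumentError, "decoded bytes are not valid #{encoding}"
  end
  decoded
end

encoded = ["\u00FCnicode"].pack('M')    # "\u00FC" is "ü", written as an escape
text = unpack_qp(encoded)
```

The explicit validity check is a design choice for the sketch: without it, wrongly labeled bytes would only surface later, when the string is compared or concatenated.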


#1 [ruby-core:64360]

Updated by meta (mathew murphy) almost 11 years ago

In case there's confusion because of the strange splits in my examples:

["ünicode"].pack('M*').unpack('M*').first.encoding.name
=> "ASCII-8BIT"


#2 [ruby-core:64368]

Updated by nobu (Nobuyoshi Nakada) almost 11 years ago

Status changed from Open to Rejected

pack("M*") (and pack("C*")) are primarily for binary data.
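
One way to read this: a quoted-printable payload does not record which character set its bytes belong to, so unpack cannot know what encoding to tag the result with. A sketch of that situation, assuming ISO-8859-1 input purely for illustration:

```ruby
# "ü" as a single ISO-8859-1 byte (0xFC); the charset is an assumption
# made for this example only.
latin1 = "\xFC".dup.force_encoding(Encoding::ISO_8859_1)

encoded = [latin1].pack('M')          # quoted-printable text, ASCII only
decoded = encoded.unpack('M').first   # raw bytes, tagged as binary

# Only the caller knows the bytes were ISO-8859-1, so the caller relabels
# a copy of them and then transcodes that copy to UTF-8.
utf8 = decoded.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

Forcing `decoded` to UTF-8 here would produce an invalid string, since a lone 0xFC byte is not valid UTF-8.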


#3 [ruby-core:64404]

Updated by meta (mathew murphy) almost 11 years ago

The Ruby documentation says:

M | String | quoted printable, MIME encoding (see RFC2045)

And RFC 2045 section 6.7 says:

The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set.

So the Ruby documentation itself says that it's a string, not binary data, and it refers to an RFC that says the encoding is intended for textual (printable) characters.

Perhaps you were thinking of base64? I don't think I've ever seen quoted-printable used for binary data.


#4 [ruby-core:64405]

Updated by meta (mathew murphy) almost 11 years ago

Now that I've read the documentation on encodings more carefully, I think the real problem is more fundamental: __ENCODING__ doesn't determine the encoding of all created strings; it only affects strings created from string literals in the source code.

String.new.encoding => #<Encoding:ASCII-8BIT>
"".encoding         => #<Encoding:UTF-8>

So:

> String.new == ""
=> true
> String.new.encoding == "".encoding
=> false

So Ruby is actually behaving as documented; it's just that I find the behavior surprising. Maybe I'm alone in that, though.

Any chance we could have a way to specify a default encoding for all created strings?
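
For individual strings, the encoding can at least be chosen at creation time. A minimal sketch; the `encoding:` keyword on String.new assumes Ruby 2.3 or later, which is newer than the 2.1.1 this report was filed against:

```ruby
binary  = String.new                             # binary (ASCII-8BIT)
literal = ""                                     # source encoding (__ENCODING__)
copied  = String.new("")                         # takes the encoding of its argument
tagged  = String.new(encoding: Encoding::UTF_8)  # assumes Ruby 2.3+
```

None of these change what unpack returns; its results still need force_encoding as in the workaround.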

