Bug #2313: Incomplete encoding conversion? - Ruby - Ruby Issue Tracking System

Bug #2313

closed

Incomplete encoding conversion?

Added by adamsalter (Adam Salter) over 15 years ago. Updated about 14 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v: ruby 1.9.1p243 (2009-07-16 revision 24175) [i386-darwin10]
Backport:
[ruby-core:26429]

Description

I get the following error in irb:

"http://localhost/posts/eeé".encode('ASCII-8BIT')
Encoding::UndefinedConversionError: "\xC3\xA9" from UTF-8 to ASCII-8BIT
	from (irb):7:in `encode'
	from (irb):7
	from /opt/local/bin/irb:12:in `<main>'

Is this a bug?

ASCII-8BIT is (as far as I understand it) essentially binary, so you should be able to convert any string to ASCII-8BIT.

Updated by naruse (Yui NARUSE) over 15 years ago

Status changed from Open to Rejected

That is not a conversion; that is setting an encoding.
So you should use String#force_encoding(enc).

Updated by adamsalter (Adam Salter) over 15 years ago

Ok. I'm still a little unclear.

The Ruby 1.9 docs say String#encode "returns a copy of str transcoded to encoding 'encoding'".

James Edward Gray II's article on strings says String#force_encoding 'doesn't change the data at all, just the rules for interpreting that data'. So String#force_encoding is not a conversion/transcoding.

Shouldn't you be able to String#encode any string as ASCII-8BIT? (If not, is there somewhere I can read up more on this?)

Updated by naruse (Yui NARUSE) over 15 years ago

A String's data consists of a byte string and an encoding.

String#encode changes both, but String#force_encoding changes only the encoding.

"Converting to ASCII-8BIT" doesn't change the byte string,
so it is a job for String#force_encoding.
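
The difference between the two methods can be checked directly. The following is a minimal sketch, not part of the original thread; the helper name `parts` is hypothetical and only exists to show a string's bytes and encoding side by side.

```ruby
# Hypothetical helper: returns [bytes, encoding], the two parts of a String.
def parts(str)
  [str.bytes, str.encoding]
end

utf8 = "元気".encode("UTF-8")

# String#encode transcodes: the bytes and the encoding both change.
sjis = utf8.encode("Shift_JIS")

# String#force_encoding relabels: only the encoding changes.
# dup keeps the original string untouched, because force_encoding mutates.
binary = utf8.dup.force_encoding("ASCII-8BIT")

p parts(utf8)
p parts(sjis)
p parts(binary)
```

The first and third lines of output should show the same six bytes with different encodings, while the second shows different bytes.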

Updated by adamsalter (Adam Salter) over 15 years ago

OK. Thank you.

I do think it makes sense to be able to do:

>> "元気".encode('UTF-8').encode('ASCII-8BIT').encode('UTF-8')

.. even though it doesn't actually change the string bytes internally. But, I guess it's only for ASCII-8BIT that it would be necessary to use String#force_encoding.

What about this?

>> "元気".encode('UTF-8').encode('SHIFT_JIS').encode('UTF-8')
=> "元気"
>> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
	from (irb):24:in `encode'
	from (irb):24
	from /opt/local/bin/irb:12:in `<main>'

Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?

Updated by naruse (Yui NARUSE) over 15 years ago

> >> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
> Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
> 	from (irb):24:in `encode'
> 	from (irb):24
> 	from /opt/local/bin/irb:12:in `<main>'
>
> Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?

OK, I'll explain step by step:

str = "元気"
# You create a String containing "元気" in some encoding:
# str's byte data is whatever byte string represents "元気" in that encoding
# str's encoding is the source encoding
str = str.encode('UTF-8')
# str is transcoded to UTF-8, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is UTF-8
str.force_encoding('ASCII-8BIT')
# str's encoding is changed to ASCII-8BIT, so
# str's byte data is still "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now ASCII-8BIT

Then you call str.encode('UTF-8'), and String#encode converts the byte data:
it tries to convert "\xE5" from ASCII-8BIT to UTF-8, but there is no mapping for that byte.
What you want here is not a conversion; you want to set the encoding.

str.force_encoding('UTF-8')
# str's encoding is changed to UTF-8, so
# str's byte data is still "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now UTF-8
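
The steps above can be replayed as a runnable sketch. This is an illustration, not code from the thread: the helper `bytes_as_utf8` is a hypothetical name, and the `valid_encoding?` check is an added safeguard, since relabeling performs no validation of its own.

```ruby
# Hypothetical helper: reinterpret raw bytes as UTF-8 without touching them,
# then confirm that the bytes really form valid UTF-8.
def bytes_as_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

raw = "元気".encode("UTF-8").force_encoding("ASCII-8BIT")

# Transcoding from ASCII-8BIT fails: the byte 0xE5 has no character mapping.
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

# Relabeling succeeds, because the bytes were UTF-8 all along.
puts bytes_as_utf8(raw) == "元気"
```

Checking `valid_encoding?` after `force_encoding` is a design choice for this sketch: it catches the case where the programmer's assumption about the bytes turns out to be wrong.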

Updated by adamsalter (Adam Salter) over 15 years ago

OK, I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).

Thank you for your patience.

Updated by duerst (Martin Dürst) over 15 years ago

Hello Adam,

On 2009/11/01 10:35, Adam Salter wrote:

> Issue #2313 has been updated by Adam Salter.
>
> OK I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).

No, there should be an Encoding::Converter from UTF-8 to ASCII-8BIT (or
you should be able to create one). The underlying conversion table is
available. For example, the following works:

puts 'abc'.encode('UTF-8').encode('ASCII-8BIT')
=> abc

The reason this works is that ASCII-8BIT is defined to contain (7-bit)
ASCII. The fact that

"元気".encode('UTF-8').encode('ASCII-8BIT')

doesn't work is very similar to the fact that e.g.

"Dürst".encode('UTF-8').encode('shift_jis')

doesn't work: There is no "ü" character in Shift_JIS, and there is no
"元" character in ASCII-8BIT. So the transcoding engine has to give up,
usually with an exception. This can also be understood by noticing
that String#encode tries to preserve character identity. If we just
copied arbitrary bytes into an ASCII-8BIT string, we would still have
the same bytes (you can do that with force_encoding), but the only thing
Ruby would know is that these are bytes; it would have no idea which
characters they represent. That's why for removing such information (e.g. with
.force_encoding('ASCII-8BIT')) as well as for adding such information
(e.g. with .force_encoding('UTF-8')), we use a long and forceful method
name that should give programmers the message "watch out, you need to
know for yourself what you're doing".

Regards, Martin.

#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
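
The four cases Martin describes can be collected in one small sketch. It is an illustration only; `transcodable?` is a hypothetical helper name, and it assumes that a missing target character surfaces as Encoding::UndefinedConversionError, as in the errors quoted in this thread.

```ruby
# Hypothetical helper: true if String#encode can carry every character of
# `str` over to `target`, false if some character has no counterpart there.
def transcodable?(str, target)
  str.encode(target)
  true
rescue Encoding::UndefinedConversionError
  false
end

p transcodable?("abc", "ASCII-8BIT")   # ASCII characters exist in ASCII-8BIT
p transcodable?("元気", "ASCII-8BIT")  # no "元" character in ASCII-8BIT
p transcodable?("元気", "Shift_JIS")   # both characters exist in Shift_JIS
p transcodable?("Dürst", "Shift_JIS")  # no "ü" character in Shift_JIS
```

When the goal is to keep the bytes and drop the character information, the thread's advice applies: use force_encoding rather than encode.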
