Bug #2313

Incomplete encoding conversion?

Added by adamsalter (Adam Salter) about 10 years ago. Updated over 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 1.9.1p243 (2009-07-16 revision 24175) [i386-darwin10]
Backport:
[ruby-core:26429]

Description

I get the following error in irb:

"http://localhost/posts/eeé".encode('ASCII-8BIT')
Encoding::UndefinedConversionError: "\xC3\xA9" from UTF-8 to ASCII-8BIT
from (irb):7:in `encode'
from (irb):7
from /opt/local/bin/irb:12:in `<main>'

Is this a bug?

ASCII-8BIT is (as far as I understand it) essentially binary, so you should be able to convert any string to ASCII-8BIT.


Related issues

Has duplicate Ruby master - Bug #2411: String#encode fails but eval("#coding:") works (Rejected, 11/29/2009)

History

#1

Updated by naruse (Yui NARUSE) about 10 years ago

  • Status changed from Open to Rejected

That is not a conversion; that is setting an encoding.
So you should use String#force_encoding(enc).

#2

Updated by adamsalter (Adam Salter) about 10 years ago

Ok. I'm still a little unclear.

The Ruby 1.9 docs say String#encode "returns a copy of str transcoded to encoding 'encoding'".

According to James Edward Gray's article on strings, String#force_encoding 'doesn't change the data at all, just the rules for interpreting that data'. So String#force_encoding is not a conversion/transcoding.

Shouldn't you be able to String#encode any string as ASCII-8BIT? (If not, is there somewhere I can read up more on this?)

#3

Updated by naruse (Yui NARUSE) about 10 years ago

A String's data consists of a byte string and an encoding.

String#encode changes both, but String#force_encoding changes only its encoding.

You know, "converting to ASCII-8BIT" doesn't change its byte string,
so this is String#force_encoding's business.
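
The distinction can be checked directly by comparing bytes and encodings. The following is a minimal sketch, assuming a UTF-8 source file on a current Ruby; ISO-8859-1 is chosen here only as an illustrative target encoding and does not appear in the thread.

```ruby
# encoding: utf-8

original = "é"                       # UTF-8, stored as the two bytes C3 A9

# force_encoding keeps the bytes and only changes the encoding label.
# dup is used so the literal itself is left untouched.
relabeled = original.dup.force_encoding("ASCII-8BIT")

# encode transcodes: the bytes are rewritten for the target encoding
# (ISO-8859-1 is an assumption made for this illustration).
transcoded = original.encode("ISO-8859-1")

puts "original:   #{original.bytes.inspect} #{original.encoding}"
puts "relabeled:  #{relabeled.bytes.inspect} #{relabeled.encoding}"
puts "transcoded: #{transcoded.bytes.inspect} #{transcoded.encoding}"
```

Only the transcoded copy has different bytes; the relabeled copy differs from the original in its encoding alone.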

#4

Updated by adamsalter (Adam Salter) about 10 years ago

OK. Thank you.

I do think it makes sense to be able to do:

>> "元気".encode('UTF-8').encode('ASCII-8BIT').encode('UTF-8')

.. even though it doesn't actually change the string bytes internally. But, I guess it's only for ASCII-8BIT that it would be necessary to use String#force_encoding.

What about this?

>> "元気".encode('UTF-8').encode('SHIFT_JIS').encode('UTF-8')
=> "元気"
>> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
    from (irb):24:in `encode'
    from (irb):24
    from /opt/local/bin/irb:12:in `<main>'

Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?

#5

Updated by naruse (Yui NARUSE) about 10 years ago

>> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
  from (irb):24:in `encode'
  from (irb):24
  from /opt/local/bin/irb:12:in `<main>'

Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?

OK, I'll explain step by step:

str = "元気"
# You make a String that contains "元気", encoded in some encoding
# str's byte data is some byte string which means "元気"
# str's encoding is the source encoding
str = str.encode('UTF-8')
# str is encoded to UTF-8, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is UTF-8
str.force_encoding('ASCII-8BIT')
# change str's encoding to ASCII-8BIT, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now ASCII-8BIT

Then you try str.encode('UTF-8'), and this String#encode converts the byte data:
String#encode tries to convert "\xE5" from ASCII-8BIT to UTF-8, but there is no mapping.
What you want here is not a conversion; it is setting the encoding.

str.force_encoding('UTF-8')
# change str's encoding to UTF-8, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now UTF-8
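
The steps above can be collected into one script. This is a sketch, assuming a UTF-8 source file; the variable names are illustrative, and dup is used so each step works on its own copy.

```ruby
# encoding: utf-8

str = "元気".encode("UTF-8")

# Relabel the same six bytes as binary; no byte is changed.
binary = str.dup.force_encoding("ASCII-8BIT")

# Transcoding from ASCII-8BIT to UTF-8 fails on the first non-ASCII byte,
# because there is no mapping for it.
error = nil
begin
  binary.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error = e
end

# Setting the encoding back restores the original string.
restored = binary.dup.force_encoding("UTF-8")

puts "binary bytes: #{binary.bytes.map { |b| format('%02X', b) }.join(' ')}"
puts "encode raised: #{error.class}"
puts "restored == str: #{restored == str}"
```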

#6

Updated by adamsalter (Adam Salter) about 10 years ago

OK I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).

Thank you for your patience.

#7

Updated by duerst (Martin Dürst) about 10 years ago

Hello Adam,

On 2009/11/01 10:35, Adam Salter wrote:

Issue #2313 has been updated by Adam Salter.

OK I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).

No, there should be an Encoding::Converter from UTF-8 to ASCII-8BIT (or
you should be able to create one). The underlying conversion table is
available. For example, the following works:

puts 'abc'.encode('UTF-8').encode('ASCII-8BIT')
=> abc

The reason this works is that ASCII-8BIT is defined to contain (7-bit)
ASCII. The fact that

"元気".encode('UTF-8').encode('ASCII-8BIT')

doesn't work is very similar to the fact that e.g.

"Dürst".encode('UTF-8').encode('shift_jis')

doesn't work: There is no "ü" character in Shift_JIS, and there is no
"元" character in ASCII-8BIT. So the transcoding engine has to give up,
usually with an exception. This can also be understood when noticing
that String#encode tries to preserve character identity. If we just
copied arbitrary bytes into an ASCII-8BIT string, we would still have
the same bytes (you can do that with force_encoding), but the only thing
Ruby knows is that these are bytes, it has no idea which characters they
represent. That's why for removing such information (e.g. with
.force_encoding('ASCII-8BIT')) as well as for adding such information
(e.g. with .force_encoding('UTF-8')), we use a long and forceful method
name that should give programmers the message "watch out, you need to
know by yourself what you're doing".

Regards, Martin.


#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
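
The cases Martin describes can be probed with a small helper. This is a sketch, assuming a UTF-8 source file: transcodable? is a hypothetical helper written for this illustration, not part of Ruby, and it rescues the general EncodingError rather than relying on one specific subclass. The last step is an added illustration of why force_encoding needs care: it will happily label bytes that are not valid in the named encoding.

```ruby
# encoding: utf-8

# Hypothetical helper: true if String#encode can transcode str to target.
def transcodable?(str, target)
  str.encode(target)
  true
rescue EncodingError
  false
end

ascii_to_binary  = transcodable?("abc", "ASCII-8BIT")    # ASCII is part of ASCII-8BIT
kanji_to_binary  = transcodable?("元気", "ASCII-8BIT")    # no such characters in ASCII-8BIT
umlaut_to_sjis   = transcodable?("Dürst", "Shift_JIS")   # no "ü" in Shift_JIS
kanji_to_sjis    = transcodable?("元気", "Shift_JIS")     # both characters exist in Shift_JIS

# force_encoding performs no check: two bytes cut out of a three-byte
# UTF-8 character can still be labeled UTF-8, but the result is invalid.
fragment = "元気".b[0, 2].force_encoding("UTF-8")

puts "abc   -> ASCII-8BIT: #{ascii_to_binary}"
puts "元気  -> ASCII-8BIT: #{kanji_to_binary}"
puts "Dürst -> Shift_JIS:  #{umlaut_to_sjis}"
puts "元気  -> Shift_JIS:  #{kanji_to_sjis}"
puts "fragment valid?      #{fragment.valid_encoding?}"
```

Use valid_encoding? after force_encoding whenever the origin of the bytes is not fully known.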
