Bug #2313
Closed
Incomplete encoding conversion?
Description
=begin
I get the following error in irb:
"http://localhost/posts/eeé".encode('ASCII-8BIT')
Encoding::UndefinedConversionError: "\xC3\xA9" from UTF-8 to ASCII-8BIT
from (irb):7:in `encode'
from (irb):7
from /opt/local/bin/irb:12:in `<main>'
Is this a bug?
ASCII-8BIT is (as far as I understand it) essentially binary, so you should be able to convert any string to ASCII-8BIT.
=end
Updated by naruse (Yui NARUSE) about 15 years ago
- Status changed from Open to Rejected
That is not a conversion; that is setting an encoding.
So you should use String#force_encoding(enc).
Updated by adamsalter (Adam Salter) about 15 years ago
Ok. I'm still a little unclear.
The Ruby 1.9 docs say String#encode "returns a copy of str transcoded to encoding 'encoding'".
According to James Edward Gray II's article on strings, String#force_encoding 'doesn't change the data at all, just the rules for interpreting that data'. So String#force_encoding is not a conversion/transcoding.
Shouldn't you be able to String#encode any string as ASCII-8BIT? (If not, is there somewhere I can read up more on this?)
Updated by naruse (Yui NARUSE) about 15 years ago
The data of a String consists of a byte string and an encoding.
String#encode changes both, but String#force_encoding changes only the encoding.
"Converting to ASCII-8BIT" doesn't change the byte string, so this is String#force_encoding's business.
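The distinction above can be checked directly. The following is a minimal sketch, assuming a Ruby 1.9 or later interpreter; the variable names are illustrative only, and the string is written with \u escapes so that it is UTF-8 regardless of the source file's encoding.

```ruby
# The two-character string from this thread, as UTF-8.
utf8 = "\u5143\u6C17"

# force_encoding relabels a copy: same bytes, different encoding.
binary = utf8.dup.force_encoding("ASCII-8BIT")

# encode transcodes: different bytes and a different encoding.
sjis = utf8.encode("Shift_JIS")

p utf8.bytes == binary.bytes        # => true
p utf8.encoding == binary.encoding  # => false
p utf8.bytes == sjis.bytes          # => false
p sjis.encoding.name                # => "Shift_JIS"
```

Here the relabelled copy keeps all six bytes of the UTF-8 string, while the transcoded copy holds a different, shorter byte sequence.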
Updated by adamsalter (Adam Salter) about 15 years ago
OK. Thank you.
I do think it makes sense to be able to do:
>> "元気".encode('UTF-8').encode('ASCII-8BIT').encode('UTF-8')
... even though it doesn't actually change the string bytes internally. But I guess it's only for ASCII-8BIT that it would be necessary to use String#force_encoding.
What about this?
>> "元気".encode('UTF-8').encode('SHIFT_JIS').encode('UTF-8')
=> "元気"
>> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
from (irb):24:in `encode'
from (irb):24
from /opt/local/bin/irb:12:in `<main>'
Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?
Updated by naruse (Yui NARUSE) about 15 years ago
Adam Salter wrote:
>> "元気".encode('UTF-8').force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
from (irb):24:in `encode'
from (irb):24
from /opt/local/bin/irb:12:in `<main>'
Is that a bug in the UTF-8 encoding parser? Or is it related to this problem?
OK, I'll explain step by step:
str = "元気"
# You make a String that contains "元気", encoded in some encoding
# str's byte data is some byte string that means "元気"
# str's encoding is the source encoding
str = str.encode('UTF-8')
# str is encoded to UTF-8, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is UTF-8
str.force_encoding('ASCII-8BIT')
# change str's encoding to ASCII-8BIT, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now ASCII-8BIT
Then you try str.encode('UTF-8'), and this String#encode converts the byte data:
String#encode tries to convert "\xE5" from ASCII-8BIT to UTF-8, but there is no mapping.
What you want is not a conversion; it is setting the encoding.
str.force_encoding('UTF-8')
# change str's encoding to UTF-8, so
# str's byte data is "\xE5\x85\x83\xE6\xB0\x97"
# str's encoding is now UTF-8
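The steps above can be replayed as a short script. This is only a sketch, assuming Ruby 1.9 or later; the names raw, error and restored are illustrative, and String#valid_encoding? is used here as an extra check that the relabelled bytes really are well-formed UTF-8.

```ruby
# Start from the UTF-8 string used in the walkthrough (written with \u escapes).
str = "\u5143\u6C17".encode("UTF-8")

# Drop the character information: same bytes, now labelled ASCII-8BIT.
raw = str.dup.force_encoding("ASCII-8BIT")

# Transcoding the raw bytes fails, because byte 0xE5 has no mapping
# from ASCII-8BIT to UTF-8.
error = begin
  raw.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
p error.class  # => Encoding::UndefinedConversionError

# Relabelling the same bytes as UTF-8 restores the original string.
restored = raw.dup.force_encoding("UTF-8")
p restored == str           # => true
p restored.valid_encoding?  # => true
```

Because force_encoding never checks the bytes, calling valid_encoding? afterwards is one way to confirm that the label you chose matches the data.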
Updated by adamsalter (Adam Salter) about 15 years ago
OK I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).
Thank you for your patience.
Updated by duerst (Martin Dürst) about 15 years ago
Hello Adam,
On 2009/11/01 10:35, Adam Salter wrote:
Issue #2313 has been updated by Adam Salter.
OK I understand now :) I was mixing up the available encoding converters... There is no Encoding::Converter from UTF-8 to ASCII-8BIT (or vice versa ;).
No, there should be an Encoding::Converter from UTF-8 to ASCII-8BIT (or
you should be able to create one). The underlying conversion table is
available. For example, the following works:
puts 'abc'.encode('UTF-8').encode('ASCII-8BIT')
=> abc
The reason this works is that ASCII-8BIT is defined to contain (7-bit)
ASCII. The fact that
"元気".encode('UTF-8').encode('ASCII-8BIT')
doesn't work is very similar to the fact that e.g.
"Dürst".encode('UTF-8').encode('shift_jis')
doesn't work: There is no "ü" character in Shift_JIS, and there is no
"元" character in ASCII-8BIT. So the transcoding engine has to give up,
usually with an exception. This can also be understood by noticing
that String#encode tries to preserve character identity. If we just
copied arbitrary bytes into an ASCII-8BIT string, we would still have
the same bytes (you can do that with force_encoding), but the only thing
Ruby would know is that these are bytes; it would have no idea which
characters they represent. That's why for removing such information
(e.g. with .force_encoding('ASCII-8BIT')) as well as for adding such
information (e.g. with .force_encoding('UTF-8')), we use a long and
forceful method name that should give programmers the message "watch
out, you need to know by yourself what you're doing".
Regards, Martin.
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
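The point about character identity in the reply above can be illustrated with a small helper. This is a sketch only: transcodable? is a hypothetical method written for this example, not part of Ruby, and the strings use \u escapes so the script does not depend on the source file's encoding.

```ruby
# Hypothetical helper: true if every character of str exists in the target
# encoding, false if String#encode has to give up on an undefined character.
def transcodable?(str, target)
  str.encode(target)
  true
rescue Encoding::UndefinedConversionError
  false
end

p transcodable?("abc", "ASCII-8BIT")           # => true  (7-bit ASCII is contained)
p transcodable?("\u5143\u6C17", "ASCII-8BIT")  # => false (no such characters)
p transcodable?("D\u00FCrst", "Shift_JIS")     # => false (no u-umlaut in Shift_JIS)
p transcodable?("\u5143\u6C17", "Shift_JIS")   # => true
```

The helper only rescues Encoding::UndefinedConversionError, so other failures, such as invalid byte sequences in the input, would still raise.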