<p>Ruby Issue Tracking System (https://bugs.ruby-lang.org/)<br>
Ruby master - Bug #4010: YAML fails to roundtrip non ASCII String<br>
https://bugs.ruby-lang.org/issues/4010?journal_id=14001<br>
2010-11-02T20:54:06Z, tenderlovemaking (Aaron Patterson), tenderlove@ruby-lang.org</p>
<p>On Mon, Nov 01, 2010 at 11:36:58AM +0900, Heesob Park wrote:</p>
<blockquote>
<p>Bug <a class="issue tracker-1 status-6 priority-4 priority-default closed" title="Bug: YAML fails to roundtrip non ASCII String (Rejected)" href="https://bugs.ruby-lang.org/issues/4010">#4010</a>: YAML fails to roundtrip non ASCII String<br>
<a href="http://redmine.ruby-lang.org/issues/show/4010" class="external">http://redmine.ruby-lang.org/issues/show/4010</a></p>
<p>Author: Heesob Park<br>
Status: Open, Priority: Normal<br>
Category: lib, Target version: 1.9.x<br>
ruby -v: ruby 1.9.3dev (2010-11-01 trunk 29655) [i386-mswin32_90]</p>
<p>C:>ruby -v -ryaml -e 's="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.9.3dev (2010-11-01 trunk 29655) [i386-mswin32_90]<br>
false</p>
<p>C:>ruby -v -ryaml -e 's="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.8.6 (2010-02-04 patchlevel 398) [i386-mingw32]<br>
true</p>
</blockquote>
<p>I'm pretty sure this is a known issue with Syck. Can you try to<br>
reproduce this with psych?</p>
<p>ruby -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'</p>
<p>--<br>
Aaron Patterson<br>
<a href="http://tenderlovemaking.com/" class="external">http://tenderlovemaking.com/</a></p>
<p>https://bugs.ruby-lang.org/issues/4010?journal_id=14004<br>
2010-11-02T21:58:27Z, phasis68 (Heesob Park), phasis@gmail.com</p>
<p>The same result with psych.</p>
<p>$ ruby -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.9.3dev (2010-11-02 trunk 29667) [i686-linux]<br>
/usr/local/lib/ruby/1.9.1/psych/deprecated.rb:79: warning: method redefined; discarding old to_yaml_properties<br>
/usr/local/lib/ruby/1.9.1/syck/rubytypes.rb:13: warning: previous definition of to_yaml_properties was here<br>
false</p>
<p>FYI the current encoding is 'EUC-KR'.<br>
I know it works when the encoding is 'UTF-8'.</p>
<p>https://bugs.ruby-lang.org/issues/4010?journal_id=14020<br>
2010-11-03T11:21:30Z, tenderlovemaking (Aaron Patterson), tenderlove@ruby-lang.org</p>
<p>On Tue, Nov 02, 2010 at 09:58:27PM +0900, Heesob Park wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-1 status-6 priority-4 priority-default closed" title="Bug: YAML fails to roundtrip non ASCII String (Rejected)" href="https://bugs.ruby-lang.org/issues/4010">#4010</a> has been updated by Heesob Park.</p>
<p>The same result with psych.</p>
<p>$ ruby -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.9.3dev (2010-11-02 trunk 29667) [i686-linux]<br>
/usr/local/lib/ruby/1.9.1/psych/deprecated.rb:79: warning: method redefined; discarding old to_yaml_properties<br>
/usr/local/lib/ruby/1.9.1/syck/rubytypes.rb:13: warning: previous definition of to_yaml_properties was here<br>
false</p>
<p>FYI the current encoding is 'EUC-KR'.<br>
I know it works when the encoding is 'UTF-8'.</p>
</blockquote>
<p>I think the problem is that your default_internal encoding isn't being<br>
set. YAML is usually stored as UTF-8, but psych will automatically<br>
transcode the string to whatever your default_internal encoding is set<br>
to.</p>
<p>Maybe this script will help illustrate the problem:</p>
<pre><code># coding: utf-8
require 'psych'

# Round-trip an EUC-JP string while default_internal is unset.
eucjp = "こんにちは!".encode('EUC-JP')
string = Psych.load(Psych.dump(eucjp))
p string.encoding # => #<Encoding:UTF-8>
p eucjp == string # => false

# With default_internal set, psych transcodes loaded strings to it.
Encoding.default_internal = 'EUC-JP'
string = Psych.load(Psych.dump(eucjp))
p string.encoding # => #<Encoding:EUC-JP>
p eucjp == string # => true
</code></pre>
<p>Try running your Ruby like this:</p>
<p>$ ruby -EEUC-KR:EUC-KR -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'</p>
<p>--<br>
Aaron Patterson<br>
<a href="http://tenderlovemaking.com/" class="external">http://tenderlovemaking.com/</a></p>
<p>https://bugs.ruby-lang.org/issues/4010?journal_id=14021<br>
2010-11-03T11:36:46Z, phasis68 (Heesob Park), phasis@gmail.com</p>
<p>2010/11/3 Aaron Patterson <a href="mailto:aaron@tenderlovemaking.com" class="email">aaron@tenderlovemaking.com</a>:</p>
<blockquote>
<p>Try running your Ruby like this:</p>
<p>$ ruby -EEUC-KR:EUC-KR -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'</p>
</blockquote>
<p>Here is the result:</p>
<p>$ ruby -EEUC-KR:EUC-KR -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.9.3dev (2010-11-02 trunk 29667) [i686-linux]<br>
true</p>
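<p>The effect of the internal half of <code>-E EUC-KR:EUC-KR</code> can also be sketched from inside a script. This is a minimal sketch, not something posted in the thread: <code>with_default_internal</code> and <code>yaml_roundtrip</code> are hypothetical helper names, and it assumes psych still transcodes loaded strings to <code>Encoding.default_internal</code> as described above.</p>

```ruby
require 'psych'

# Hypothetical helper (not part of Psych): run a block with
# Encoding.default_internal temporarily set, then restore the old value.
def with_default_internal(encoding)
  previous = Encoding.default_internal
  Encoding.default_internal = encoding
  yield
ensure
  Encoding.default_internal = previous
end

# Hypothetical helper: round-trip a string through YAML while
# default_internal matches the string's own encoding.
def yaml_roundtrip(string)
  with_default_internal(string.encoding) { Psych.load(Psych.dump(string)) }
end

s = "한글".encode('EUC-KR')
loaded = yaml_roundtrip(s)
p loaded.encoding
p loaded == s
```

<p>Restoring the previous value in <code>ensure</code> keeps the change local, since <code>Encoding.default_internal</code> is a process-wide setting.</p>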
<p>Did you mean it is not a bug and I must specify the default external<br>
and internal character encodings?</p>
<p>$ irb<br>
irb(main):001:0> Encoding.default_external<br>
=> #<Encoding:EUC-KR><br>
irb(main):002:0> Encoding.default_internal<br>
=> nil</p>
<p>Why can't Ruby detect Encoding.default_internal?</p>
<p>Regards,<br>
Park Heesob</p>
<p>https://bugs.ruby-lang.org/issues/4010?journal_id=14028<br>
2010-11-04T01:02:45Z, tenderlovemaking (Aaron Patterson), tenderlove@ruby-lang.org</p>
<p>On Wed, Nov 03, 2010 at 11:36:35AM +0900, Heesob Park wrote:</p>
<blockquote>
<p>$ ruby -EEUC-KR:EUC-KR -v -ryaml -e 'YAML::ENGINE.yamler = "psych"; s="한글";p YAML.load(YAML.dump(s))==s'<br>
ruby 1.9.3dev (2010-11-02 trunk 29667) [i686-linux]<br>
true</p>
<p>Did you mean it is not a bug and I must specify the default external<br>
and internal character encodings?</p>
</blockquote>
<p>Yes. YAML is stored as UTF-8 or UTF-16 (possibly UTF-32 as well, depending on<br>
which version of the spec you follow). Unless default_internal is set, strings you<br>
pull out of Psych will have an encoding that matches the source document.</p>
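<p>When setting default_internal is not an option, one way to compare a loaded string with an original in another encoding is to transcode both to a common encoding first. The following is only a sketch: <code>same_text?</code> is a hypothetical helper name, and the YAML document is written inline as UTF-8 rather than read from a file.</p>

```ruby
require 'psych'

# Hypothetical helper: compare two strings by content rather than by
# bytes, by transcoding both to UTF-8 first.
def same_text?(a, b)
  a.encode(Encoding::UTF_8) == b.encode(Encoding::UTF_8)
end

original = "한글".encode('EUC-KR')
loaded   = Psych.load("--- 한글\n") # a UTF-8 YAML document

p loaded.encoding                # the document's encoding when default_internal is unset
p original == loaded             # byte-wise comparison across encodings
p same_text?(original, loaded)   # comparison after transcoding both sides
```

<p>With default_internal unset, the plain <code>==</code> is expected to fail because the two strings hold different bytes, while the transcoded comparison succeeds.</p>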
<blockquote>
<p>$ irb<br>
irb(main):001:0> Encoding.default_external<br>
=> #<Encoding:EUC-KR><br>
irb(main):002:0> Encoding.default_internal<br>
=> nil</p>
<p>Why can't Ruby detect Encoding.default_internal?</p>
</blockquote>
<p>"default_external" indicates the encoding that files on disk probably<br>
have. A good default is the OS setting.</p>
<p>Magic comments indicate the encoding of string literals within that file.</p>
<p>"default_internal" indicates the encoding that you want strings internal<br>
to your program to have. Making that decision automatically is not so easy. Magic<br>
comments cannot be used because a program can contain many files, each with<br>
its own magic comment.</p>
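<p>The three settings can be inspected side by side. This is a small sketch; <code>encoding_settings</code> is a hypothetical helper name, and the values it prints depend on your locale and on how Ruby was started.</p>

```ruby
# Hypothetical helper: collect the three encoding settings discussed above.
def encoding_settings
  {
    # Assumed encoding of data read from outside the program (files, IO).
    default_external: Encoding.default_external,
    # Encoding of string literals in this file (magic comment or source default).
    source: __ENCODING__,
    # Target for automatic transcoding; nil unless explicitly set.
    default_internal: Encoding.default_internal
  }
end

encoding_settings.each { |name, value| puts "#{name}: #{value.inspect}" }
```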
<p>"default_internal" should be used by things like database adapters (or<br>
in this case YAML parsers) where the encoding of the external entity may<br>
differ from the encoding the user wants to use. The external entity<br>
should be transcoded to the user's "default_internal" setting. Because of<br>
this, automatically setting "default_internal" on the user's behalf<br>
would cause another problem.</p>
<p>Here is an example of what <em>could</em> happen if default_internal were<br>
automatically set:</p>
<p>"default_internal" is automatically set to something, say Shift-JIS, and<br>
you load a YAML file. You know that your YAML file is stored as UTF-8,<br>
and yet when you output the encoding from your program, you're surprised<br>
to see it report Shift-JIS! You wanted the original encoding of the file,<br>
but now you must call encode() to get it back to UTF-8. Even worse,<br>
because you went from UTF-8 => Shift-JIS => UTF-8, now there may be data<br>
loss due to encoding round trip problems.</p>
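<p>One kind of round-trip loss is easy to reproduce: text that the intermediate encoding cannot represent. The sketch below is an illustration only and may not be the exact problem meant above; <code>roundtrip_via</code> is a hypothetical helper name, and it uses <code>undef: :replace</code> so that unrepresentable characters are replaced instead of raising an error.</p>

```ruby
# Hypothetical helper: transcode text to another encoding and back,
# replacing characters the intermediate encoding cannot represent.
def roundtrip_via(text, encoding)
  text.encode(encoding, undef: :replace).encode(text.encoding)
end

japanese = "こんにちは"
korean   = "한글"

# Hiragana exists in Shift_JIS, so this text survives the round trip.
p roundtrip_via(japanese, 'Shift_JIS') == japanese

# Hangul does not exist in Shift_JIS, so the characters are replaced
# and the original text cannot be recovered.
p roundtrip_via(korean, 'Shift_JIS') == korean
```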
<p>Hope that helps!</p>
<p>--<br>
Aaron Patterson<br>
<a href="http://tenderlovemaking.com/" class="external">http://tenderlovemaking.com/</a></p>
<p>https://bugs.ruby-lang.org/issues/4010?journal_id=14042<br>
2010-11-05T00:16:57Z, tenderlovemaking (Aaron Patterson), tenderlove@ruby-lang.org</p>
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>Closing since Psych works as expected. "default_internal" must be set for strings to be automatically transcoded.</p>