Bug #1965

Strange behavior in Iconv under Windows (GBK)

Added by junchen wu over 2 years ago. Updated 10 months ago.

[ruby-core:24990]
Status: Closed
Start date: 08/20/2009
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: 1.9.1
ruby -v: ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]

Description

I have a UTF-8 encoded file with the following content:

#掉
config

I read it and then match it with =~ /ab/, which raises ArgumentError: invalid byte sequence in GBK.
There is something else strange:
irb> s=IO.readlines('test.utf8').join
=> "#鎺\x89\nconfig"
irb> gbk=Iconv.conv('gbk','utf-8',s)
=> "#掉\nconfig"
irb> utf=Iconv.conv('utf-8','gbk',gbk)
=> "#鎺塡nconfig"
irb> s==utf
=> false   # in Ruby1.8.7,it will say true
irb> s=~/ab/
ArgumentError: invalid byte sequence in GBK
irb> utf=~/ab/
=> nil

my environment:
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Windows XP,GBK,chcp=>936
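
A likely explanation for the error, offered as a sketch rather than a confirmed diagnosis: with a GBK locale the bytes read from the file are tagged as GBK, and the UTF-8 bytes of 掉 do not form a valid GBK sequence when followed by a newline. The snippet below rebuilds the same bytes by hand (the variable names are illustrative and not part of the report) and checks their validity under both encodings.

```ruby
# The bytes of the sample file content, built from a UTF-8 literal
# (U+6389 is the character 掉) and stripped of any encoding tag.
utf8_bytes = "#\u6389\nconfig".b

# Tag the same bytes as GBK (what a GBK locale would do) and as UTF-8.
as_gbk  = utf8_bytes.dup.force_encoding('GBK')
as_utf8 = utf8_bytes.dup.force_encoding('UTF-8')

gbk_valid  = as_gbk.valid_encoding?   # the trailing byte 0x89 is followed by "\n", which is not a GBK trail byte
utf8_valid = as_utf8.valid_encoding?

puts "valid as GBK:   #{gbk_valid}"
puts "valid as UTF-8: #{utf8_valid}"
```

A string whose bytes are invalid for its tagged encoding is what regexp matching rejects with "invalid byte sequence".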

test.utf8 - the utf-8 encoding string file (12 Bytes) junchen wu, 08/20/2009 04:43 pm

History

Updated by Yui NARUSE over 2 years ago

This seems to be caused by the iconv library.
Please try another iconv.dll.

Updated by junchen wu over 2 years ago

Maybe I need to try another iconv.dll to make s==utf return true, but then both s=~/ab/ and utf=~/ab/ would raise ArgumentError: invalid byte sequence in GBK.
I want to read a string from my UTF-8 file and match it against a regexp without an error being raised. This works fine on Linux, but not on my GBK Windows system.

Updated by Yui NARUSE over 2 years ago

Oh, I see.
You should use s = IO.readlines('test.utf8', :encoding=>'utf-8').join
or s = IO.read('test.utf8', :encoding=>'utf-8').
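
As a minimal, self-contained sketch of this suggestion: the snippet writes the sample content to a temporary file (the Tempfile setup is only there to make the example runnable and stands in for test.utf8), reads it back with an explicit :encoding option, and matches it against the regexp from the report.

```ruby
require 'tempfile'

# Create a scratch file holding the report's sample content as UTF-8 bytes.
file = Tempfile.new(['test', '.utf8'])
file.binmode
file.write("#\u6389\nconfig")   # U+6389 is the character 掉
file.close

# Read with an explicit external encoding instead of the locale default.
s = File.read(file.path, :encoding => 'utf-8')

tagged_utf8 = (s.encoding == Encoding::UTF_8)
valid       = s.valid_encoding?
match       = (s =~ /ab/)       # no match, and no ArgumentError is raised

file.unlink
puts "encoding=#{s.encoding} valid=#{valid} match=#{match.inspect}"
```

Because the string is tagged with the encoding its bytes were actually written in, the match runs normally regardless of the system code page.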

Updated by junchen wu over 2 years ago

Thanks so much, it works fine now!
Is there a setting that makes IO read all files with :encoding=>'utf-8' by default, or should IO detect the file's encoding and set it automatically before reading?
Rails reads files with File.read(), so if :encoding=>'utf-8' has to be added to every file read, there will be a lot of work to do ;-)
Sorry for my poor knowledge of Ruby usage, and thanks for your patience!

Updated by Yui NARUSE over 2 years ago

  • Status changed from Open to Closed
> Is there some setting to make the IO read all files using :encoding=>'utf-8' by default
Encoding.default_external gives the default.
Rails may use this and set it to UTF-8, so you shouldn't change it.
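
As a rough illustration of how this default applies, the sketch below opens a scratch file (created with Tempfile purely for the example) once without any encoding and once with an explicit one, and compares the external encoding each handle reports.

```ruby
require 'tempfile'

# An empty scratch file is enough to inspect how it gets opened.
file = Tempfile.new('sample')
file.close

default_enc = Encoding.default_external

# Opened for reading without an encoding: the handle reports the default.
implicit_enc = File.open(file.path, 'r') { |f| f.external_encoding }

# Opened with an explicit encoding in the mode string: the default is ignored.
explicit_enc = File.open(file.path, 'r:utf-8') { |f| f.external_encoding }

file.unlink
puts "default:  #{default_enc}"
puts "implicit: #{implicit_enc}"
puts "explicit: #{explicit_enc}"
```

The default can also be chosen when starting the interpreter, for example with ruby -E UTF-8, which avoids reassigning it inside application code.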


The following give detailed information:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
http://blog.grayproductions.net/articles/understanding_m17n
http://github.com/candlerb/string19/blob/361c7d9acf1745006fb3f35e94a1ee844d0bff07/string19.rb

> should the IO check the file encoding and auto set this before read it?
'EncDet' is meant for that, but it has not been merged yet because of a naming problem.

The following are written in Japanese, but you can see the candidate names.
If you have a good name, please suggest it.
http://redmine.ruby-lang.org/issues/show/973
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/33628

Anyway, if you know the encoding of a file, specifying it explicitly is the safest way.
