Bug #1965
the strange thing in Iconv under windows(GBK)
| Status: | Closed | Start date: | 08/20/2009 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | - | | |
| Target version: | 1.9.1 | | |
| ruby -v: | ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32] | | |
Description
I have a file encoded in UTF-8 with this content:
#掉
config
When I read it and then match it with =~/ab/, it raises: ArgumentError: invalid byte sequence in GBK.
There is something strange:
irb> s=IO.readlines('test.utf8').join
=> "#鎺\x89\nconfig"
irb> gbk=Iconv.conv('gbk','utf-8',s)
=> "#掉\nconfig"
irb> utf=Iconv.conv('utf-8','gbk',gbk)
=> "#鎺塡nconfig"
irb> s==utf
=> false # in Ruby1.8.7,it will say true
irb> s=~/ab/
ArgumentError: invalid byte sequence in GBK
irb> utf=~/ab/
=> nil
my environment:
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Windows XP,GBK,chcp=>936
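The session above is consistent with the string being labelled GBK while its bytes are still UTF-8. The following is a minimal sketch of that situation, not taken from the report: it assumes the GBK encoding is available in the Ruby build and uses force_encoding to imitate what a read under a GBK locale would produce. The variable names are illustrative only.

```ruby
# "掉" is U+6389; the literal below is a UTF-8 string with the report's content.
utf8_bytes = "#\u6389\nconfig"

# Same bytes, but labelled GBK, imitating a read under a GBK default encoding.
# (Assumption: this mirrors what IO.readlines did on the reporter's system.)
mislabelled = utf8_bytes.dup.force_encoding("GBK")

puts mislabelled.encoding          # GBK
puts mislabelled.valid_encoding?   # false: the bytes are not well-formed GBK
puts utf8_bytes.valid_encoding?    # true

begin
  # Expected to raise, as in the report, because the string is not valid GBK.
  mislabelled =~ /ab/
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Checking valid_encoding? before matching is a cheap way to tell a mislabelled string from a genuinely corrupt one.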
History
Updated by Yui NARUSE over 2 years ago
This seems to be caused by the iconv library. Please try another iconv.dll.
Updated by junchen wu over 2 years ago
Maybe I need to try another iconv.dll to make s==utf return true, but then both s=~/ab/ and utf=~/ab/ would raise ArgumentError: invalid byte sequence in GBK. I want to read a string from my UTF-8 file and match it against a regexp without raising an error. This works fine on Linux, but not on my GBK Windows system.
Updated by Yui NARUSE over 2 years ago
Oh I see.
You should use s=IO.readlines('test.utf8',:encoding=>'utf-8').join
or s=IO.read('test.utf8',:encoding=>'utf-8')
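A self-contained sketch of the suggestion above. It writes the sample content to a temporary file (a stand-in for the reporter's test.utf8, which is not available here) and reads it back with an explicit encoding:

```ruby
require "tempfile"

# Write the sample content from the report as raw UTF-8 bytes.
file = Tempfile.new("test.utf8")
file.binmode
file.write("#\u6389\nconfig")
file.close

# Tell Ruby the file is UTF-8 instead of relying on the locale default.
s     = IO.read(file.path, :encoding => "utf-8")
lines = IO.readlines(file.path, :encoding => "utf-8").join

puts s.encoding          # UTF-8
puts s.valid_encoding?   # true
p(s =~ /ab/)             # nil: no match, and no ArgumentError
p(s =~ /config/)         # 3 (character index, not byte index)

file.unlink
```

Both calls return strings tagged as UTF-8, so the regexp match no longer depends on the console code page.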
Updated by junchen wu over 2 years ago
Thanks so much, it works fine now! Is there some setting to make IO read all files using :encoding=>'utf-8' by default, or should IO check the file encoding and set it automatically before reading? Rails reads files using File.read(); if I must add :encoding=>'utf-8' to every file read, there will be lots of work to do ;-) Sorry for my poor knowledge of Ruby usage, and thanks for your patience!
Updated by Yui NARUSE over 2 years ago
- Status changed from Open to Closed
> Is there some setting to make the IO read all files using :encoding=>'utf-8' by default

Encoding.default_external gives the default. Rails may use this and set it to UTF-8, so you shouldn't change it. The following give detailed information:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
http://blog.grayproductions.net/articles/understanding_m17n
http://github.com/candlerb/string19/blob/361c7d9acf1745006fb3f35e94a1ee844d0bff07/string19.rb

> should the IO check the file encoding and auto set this before read it?

'EncDet' is the one, but it is not merged yet because of a naming problem. The following are written in Japanese, but you can see the candidates. If you have a good name, suggest it.
http://redmine.ruby-lang.org/issues/show/973
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/33628

Anyway, if you know the encoding of a file, specifying it explicitly is the safest way.
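As a rough illustration of the two points above, this sketch inspects Encoding.default_external without changing it and then specifies the encoding explicitly through the open mode. The temporary file and variable names are illustrative, and the value printed for the default depends on the machine's locale.

```ruby
require "tempfile"

# The default applied to reads that do not name an encoding (locale-dependent).
puts Encoding.default_external

file = Tempfile.new("sample")
file.binmode
file.write("#\u6389\nconfig")
file.close

# No encoding given: the string is tagged with the external default
# (assuming Encoding.default_internal is not set).
implicit = File.read(file.path)
puts implicit.encoding

# Encoding given in the mode string: the result no longer depends on the locale.
explicit = File.open(file.path, "r:utf-8") { |f| f.read }
puts explicit.encoding          # UTF-8
puts explicit.valid_encoding?   # true

file.unlink
```

The "r:utf-8" mode string is equivalent in effect to passing :encoding=>'utf-8', so either form can be used wherever the file's encoding is known.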