Bug #1965

closed

Strange behavior of Iconv under Windows (GBK)

Added by phoenix (junchen wu) over 11 years ago. Updated almost 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
ruby -v:
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Backport:
[ruby-core:24990]

Description

=begin
I have a file encoded in UTF-8 with this content:

#掉
config

When I read it and then match it with =~ /ab/, it raises ArgumentError: invalid byte sequence in GBK.
Something else is strange too:
irb> s=IO.readlines('test.utf8').join
=> "#鎺\x89\nconfig"
irb> gbk=Iconv.conv('gbk','utf-8',s)
=> "#掉\nconfig"
irb> utf=Iconv.conv('utf-8','gbk',gbk)
=> "#鎺塡nconfig"
irb> s==utf
=> false # in Ruby 1.8.7 this returns true
irb> s=~/ab/
ArgumentError: invalid byte sequence in GBK
irb> utf=~/ab/
=> nil
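
The failure above can be reproduced without the file or Iconv. The following is a minimal sketch, assuming a modern Ruby (2.0 or later, for String#b) rather than the 1.9.1 build in the report. The byte string is the UTF-8 encoding of "#掉\nconfig", and the variable names are illustrative. It suggests that the bytes read from the file are simply not a valid GBK sequence, which is why a string tagged as GBK fails in a regexp match.

```ruby
# UTF-8 bytes of "#掉\nconfig", written as escapes so the source encoding
# of this script does not matter.
raw = "#\xE6\x8E\x89\nconfig".b

# Same bytes, two different encoding tags. force_encoding only relabels
# the string; it does not convert any bytes.
as_gbk  = raw.dup.force_encoding('GBK')    # what a GBK default would produce
as_utf8 = raw.dup.force_encoding('UTF-8')  # the file's real encoding

puts as_gbk.valid_encoding?   # false: 0x89 followed by "\n" is not a GBK character
puts as_utf8.valid_encoding?  # true

begin
  as_gbk =~ /ab/              # expected to fail the same way as in the report
rescue ArgumentError => e
  puts "regexp match failed: #{e.message}"
end

p(as_utf8 =~ /ab/)            # nil: no match, and no error
```

The string is only usable in a regexp match once its encoding tag agrees with its bytes.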

my environment:
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Windows XP, GBK, chcp => 936
=end


Files

test.utf8 (12 Bytes): the UTF-8 encoded string file. phoenix (junchen wu), 08/20/2009 04:43 PM
Actions #1

Updated by naruse (Yui NARUSE) over 11 years ago

=begin
This seems to be caused by the iconv library.
Please try another iconv.dll.
=end

Actions #2

Updated by phoenix (junchen wu) over 11 years ago

=begin
Maybe I need to try another iconv.dll to make s==utf return true, but then both s=~/ab/ and utf=~/ab/ will raise ArgumentError: invalid byte sequence in GBK.
I want to read a string from my UTF-8 file and match it against a regexp without an error being raised. This works fine on Linux, but not on my GBK Windows system.
=end

Actions #3

Updated by naruse (Yui NARUSE) over 11 years ago

=begin
Oh, I see.
You should use s=IO.readlines('test.utf8',:encoding=>'utf-8').join
or s=IO.read('test.utf8',:encoding=>'utf-8').
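
As a rough illustration of this advice, the sketch below writes the bytes of the reported file to a temporary file (a hypothetical stand-in for test.utf8) and reads them back with an explicit encoding. It assumes a modern Ruby and uses the newer `encoding:` keyword spelling of the same option.

```ruby
require 'tempfile'

# Stand-in for test.utf8: the UTF-8 bytes of "#掉\nconfig".
file = Tempfile.new('test.utf8')
file.binmode
file.write("#\xE6\x8E\x89\nconfig".b)
file.close

# Both calls tag the result as UTF-8 regardless of the platform default.
s     = IO.read(file.path, encoding: 'utf-8')
lines = IO.readlines(file.path, encoding: 'utf-8')

puts s.encoding          # UTF-8
puts s.valid_encoding?   # true
p(s =~ /ab/)             # nil, with no ArgumentError
p(s =~ /config/)         # character index of the match
puts lines.length

file.unlink
```

Because the encoding is named at the call site, the result does not depend on the console code page or locale of the machine that runs it.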

=end

Actions #4

Updated by phoenix (junchen wu) over 11 years ago

=begin
Thanks so much, it works fine now!
Is there a setting that makes IO read all files with :encoding=>'utf-8' by default, or should IO detect the file's encoding and set it automatically before reading?
Rails reads files with File.read(); if :encoding=>'utf-8' must be added to every file read, that will be a lot of work ;-)
Sorry for my poor knowledge of Ruby usage, and thanks for your patience!
=end

Actions #5

Updated by naruse (Yui NARUSE) over 11 years ago

  • Status changed from Open to Closed

=begin

> Is there some setting to make the IO read all files using :encoding=>'utf-8' by default
Encoding.default_external gives the default.
Rails may rely on this and set it to UTF-8, so you shouldn't change it.
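
To see how the default interacts with an explicit option, here is a small sketch, assuming a modern Ruby. The temporary file and variable names are illustrative. It only inspects Encoding.default_external and does not change it.

```ruby
require 'tempfile'

# The process-wide default, derived from the locale or console code page
# (for example GBK under chcp 936, UTF-8 on many Linux setups).
puts Encoding.default_external

file = Tempfile.new('sample')
file.binmode
file.write("#\xE6\x8E\x89\nconfig".b)   # UTF-8 bytes of "#掉\nconfig"
file.close

# Without an option, the string is tagged with the default external encoding
# (or transcoded to Encoding.default_internal if that is set).
implicit = File.read(file.path)

# With an explicit option, the default is ignored for this call only.
explicit = File.read(file.path, encoding: 'utf-8')

puts implicit.encoding
puts explicit.encoding   # UTF-8

file.unlink
```

Passing the option per call leaves the global default untouched, so it should not interfere with whatever a framework such as Rails expects.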

The following give detailed information:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
http://blog.grayproductions.net/articles/understanding_m17n
http://github.com/candlerb/string19/blob/361c7d9acf1745006fb3f35e94a1ee844d0bff07/string19.rb

> should the IO check the file encoding and auto set this before read it?
'EncDet' does this, but it has not been merged yet because of a naming problem.

The discussions below are written in Japanese, but you can see the candidate names.
If you have a good name, please suggest it.
http://redmine.ruby-lang.org/issues/show/973
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/33628

Anyway, if you know the encoding of a file, specifying it explicitly is the safest way.
=end
