Bug #10111
gdbm truncated UTF-8 data problem
Description
A script that reproduces the problem:
    # coding: utf-8
    require 'gdbm'
    data = "\xEA\xB0\x80ABCDEF"
    db = GDBM.new( 'test.db', 0666 )
    db['key'] = data
    throw 'data truncated!!' if db['key'] != data
History
#1
[ruby-core:64215]
Updated by nobu (Nobuyoshi Nakada) over 3 years ago
gdbm doesn't preserve encodings now.
#2
[ruby-core:64217]
Updated by testors (KiHyun Kang) over 3 years ago
Nobuyoshi Nakada wrote:
> gdbm doesn't preserve encodings now.
gdbm doesn't have to preserve encodings.
ext/dbm works well but ext/gdbm does not, because ext/gdbm uses 'length' to get the size.
'length' is not suitable for determining the actual size.
Use 'bytesize' instead of 'length'.
#3
[ruby-core:64218]
Updated by nobu (Nobuyoshi Nakada) over 3 years ago
KiHyun Kang wrote:
> Nobuyoshi Nakada wrote:
>> gdbm doesn't preserve encodings now.
> gdbm doesn't have to preserve encodings.
    $ ./ruby -v -rgdbm -e 'data = "\xEA\xB0\x80ABCDEF"' -e 'db = GDBM.new("test.db", 0666)' -e 'db["key"] = data' -e 'p db["key"] == data.b'
    ruby 2.1.2p195 (2014-08-04 revision 47056) [x86_64-darwin13.0]
    true
> ext/dbm works well but ext/gdbm does not, because ext/gdbm uses 'length' to get the size.
> 'length' is not suitable for determining the actual size.
> Use 'bytesize' instead of 'length'.
I can't understand what you mean at all.
#4
[ruby-core:64408]
Updated by akr (Akira Tanaka) over 3 years ago
The data is not truncated but has a different encoding (as nobu pointed out at first).
    % cat t.gdbm.rb
    # coding: utf-8
    require 'gdbm'
    data = "\xEA\xB0\x80ABCDEF"
    db = GDBM.new( 'test.db', 0666 )
    db['key'] = data
    p [db['key'].b, db['key'].encoding]
    p [data.b, data.encoding]
    throw 'data truncated!!' if db['key'] != data
    % ./ruby -v t.gdbm.rb
    ruby 2.2.0dev (2014-08-15 trunk 47187) [x86_64-linux]
    ["\xEA\xB0\x80ABCDEF", #<Encoding:ASCII-8BIT>]
    ["\xEA\xB0\x80ABCDEF", #<Encoding:UTF-8>]
    t.gdbm.rb:10:in `throw': uncaught throw "data truncated!!" (ArgumentError)
            from t.gdbm.rb:10:in `<main>'
dbm behaves the same as gdbm.
    % cat t.dbm.rb
    # coding: utf-8
    require 'dbm'
    data = "\xEA\xB0\x80ABCDEF"
    db = DBM.new( 'test.db', 0666 )
    db['key'] = data
    p [db['key'].b, db['key'].encoding]
    p [data.b, data.encoding]
    throw 'data truncated!!' if db['key'] != data
    % ./ruby -v t.dbm.rb
    ruby 2.2.0dev (2014-08-15 trunk 47187) [x86_64-linux]
    ["\xEA\xB0\x80ABCDEF", #<Encoding:ASCII-8BIT>]
    ["\xEA\xB0\x80ABCDEF", #<Encoding:UTF-8>]
    t.dbm.rb:10:in `throw': uncaught throw "data truncated!!" (ArgumentError)
            from t.dbm.rb:10:in `<main>'
#5
Updated by akr (Akira Tanaka) about 3 years ago
- Status changed from Open to Rejected
gdbm (and dbm) don't record the encoding.
So the current behavior is natural and not a bug, I think.
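For readers who hit the same comparison failure, here is a minimal sketch of the behavior discussed in this thread and one possible caller-side workaround. It does not open a database: `data.b` stands in for the value read back from gdbm, on the assumption (taken from the outputs above) that the stored bytes come back tagged ASCII-8BIT. `restore_encoding` is a hypothetical helper, not part of gdbm or dbm.

```ruby
# Sketch only: simulates a gdbm round trip without opening a database.
# `restore_encoding` is a hypothetical helper, not part of gdbm or dbm.

# Re-tag raw bytes with the encoding the caller knows they were written in.
# The bytes themselves are left unchanged.
def restore_encoding(raw, encoding = Encoding::UTF_8)
  raw.dup.force_encoding(encoding)
end

data = "\xEA\xB0\x80ABCDEF".dup.force_encoding(Encoding::UTF_8)

# Stand-in for db['key']: same bytes, tagged ASCII-8BIT (assumption based on
# the outputs shown in this thread).
fetched = data.b

p data.length                        # => 7 (characters)
p data.bytesize                      # => 9 (bytes)
p fetched.bytesize                   # => 9, nothing was truncated
p fetched == data                    # => false, the encodings differ
p restore_encoding(fetched) == data  # => true
```

Comparing against `data.b`, as in comment #3, works equally well when only the bytes matter. Re-tagging is the option to pick when the value will be used as text afterwards, and it assumes the caller already knows which encoding was used when the value was stored.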