Feature #5248

Faster PStore

Added by Masaki Matsushita over 3 years ago. Updated about 3 years ago.

[ruby-core:39172]
Status:Closed
Priority:Normal
Assignee:Hiroshi Nakamura

Description

=begin
Hellow.

I wrote a patch to make PStore more faster.
What I did as follows:

:deferred check sum calculation

PStore judges whether it should modify database file or not by 2 steps.
First, it compares data size between database file and marshal data to write.
If both size are different, it writes data to database file.
Second, if both sizes are same, it compares Digest::MD5.digest of both still more.

However, PStore calculates a check sum of data to write before size comparison.
If PStore can judge it should modify database file in size comparison, this calculation will be useless.
Consequently, I modified PStore to calculate a check sum after size comparison.

:check-sum calculation by String#sum

As stated above, Present PStore calculates checksum of database file by Digest::MD5.digest to judge whether it should modify file or not.
However, Digest::MD5.digest is cryptographic hash function and I think it is too strong to use as a mere check sum.
Therefore, I modified PStore to use not Digest::MD5.digest but String#sum.

:deferred File#truncate

PStore puts back file pointer to the head and truncates file size to zero before writing to database file as below.

file.rewind
file.truncate(0)
file.write(data)
(pstore.rb at line 486~488)

However, truncation by File#truncate is slow and it is the bottleneck of PStore.
I modified it as below.

file.rewind
file.write(data)
file.truncate(data.size)

It only puts back file pointer before write. Truncation is done after writing.
In this way, size needs to be truncate will be minimum and it makes PStore faster.

:performance

I benchmarked PStore as below:

require 'pstore'

p = PStore.new("foo")
p.transaction { p["hoge"] = "hoge" * ARGV.first.to_i }

10000.times do
  p.transaction { p["hoge"] += "hoge" }
end

Present PStore:

% time ruby pstore_bench.rb 1000
ruby pstore_bench.rb 1000  2.94s user 2.43s system 69% cpu 7.723 total
% time ruby pstore_bench.rb 10000
ruby pstore_bench.rb 10000  5.37s user 2.99s system 70% cpu 11.810 total
% time ruby pstore_bench.rb 100000
ruby pstore_bench.rb 100000  31.98s user 11.09s system 69% cpu 1:02.15 total

New PStore:

% time ruby pstore_bench.rb 1000
ruby pstore_bench.rb 1000  1.67s user 0.44s system 99% cpu 2.119 total
% time ruby pstore_bench.rb 10000
ruby pstore_bench.rb 10000  3.24s user 0.63s system 99% cpu 3.876 total
% time ruby pstore_bench.rb 100000
ruby pstore_bench.rb 100000  14.29s user 3.13s system 100% cpu 17.416 total

As a result, new PStore is faster.
It can be said that new PStore is the faster, the bigger database file is.
I attached a patch. PStore applied the patch passes test-all.
=end

patch.diff Magnifier (4.77 KB) Masaki Matsushita, 08/29/2011 11:52 AM

patch.diff Magnifier (4.5 KB) Masaki Matsushita, 12/20/2011 03:22 PM

perf.diff Magnifier - performance optimizations (2.18 KB) Masaki Matsushita, 12/20/2011 04:52 PM

cosmetic.diff Magnifier - trivial cosmetic changes (2.4 KB) Masaki Matsushita, 12/20/2011 04:52 PM

Associated revisions

Revision 34083
Added by Hiroshi Nakamura about 3 years ago

  • PStore content update perf optimization. Patch by Masaki Matsushita.
    See #5248.

  • lib/pstore.rb (save_data):

    • Delete inadequate Marshal check.
    • Deferred file truncation: when writing the new content, truncate the saved file to the data size after writing the data, instead of truncating whole bytes before writing data.
    • Deferred MD5 calculation: when comparing MD5 hash to check the content modification, calculate MD5 hash of new data iif the content length is differ from the old one.
    • Compare content size with String#bytesize instead of String#size.

Revision 34083
Added by Hiroshi Nakamura about 3 years ago

  • PStore content update perf optimization. Patch by Masaki Matsushita.
    See #5248.

  • lib/pstore.rb (save_data):

    • Delete inadequate Marshal check.
    • Deferred file truncation: when writing the new content, truncate the saved file to the data size after writing the data, instead of truncating whole bytes before writing data.
    • Deferred MD5 calculation: when comparing MD5 hash to check the content modification, calculate MD5 hash of new data iif the content length is differ from the old one.
    • Compare content size with String#bytesize instead of String#size.

Revision 34084
Added by Hiroshi Nakamura about 3 years ago

Cosmetic changes of lib/pstore.rb. Patch by Masaki Matsushita. See #5248.

Revision 34084
Added by Hiroshi Nakamura about 3 years ago

Cosmetic changes of lib/pstore.rb. Patch by Masaki Matsushita. See #5248.

History

#1 Updated by Yui NARUSE over 3 years ago

I agree with your points except following one.

Masaki Matsushita wrote:

  • check-sum calculation by String#sum

As stated above, Present PStore calculates checksum of database file by Digest::MD5.digest to judge whether it should modify file or not.
However, Digest::MD5.digest is cryptographic hash function and I think it is too strong to use as a mere check sum.
Therefore, I modified PStore to use not Digest::MD5.digest but String#sum.

When the checksum of your data accidentally equals to old data, PStore won't save the data and you lose it.

#2 Updated by Kenta Murata over 3 years ago

Yui NARUSE wrote:

When the checksum of your data accidentally equals to old data, PStore won't save the data and you lose it.

The accident you mentioned seems to occur in the current implementation probabilistically.
It will lead troublesome bugs. I believe it should be fixed.

#3 Updated by Yui NARUSE over 3 years ago

Kenta Murata wrote:

Yui NARUSE wrote:

When the checksum of your data accidentally equals to old data, PStore won't save the data and you lose it.

The accident you mentioned seems to occur in the current implementation probabilistically.
It will lead troublesome bugs. I believe it should be fixed.

I think it is OK until the probability is enough small and can't be attacked.

#4 Updated by Masaki Matsushita over 3 years ago

Kenta Murata wrote:

The accident you mentioned seems to occur in the current implementation probabilistically.

As long as PStore uses any check sums or any cryptographic hash functions, there is possibility of collision.
To avoid such bugs, it is needed to compare the whole data or to write data in every time.

#5 Updated by Masaki Matsushita about 3 years ago

I wrote a patch using Digest::MD5.digest instead of String#sum.
I think it is enough to avoid collision.

#6 Updated by Hiroshi Nakamura about 3 years ago

Matsushita-san, would you please split the patch into 2 parts, one for perf optimizations and another for trivial cosmetic changes?

There's no pstore.rb maintainer now but I think I can understand perf fixes and merge it for actual pstore.rb users like you.

#7 Updated by Masaki Matsushita about 3 years ago

Yes.
I split the patch.

#8 Updated by Hiroshi Nakamura about 3 years ago

=begin
Thanks! perf.diff looks OK to me. I'm evaluating this. cosmetic.diff would be handled later.

require 'pstore'
require 'benchmark'

def run(size)
file = "pstore_#{size}"
File.unlink(file) if File.exist?(file)
store = PStore.new(file)
store.transaction do
store["hoge"] = "hoge" * size
end
1000.times do
store.transaction do
store["hoge"] += "hoge"
end
end
end

Benchmark.bm(6) do |bm|
[1000, 10000, 100000].each do |size|
bm.report(size.to_s) { run(size) }
end
end

So far, file truncation tweak is the key.

Original:
# user system total real
1000 0.120000 0.230000 0.350000 ( 0.481725)
10000 0.260000 0.350000 0.610000 ( 0.705170)
100000 3.500000 1.840000 5.340000 ( 18.753409)

With truncation tweak:
# user system total real
1000 0.080000 0.030000 0.110000 ( 0.110230)
10000 0.300000 0.010000 0.310000 ( 0.307720)
100000 2.230000 0.480000 2.710000 ( 2.715860)

With deferred MD5 calculation:
# user system total real
1000 0.050000 0.230000 0.280000 ( 0.417151)
10000 0.120000 0.410000 0.530000 ( 0.804206)
100000 2.460000 1.750000 4.210000 ( 18.411857)

With both (full patch):
# user system total real
1000 0.070000 0.040000 0.110000 ( 0.112234)
10000 0.180000 0.050000 0.230000 ( 0.235963)
100000 1.490000 0.480000 1.970000 ( 1.963404)

I'll merge it tonight.
=end

#9 Updated by Shota Fukumori about 3 years ago

  • Assignee set to Hiroshi Nakamura

Nice performance tweak :)

#10 Updated by Shota Fukumori about 3 years ago

  • Status changed from Open to Assigned

#11 Updated by Hiroshi Nakamura about 3 years ago

  • Status changed from Assigned to Closed

Applied to trunk at r34083 (perf.diff) and r34084 (cosmetic.diff), excluding $0 -> $PROGRAM_NAME change because other libs are using $0. I'm closing this as 'fixed'.

Thanks for the patch! We needed your eyes, I even didn't know we have that unnecessary 'marshal_dump_supports_canonical_option?' method in it...

#12 Updated by Yui NARUSE about 3 years ago

Masaki Matsushita wrote:

Kenta Murata wrote:

The accident you mentioned seems to occur in the current implementation probabilistically.

As long as PStore uses any check sums or any cryptographic hash functions, there is possibility of collision.
To avoid such bugs, it is needed to compare the whole data or to write data in every time.

About such possibility, see http://www.radiumsoftware.com/0406.html#040630

#13 Updated by Masaki Matsushita about 3 years ago

Thank you, Nakamura-san.
I'm glad pstore.rb became faster.

Yui NARUSE wrote:

About such possibility, see http://www.radiumsoftware.com/0406.html#040630
Thank you for letting me know.

Also available in: Atom PDF