Bug #9229

[patch] expose rb_fstring() as String#dedup

Added by Aman Gupta 4 months ago. Updated 4 months ago.

[ruby-core:58963]
Status:Closed
Priority:Normal
Assignee:Yukihiro Matsumoto
Category:-
Target version:2.1.0
ruby -v:trunk
Backport:1.9.3: UNKNOWN, 2.0.0: UNKNOWN

Description

After recent commits, Ruby is using the new rb_fstring() API extensively inside the VM to de-duplicate internal strings.
This technique has proven very successful: it has eliminated the majority of long-lived strings in large applications.

I think we should expose this functionality to Ruby code as well.

This API would allow gem/library maintainers to de-duplicate strings in any long-lived objects they create.
For example, many gems today contain large constant lookup tables that hold many strings. These tables are often loaded from disk via YAML or JSON:

Addressable::IDNA::UNICODE_DATA
MIME::Types.instance_variable_get(:@types)
TZInfo::Timezone.class_variable_get(:@@loaded_zones)
ActiveSupport::Multibyte::UCD
TTFunk::Table::Post::Format10::POSTSCRIPT_GLYPHS
Money::Currency::TABLE
Rack::Utils::HTTP_STATUS_CODES

In our app, strings in these tables account for a huge portion of long-lived strings in our runtime.
Another example is strings referenced by long-lived RubyGems specifications. From an ObjectSpace.dump_all snapshot:

$ grep '"MIT"' heap.json | wc -l
73

With the proposed patch, a user (or ideally library maintainer) can easily de-duplicate strings in known long-lived objects:

Gem::Specification._all.each{ |s| s.license = s.license.dedup if s.license }.size
=> 304
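
The same idea applied to a lookup table loaded from JSON can be sketched in plain Ruby. This is only an illustration: StringPool and deep_dedup are hypothetical names, and the Hash-backed pool is a stand-in for the proposed String#dedup, not the rb_fstring() table itself.

```ruby
require "json"

# Hypothetical stand-in for the proposed String#dedup: a Hash-backed pool
# that hands back one frozen String per distinct value.
module StringPool
  TABLE = {}

  def self.dedup(str)
    frozen = str.frozen? ? str : str.dup.freeze
    TABLE[frozen] ||= frozen
  end
end

# Recursively replace every String inside nested Hash/Array data with its
# pooled, frozen equivalent (hypothetical helper).
def deep_dedup(obj)
  case obj
  when String then StringPool.dedup(obj)
  when Array  then obj.map { |v| deep_dedup(v) }
  when Hash   then obj.each_with_object({}) { |(k, v), h| h[deep_dedup(k)] = deep_dedup(v) }
  else obj
  end
end

# Made-up sample data standing in for a table read from disk.
raw = JSON.parse('[{"code":"USD","symbol":"$"},{"code":"USD","symbol":"$"},{"code":"CAD","symbol":"$"}]')
table = deep_dedup(raw)

values = table.flat_map(&:values)
unique_values = values.map(&:object_id).uniq.size
puts "#{values.size} string values, #{unique_values} distinct objects"
```

After the walk, equal values in the table share a single frozen object, which is the effect a library maintainer would be after when calling dedup on a long-lived table.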

A simple implementation follows.

diff --git a/string.c b/string.c
index f8dd03d..8294c78 100644
--- a/string.c
+++ b/string.c
@@ -145,7 +145,7 @@ fstr_update_callback(st_data_t *key, st_data_t *value, st_data_t arg, int existi
         return ST_STOP;
     }
 
-    if (STR_SHARED_P(str)) {
+    if (STR_SHARED_P(str) || RBASIC_CLASS(str) != rb_cString) {
         /* str should not be shared */
         str = rb_enc_str_new(RSTRING_PTR(str), RSTRING_LEN(str), STR_ENC_GET(str));
         OBJ_FREEZE(str);
@@ -8278,6 +8278,20 @@ str_scrub_bang(int argc, VALUE *argv, VALUE str)
     return str;
 }
 
+/*
+ *  call-seq:
+ *     str.dedup -> str
+ *
+ *  Returns a frozen version of this string. If possible, an existing
+ *  object with the same value will be returned.
+ */
+
+static VALUE
+str_dedup(VALUE self)
+{
+    return rb_fstring(self);
+}
+
 /*
  **********************************************************************
  *  Document-class: Symbol
  *
@@ -8768,6 +8782,7 @@ Init_String(void)
     rb_define_method(rb_cString, "scrub", str_scrub, -1);
     rb_define_method(rb_cString, "scrub!", str_scrub_bang, -1);
     rb_define_method(rb_cString, "freeze", rb_obj_freeze, 0);
+    rb_define_method(rb_cString, "dedup", str_dedup, 0);
 
     rb_define_method(rb_cString, "to_i", rb_str_to_i, -1);
     rb_define_method(rb_cString, "to_f", rb_str_to_f, 0);

diff --git a/test/ruby/test_string.rb b/test/ruby/test_string.rb
index 7ce1c06..d8c414b 100644
--- a/test/ruby/test_string.rb
+++ b/test/ruby/test_string.rb
@@ -600,6 +600,13 @@ class TestString < Test::Unit::TestCase
     end
   end
 
+  def test_dedup
+    fstr = "foobar".freeze
+
+    assert_same fstr, S("foobar").dedup
+    assert_same fstr, S("foobar").dup.dedup
+  end
+
   def test_each
     save = $/
     $/ = "\n"
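
What the patch is meant to guarantee can be modelled roughly in pure Ruby. This is a sketch, not the C implementation: FSTRING_TABLE, fstring and the subclass S are hypothetical names. Shared string buffers have no Ruby-level equivalent, so the sketch copies unfrozen strings instead; the subclass check mirrors the RBASIC_CLASS(str) != rb_cString condition added in the first hunk.

```ruby
# Rough pure-Ruby model of an fstring-style table (an assumption-laden
# sketch, not MRI's actual data structure).
FSTRING_TABLE = {}

def fstring(str)
  existing = FSTRING_TABLE[str]
  return existing if existing

  # A string that cannot be stored as-is (unfrozen, or an instance of a
  # String subclass) is copied into a new plain String and frozen first.
  unless str.frozen? && str.instance_of?(String)
    str = String.new(str).freeze
  end
  FSTRING_TABLE[str] = str
end

# Hypothetical String subclass, used to exercise the class check.
class S < String; end

fstr = fstring("foobar")
same_from_subclass = fstring(S.new("foobar")).equal?(fstr)
same_from_dup      = fstring(S.new("foobar").dup).equal?(fstr)
stored_as_plain    = fstring(S.new("other")).instance_of?(String)
```

In this model a subclass instance never ends up in the table itself; lookups by content still find the existing plain String, which is the behaviour the two assert_same lines in test_dedup check for.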

Related issues

Duplicates ruby-trunk - Feature #8977: String#frozen that takes advantage of the deduping Assigned 10/02/2013

History

#1 Updated by Shyouhei Urabe 4 months ago

It sounds a bit too rash. For instance, other Ruby implementations might not need such a thing.

I prefer that kind of optimization to happen automatically when needed, rather than forcing the programmer to cast a spell.

#2 Updated by Matthew Kerwin 4 months ago

Would this be better as MRI's implementation of String#freeze ?

#3 Updated by Aman Gupta 4 months ago

I prefer that kind of optimization to happen automatically when needed, rather than forcing the programmer to cast a spell.

I agree, but I'm not sure how MRI can perform this optimization automatically. For instance, in the rubygems case:

Gem::Specification.new do |s|
  s.license = "MIT"
end

Above, there is no way for the VM to know that the string "MIT" should be immutable. The VM must expect that the user might try to modify its value, and so it will always return a duplicated copy of the underlying frozen string.

For this case, the user can write "MIT".freeze. But since this only works on string literals, it cannot be used to perform de-duplication on strings loaded from json/yaml files or other external sources.
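
A small sketch of that difference, assuming MRI 2.1 or later (where "literal".freeze is de-duplicated at compile time). String.new stands in here for values that arrive from JSON/YAML or another external source; the variable names are made up for the example.

```ruby
# Two evaluations of the same frozen literal yield one shared object.
lit_a = "MIT".freeze
lit_b = "MIT".freeze
literals_shared = lit_a.equal?(lit_b)        # true on MRI 2.1+

# Strings built at runtime are separate objects, and freezing them
# returns each receiver unchanged rather than merging them.
loaded_a = String.new("MIT").freeze
loaded_b = String.new("MIT").freeze
loaded_shared = loaded_a.equal?(loaded_b)    # false

puts "literals shared: #{literals_shared}, loaded shared: #{loaded_shared}"
```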

Would this be better as MRI's implementation of String#freeze ?

This was discussed in #8992 and I think there was some resistance to #freeze returning a new object.

#4 Updated by Aman Gupta 4 months ago

  • Status changed from Open to Closed

This is a dupe of #8977. The proposal there is to use String#frozen, which I like better as well.

#5 Updated by Matthew Kerwin 4 months ago

tmm1 (Aman Gupta) wrote:

Would this be better as MRI's implementation of String#freeze ?

This was discussed in #8992 and I think there was some resistance to #freeze returning a new object.

We could always introduce #freeze! which is guaranteed to return the same object. Any old code that relies on receiving the same object reference would have to be updated, but I'm okay with that (not that it's up to me.)
