Project

General

Profile

Actions

Feature #14478

closed

String #uminus should de-dupe unconditionally

Added by sam.saffron (Sam Saffron) about 6 years ago. Updated almost 6 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:85562]

Description

continuing: https://bugs.ruby-lang.org/issues/14475

Current documentation for String uminus says: "If the string is frozen, then return the string itself."

Trouble is that there is no simple way to de-duplicate unconditionally without inefficiency:

Say x is an arbitrary string (either frozen or unfrozen) :

x = -x # may return a non fstring

x = -+x # will return fstring, but makes an unneeded copy

x = -x.dup # fstring again, uneeded copy 

x = x.frozen? ? -+x : -x # too verbose, uneeded copy

Instead why not change it so - is deduped unconditionally?

I would argue this is worth backporting, cause if we are making fstring optimisations now, we are going to be stuck with legacy inefficient code going forward.

An alternative may be a c-extension gem that adds #fstring to String but that just feel wrong.

I think the documentation should say:

String uminus says: "If the string is de-duplicated, then return the string itself."

Happy to make the change it is quite simple:(FL_TEST(str, RSTRING_FSTR)


Files


Related issues 1 (0 open1 closed)

Related to Ruby master - Bug #14475: String de-duplication is broken in files with frozen_string_literal: trueClosedActions

Updated by phluid61 (Matthew Kerwin) about 6 years ago

Why not make it impossible to create a frozen string that isn't deduped? That way the current -x works fine by default.

(I thought that was already the case..?)

Updated by sam.saffron (Sam Saffron) about 6 years ago

Why not make it impossible to create a frozen string that isn't deduped?

This can not be done without enormous breakage, it would break the semantics of freeze, was already discussed with Matz and decided against.

puts x.object_id
x.freeze
puts x.object_id # must not change 

Updated by phluid61 (Matthew Kerwin) about 6 years ago

sam.saffron (Sam Saffron) wrote:

puts x.object_id
x.freeze
puts x.object_id # must not change 

That makes sense, thanks. It seems there are two parts to the issue:

  1. we should expose deduplication in the language

  2. -@ should be one such exposure

I agree with 1. I also agree with 2, because I can't think of anything legitimate that depends on frozen duplicated string ids not changing.

Actions #4

Updated by duerst (Martin Dürst) about 6 years ago

  • Related to Bug #14475: String de-duplication is broken in files with frozen_string_literal: true added

Updated by sam.saffron (Sam Saffron) about 6 years ago

-@ should be one such exposure

Oh this is already the case, see the source:

/*
 * call-seq:
 *   -str  -> str (frozen)
 *
 * If the string is frozen, then return the string itself.
 *
 * If the string is not frozen, return a frozen, possibly pre-existing
 * copy of it.
 */
static VALUE
str_uminus(VALUE str)
{
    if (OBJ_FROZEN(str)) {
	return str;
    }
    else {
	return rb_fstring(str);
    }
}

I am proposing to change this to:

/*
 * call-seq:
 *   -str  -> str (frozen)
 *
 * If the string is de-duplicated, then return the string itself.
 *
 * If the string is not return a frozen, possibly pre-existing
 * copy of it.
 */
static VALUE
str_uminus(VALUE str)
{
    if (FL_TEST(str, RSTRING_FSTR)) {
	return str;
    }
    else {
	return rb_fstring(str);
    }
}

Updated by phluid61 (Matthew Kerwin) about 6 years ago

sam.saffron (Sam Saffron) wrote:

-@ should be one such exposure

Oh this is already the case, see the source:

That's an implementation detail. -@'s contract is about freezing, and says nothing about deduplication. What you're proposing is to change the contract, and specify the deduplication.

Updated by sam.saffron (Sam Saffron) about 6 years ago

The contract says "possibly pre-existing copy of it." ... where possibly means "if the string is not frozen"

So the contract now is "return a de-duplicated string if the string is not frozen" which is a bit of a crazy contract if you ask me, instead the contract should simply be "return a de-duplicated string always"

What I am asking for here is change str_uminus from:

"sometimes return a de-duplicated string"

To

"always return a de-duplicated string"

I feel the change is not risky at all and will not break anyone in practice, I also think it should be backported to 2.5 cause -+x is an ugly and horrible pattern for us to be promoting.

Also related:

https://bugs.ruby-lang.org/issues/11782

and

https://bugs.ruby-lang.org/issues/13077

Also to quote Matz:

For the time being, let us make -@ to call rb_fstring.
If users want more descriptive name, let's discuss later.
In my opinion, fstring is not acceptable.

So there you have it it seems Matz expected this not to be a sometimes thing.

Updated by ko1 (Koichi Sasada) about 6 years ago

Summary of this issue:

Background

We changed to return dedup-ed string with String-@.

https://bugs.ruby-lang.org/issues/13077#note-7

For the time being, let us make -@ to call rb_fstring.

Issue

However, it returns non-deduped string when self is frozen
because of OBJ_FREEZE(str) check (https://bugs.ruby-lang.org/issues/14478#note-5).

It is inconsistent. We can't use it for dedup functionality.

Proposal

We always need to return dedup-ed strings with String#-@


This proposal makes sense for me, at least I understand "inconsistency".

Updated by normalperson (Eric Wong) about 6 years ago

To reduce compatibility risk, how about we do heap sharing with
fstring for already-frozen strings when calling uminus?

It would deduplicate large strings which require separate
allocations; but won't reduce object count or help with small
strings, though.

Updated by sam.saffron (Sam Saffron) about 6 years ago

how about we do heap sharing

I am very against this, I just don't see the compatibility risk I think myself and maybe only another like 5 people know about the uminus change in 2.5 :) I am willing to bet a very large sum of money nobody is depending on object_id stability when calling uminus on a frozen string. This would be people who

  1. Do not depend on object_id stability when calling uminus on non frozen strings
  2. Somehow for some crazy reason depend on object_id stability when called on frozen strings

I also can not even think of an actual use case where you would depend on this inconsistent behavior

If anything I think we are not going far enough, I think uminus should unconditionally untaint, but that is a topic for another issue. Also I wish we would just remove taint and trust.

Updated by Eregon (Benoit Daloze) about 6 years ago

Since it's var = -str, it's very clear the return value should be used (and deduped) and we can't expect str to changed.
So if the only concern is returning str if str is frozen in -str, I agree that seems a very low compatibility risk.

Updated by shevegen (Robert A. Heiler) about 6 years ago

I don't have any pro or con opinion here but I wanted to comment on
one statement:

maybe only another like 5 people know about the uminus change in 2.5 :)

Actually I noticed it first when nobu was using it, several months ago. :)

Noticed it in some of the ruby changelog and the method was new to me.

I have not used uminus myself, mostly because I do not like the syntax,
even though I understand that it is easier than adding checks for
frozen strings and using .dup (which I actually am doing nowadays a
lot, even though I could use uminus instead, but I like the .frozen?
and .dup check, even though it is more code; I use more frozen strings
these days, mostly because I have noticed a significant speed improvement
in general; and for ruby scripts that do a lot of work, I actually try
to "optimize" these simply because it can make a big difference between
e. g. 30 seconds and 20 seconds execution run, for larger ruby
scripts/ruby projects).

Updated by matz (Yukihiro Matsumoto) about 6 years ago

I agree with de-duping. I also want the reference document to be updated.

Matz.

Updated by sam.saffron (Sam Saffron) about 6 years ago

I just tried the trivial patch of:

diff --git a/string.c b/string.c
index ebf5618..a9b991f 100644
--- a/string.c
+++ b/string.c
@@ -2605,20 +2605,16 @@ str_uplus(VALUE str)
  * call-seq:
  *   -str  -> str (frozen)
  *
- * If the string is frozen, then return the string itself.
+ * Return a frozen, possibly pre-existing
+ * copy of the string.
  *
- * If the string is not frozen, return a frozen, possibly pre-existing
- * copy of it.
+ * String will be deduplicated as long as it is not tainted,
+ * or has any instance vars set on it.
  */
 static VALUE
 str_uminus(VALUE str)
 {
-    if (OBJ_FROZEN(str)) {
-       return str;
-    }
-    else {
-       return rb_fstring(str);
-    }
+    return rb_fstring(str);
 }

But this seems to break the test/ruby/test_string.rb cause there is some edge case we are missing here.

Updated by normalperson (Eric Wong) about 6 years ago

wrote:

But this seems to break the test/ruby/test_string.rb cause
there is some edge case we are missing here.

I didn't see a failure in test_string.rb, but got one rubyspec failure:

1)
String#-@ is an identity function if the string is frozen FAILED
Expected "this string is frozen"
to be identical to "this string is frozen"

$srcdir/ruby/spec/ruby/core/string/uminus_spec.rb:38:in `block (4 levels) in <top (required)>'
$srcdir/ruby/spec/ruby/core/string/uminus_spec.rb:4:in `block in <top (required)>'
$srcdir/ruby/spec/ruby/core/string/uminus_spec.rb:3:in `<top (required)>'

I guess this requires a change to rubyspec:

diff --git a/spec/ruby/core/string/uminus_spec.rb b/spec/ruby/core/string/uminus_spec.rb
index f18e6b1234..9474666e30 100644
--- a/spec/ruby/core/string/uminus_spec.rb
+++ b/spec/ruby/core/string/uminus_spec.rb
@@ -35,11 +35,9 @@
it "is an identity function if the string is frozen" do
dynamic = %w(this string is frozen).join(' ').freeze

-        (-dynamic).should equal(dynamic)
-
dynamic.should_not equal("this string is frozen".freeze)
-        (-dynamic).should_not equal("this string is frozen".freeze)
-        (-dynamic).should_not equal(-"this string is frozen".freeze)
+        (-dynamic).should equal("this string is frozen".freeze)
+        (-dynamic).should equal(-"this string is frozen".freeze)
end
end
end

Disclaimer: I am not knowledgeable in *spec DSL/API so I'm just deleted
characters until the test passed :>

Updated by Eregon (Benoit Daloze) about 6 years ago

The spec reflects the current state but of course it's fine to change.
The change to specs looks good, except the description should be updated:

  • is an identity function if the string is frozen
  • deduplicates frozen strings

(We need to deal with older Ruby versions too in ruby/spec but I can deal with that for this case)

Updated by sam.saffron (Sam Saffron) almost 6 years ago

To me this patch looks good I have tested it previously on discourse, I really like the documentation there I think it is very clear.

Actions #19

Updated by normalperson (Eric Wong) almost 6 years ago

  • Status changed from Open to Closed

Applied in changeset trunk|r63566.


String#uminus dedupes unconditionally

[Feature #14478] [ruby-core:85669]

Thanks-to: Sam Saffron

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0