Feature #13077
[PATCH] introduce String#fstring method
Status: Closed
Description
introduce String#fstring method
This exposes the rb_fstring internal function to return a
deduped and frozen string. This is useful for writing all sorts
of record processing where key/value pairs may be stored,
but certain keys and values are often duplicated at a high frequency,
so memory savings can be noticeable.
Use cases are many:

- email/NNTP header processing

  There are some standard header keys everybody uses
  (From/To/Cc/Date/Subject/Received/Message-ID/References/In-Reply-To),
  as well as common ones specific to certain lists
  (ruby-core has X-Redmine-* headers).
  It is also useful to dedupe values, as most inboxes have
  multiple messages from the same sender, or MUA.

- package management systems

  things like RubyGems store identical strings for licenses,
  dependency names, author names/emails, etc.

- HTTP headers/trailers

  standard headers (Host/Accept/Accept-Encoding/User-Agent/...)
  are common, but there are also uncommon ones.
  Values may be deduped, as well, as it is likely a user
  agent will make multiple/parallel requests to the same
  server.

- version control systems

  this can be useful for deduplicating names of frequent
  committers (like "nobu" :)
  In linux.git and git.git, there are also common
  trailers such as Signed-Off-By/Acked-by/Reviewed-by/Fixes/...
  as well as less common ones.

- audio metadata

  There are commonly used tags (Artist/Album/Title/Tracknumber),
  but Vorbis comments allow arbitrary key values to be stored.
  Music collections contain songs by the same artist or multiple
  songs from the same album, so deduplicating values will be
  helpful there, too.

- JSON, YAML, XML, HTML processing

  certain fields, tags and attributes are commonly used
  across the same and multiple documents
Updated by shevegen (Robert A. Heiler) about 8 years ago
I have no particular pro or con opinion on the proposal in itself so feel free to ignore this.
The only comment I have is that the name .fstring() is a bit strange. On first read, I
assumed that it was short for "format_string" like % on class String or sprintf.
In the proposal I read that it is for frozen_string e. g. rb_fstring. While I don't have
anything against the functionality, and I also don't fully mind a method called
fstring(), I think that at the least a longer alias name to it would be nice to have too
such as frozen_string or something that is more readable on a first look. (I can't comment
on whether the functionality in itself is useful or not but I assume that Eric has had
a good reason which he described too, so I have no qualms at all with the functionality
in itself, only the method-name part.)
Updated by normalperson (Eric Wong) about 8 years ago
shevegen@gmail.com wrote:
The only comment I have is that the name .fstring() is a bit strange. On first read, I
assumed that it was short for "format_string" like % on class String or sprintf.
Yeah, the name isn't final, of course; naming is the hardest
problem in computer science :<
Maybe "dedup" is better and still short (I would expect the user
to know deduplication implicitly requires a frozen string)
Updated by Eregon (Benoit Daloze) about 8 years ago
So this is essentially like Java's String.intern()?
There is already String#intern in Ruby but it returns a Symbol.
Depending on the use-case, I guess this might be less convenient than getting a de-duplicated String.
String#dedup or sounds better than #fstring.
Updated by normalperson (Eric Wong) about 8 years ago
eregontp@gmail.com wrote:
So this is essentially like Java's String.intern()?
There is already String#intern in Ruby but it returns a Symbol.
Depending on the use-case, I guess this might be less convenient than getting a de-duplicated String.
Yeah, I considered using intern/to_sym for my use case;
but the problem is that it still creates a new string object
whenever it needs to be written/printed/concatenated.
And I also feel using symbol like this is ugly (just a gut
feeling), despite having GC-able symbols since 2.2.
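The point about symbols can be illustrated with a small sketch. It assumes the Ruby versions contemporary with this thread, where each Symbol#to_s call allocates a fresh String; the header name "Subject" is only an illustrative value, and later Ruby releases may behave differently.

```ruby
# Interning gives one Symbol, but every conversion back to a String
# (for writing, printing, or concatenating) may allocate a new object.
sym = "Subject".to_sym

copies = Array.new(3) { sym.to_s }

# Print each copy with its object id so the allocations can be compared.
copies.each { |s| puts "#{s} #{s.object_id}" }

# Number of distinct String objects produced by the three calls.
puts copies.map(&:object_id).uniq.size
```

The content of every copy is identical; only the object identity differs, which is the cost being described.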
String#dedup or sounds better than #fstring.
Yes. Let's wait for Matz to comment.
Updated by nobu (Nobuyoshi Nakada) about 8 years ago
Why not String#-@?
Updated by normalperson (Eric Wong) about 8 years ago
nobu@ruby-lang.org wrote:
Why not String#-@?
As in the following? (short patch, full below)
--- a/string.c
+++ b/string.c
@@ -10002,7 +9989,7 @@ Init_String(void)
     rb_define_method(rb_cString, "scrub!", str_scrub_bang, -1);
     rb_define_method(rb_cString, "freeze", rb_str_freeze, 0);
     rb_define_method(rb_cString, "+@", str_uplus, 0);
-    rb_define_method(rb_cString, "-@", str_uminus, 0);
+    rb_define_method(rb_cString, "-@", rb_fstring, 0);
     rb_define_method(rb_cString, "to_i", rb_str_to_i, -1);
     rb_define_method(rb_cString, "to_f", rb_str_to_f, 0);
Changing the behavior of an existing method might break compatibility;
but test-all and test-rubyspec seem to pass...
full: https://80x24.org/spew/20161228024937.9345-1-e@80x24.org/raw
Updated by matz (Yukihiro Matsumoto) almost 8 years ago
For the time being, let us make -@ to call rb_fstring.
If users want more descriptive name, let's discuss later.
In my opinion, fstring is not acceptable.
Matz.
Updated by normalperson (Eric Wong) almost 8 years ago
matz@ruby-lang.org wrote:
For the time being, let us make -@ to call rb_fstring.
If users want more descriptive name, let's discuss later.
In my opinion, fstring is not acceptable.
OK, I think the following is always backwards compatible,
unlike my previous [ruby-core:78884]:
--- a/string.c
+++ b/string.c
@@ -2530,7 +2530,7 @@ str_uminus(VALUE str)
         return str;
     }
     else {
-        return rb_str_freeze(rb_str_dup(str));
+        return rb_fstring(str);
     }
 }
Will commit in a day or two.
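A minimal sketch of what the patched `-@` allows, assuming Ruby 2.5 or later (where this change shipped). The `parse_headers` helper and the sample messages are hypothetical and exist only for illustration; they are not part of the patch.

```ruby
# Hypothetical helper: parse "Key: value" lines into a Hash, passing
# every key and value through String#-@ so that equal strings share
# one frozen object.
def parse_headers(text)
  text.each_line.each_with_object({}) do |line, headers|
    key, value = line.chomp.split(": ", 2)
    next if value.nil?        # skip lines that are not "Key: value"
    headers[-key] = -value    # frozen, deduplicated strings
  end
end

msg1 = parse_headers("From: a@example.com\nSubject: hello\n")
msg2 = parse_headers("From: a@example.com\nSubject: world\n")

# The "From" key and its value are the same objects in both hashes.
puts msg1.keys.first.equal?(msg2.keys.first)   # true on Ruby 2.5+
puts msg1["From"].equal?(msg2["From"])         # true on Ruby 2.5+
```

Without the unary minus, each parsed message would hold its own copy of "From" and "a@example.com".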
Updated by shyouhei (Shyouhei Urabe) almost 8 years ago
A bit of security consideration:
Am I correct that rb_vm_fstring_table() is never GCed? If so, feeding user-generated strings to that table needs extra care. Malicious user input might exhaust memory.
Updated by nobu (Nobuyoshi Nakada) almost 8 years ago
Shyouhei Urabe wrote:
Am I correct that rb_vm_fstring_table() is never GCed?
That table is not a GC-root, and registered strings get GCed as usual.
Updated by shyouhei (Shyouhei Urabe) almost 8 years ago
Nobuyoshi Nakada wrote:
Shyouhei Urabe wrote:
Am I correct that rb_vm_fstring_table() is never GCed?
That table is not a GC-root, and registered strings get GCed as usual.
So this is a kind of weak reference? No security concern then.
Updated by Anonymous almost 8 years ago
- Status changed from Open to Closed
Applied in changeset r57698.
string.c (str_uminus): deduplicate strings
This exposes the rb_fstring internal function to return a
deduped and frozen string when a non-frozen string is given.
This is useful for writing all sorts of record processing where
key/value pairs may be stored, but certain keys and values are often
duplicated at a high frequency, so memory savings can be
noticeable.
Use cases are many:

- email/NNTP header processing

  There are some standard header keys everybody uses
  (From/To/Cc/Date/Subject/Received/Message-ID/References/In-Reply-To),
  as well as common ones specific to certain lists
  (ruby-core has X-Redmine-* headers).
  It is also useful to dedupe values, as most inboxes have
  multiple messages from the same sender, or MUA.

- package management systems

  things like RubyGems store identical strings for licenses,
  dependency names, author names/emails, etc.

- HTTP headers/trailers

  standard headers (Host/Accept/Accept-Encoding/User-Agent/...)
  are common, but there are also uncommon ones.
  Values may be deduped, as well, as it is likely a user
  agent will make multiple/parallel requests to the same
  server.

- version control systems

  this can be useful for deduplicating names of frequent
  committers (like "nobu" :)
  In linux.git and git.git, there are also common
  trailers such as Signed-Off-By/Acked-by/Reviewed-by/Fixes/...
  as well as less common ones.

- audio metadata

  There are commonly used tags (Artist/Album/Title/Tracknumber),
  but Vorbis comments allow arbitrary key values to be stored.
  Music collections contain songs by the same artist or multiple
  songs from the same album, so deduplicating values will be
  helpful there, too.

- JSON, YAML, XML, HTML processing

  Certain fields, tags and attributes are commonly used
  across the same and multiple documents

There is no security concern in this being a DoS vector by
causing immortal strings. The fstring table is not a GC-root
and not walked during the mark phase. GC-able dynamic symbols
since Ruby 2.2 are handled in the same manner, and that
implementation also relies on the non-immortality of fstrings.
[Feature #13077] [ruby-core:79663]
Updated by normalperson (Eric Wong) almost 8 years ago
shyouhei@ruby-lang.org wrote:
Nobuyoshi Nakada wrote:
Shyouhei Urabe wrote:
Am I correct that rb_vm_fstring_table() is never GCed?
That table is not a GC-root, and registered strings get GCed as usual.
So this is a kind of weak reference? No security concern then.
Right. Also, keep in mind that dynamic GC-able symbols from
2.2+ also store symbol names as fstrings. Thus GC-able symbols
would not work if fstrings could not be GC-ed.
Anyways, committed as r57698
Updated by Eregon (Benoit Daloze) almost 8 years ago
Eric Wong wrote:
Anyways, committed as r57698
This should have a NEWS entry and tests since it changes the semantics.
BTW, should my_string.freeze behave similarly to String#-@?
Otherwise String#freeze only dedups if the String is a literal.
Always deduping for String#freeze would make the semantics more consistent.
Updated by normalperson (Eric Wong) almost 8 years ago
eregontp@gmail.com wrote:
Eric Wong wrote:
Anyways, committed as r57698
This should have a NEWS entry and tests since it changes the semantics.
Sorry, I forgot; will do. Thanks for the reminder.
BTW, should my_string.freeze behave similarly to String#-@?
Otherwise String#freeze only dedups if the String is a literal.
Always deduping for String#freeze would make the semantics more consistent.
No. There is existing code which assumes #freeze always returns
the same object as its caller. Changing #freeze will break
existing code.
We can only cheat with String literals (opt_str_freeze) because
literals are not assigned to user-visible variables, yet.
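The difference in object identity can be sketched as follows, assuming Ruby 2.5 or later. The variable names and the "Reviewed-by" value are illustrative only.

```ruby
# String#freeze freezes the receiver in place and returns that same
# object, which existing code may rely on.
s = "Reviewed-by".dup          # .dup yields a mutable, non-literal string
same = s.freeze
puts same.equal?(s)            # true: the receiver itself was frozen

# String#-@ on a mutable string returns a frozen, possibly pre-existing
# string and leaves the receiver untouched.
t = "Reviewed-by".dup
d = -t
puts d.frozen?                 # true
puts t.frozen?                 # false: t can still be modified
puts d.equal?(t)               # false: a different object was returned
```

Code that calls `str.freeze` and then keeps using `str` would silently end up with an unfrozen string if #freeze returned a deduplicated object instead of its receiver.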
Updated by Eregon (Benoit Daloze) almost 8 years ago
Eric Wong wrote:
No. There is existing code which assumes #freeze always returns
the same object as its caller. Changing #freeze will break
existing code.
We can only cheat with String literals (opt_str_freeze) because
literals are not assigned to user-visible variables, yet.
Oh indeed, that slipped my mind, thanks for the explanation.
Updated by naruse (Yui NARUSE) over 4 years ago
- Has duplicate Feature #17147: New method to get frozen strings from String objects added