Feature #15899
String#before and String#after
Added by kke (Kimmo Lehto) over 5 years ago. Updated about 5 years ago.
Description
There seems to be no methods for getting a substring before or after a marker.
Too often I see and have to resort to variations of:
str[/(.+?);/, 1]
str.split(';').first
substr, _ = str.split(';', 2)
str.sub(/.*;/, '')
str[0...str.index(';')]
These create intermediate objects and/or are ugly.
String#delete_suffix and String#delete_prefix do not accept regexps, so they can only be used if you first figure out the full prefix or suffix.
For this reason, I suggest something like:
> str = 'application/json; charset=utf-8'
> str.before(';')
=> "application/json"
> str.after(';')
=> " charset=utf-8"
What should happen if the marker isn't found? In my opinion, before should return the full string and after an empty string.
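The proposed semantics can be sketched as a small monkey patch. This is only an illustration of the behaviour described above, not the implementation from the attached patch; it assumes the marker may be a String or a Regexp and that only the first match matters.

```ruby
# Hypothetical sketch of the proposed methods (not the code from the attached patch).
class String
  # Part of the string before the first match of marker (String or Regexp).
  # Returns a copy of the whole string when the marker is not found.
  def before(marker)
    pos = index(marker)
    pos ? self[0, pos] : dup
  end

  # Part of the string after the first match of marker.
  # Returns an empty string when the marker is not found.
  def after(marker)
    pos = index(marker)
    return "" unless pos
    # String#index sets the last match for a Regexp marker, so the matched
    # length is available; for a String marker it is simply marker.length.
    len = marker.is_a?(Regexp) ? Regexp.last_match[0].length : marker.length
    self[(pos + len)..-1]
  end
end

str = 'application/json; charset=utf-8'
mime   = str.before(';')    # "application/json"
params = str.after(/;\s*/)  # "charset=utf-8"
```

Because both methods rely on String#index, each call allocates at most one new string.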
Files
test.rb (712 Bytes) | edd314159 (Edd Morgan), 07/09/2019 06:33 PM
test_mem.rb (326 Bytes) | edd314159 (Edd Morgan), 07/09/2019 06:33 PM
2269.diff (3.77 KB) | edd314159 (Edd Morgan), 07/09/2019 06:33 PM
Updated by sawa (Tsuyoshi Sawada) over 5 years ago
Since you mention that String#delete_suffix and String#delete_prefix do not accept regexps and that this is a weak point, you should use regexps in the examples illustrating your proposal.
Updated by sawa (Tsuyoshi Sawada) over 5 years ago
Using partition looks reasonable, and it can accept regexes.
str = 'application/json; charset=utf-8'
before, _, after = str.partition(/; /)
before # => "application/json"
after # => "charset=utf-8"
Updated by shevegen (Robert A. Heiler) over 5 years ago
I can see where it may be useful, since it could shorten code like this:
first_part = "hello world!".split(' ').first
To:
first_part = "hello world!".before(' ')
It is not a huge improvement in my opinion, though. (My comment here has
not yet addressed the other part about using regexes - see a bit later for
that.)
I am not a big fan of the names, though. I somehow associate #before and #after
more with time-based operations; and rack/sinatra middleware (route) filters.
I do not have a better or alternative suggestion, although since we already have
delete_prefix, perhaps we could have some methods that return the desired prefix
instead (or suffix).
As for lack of regex support, I think sawa already pointed out that it may be
better to reason for changing delete_prefix and delete_suffix instead. That way
your demonstrated use case could be simplified as well.
Updated by kke (Kimmo Lehto) over 5 years ago
Using partition looks reasonable, and it can accept regexes.
It also has the problem of creating extra objects that you need to discard with _
or assign and just leave unused.
I am not a big fan of the names, though. I somehow associate #before and #after
more with time-based operations; and rack/sinatra middleware (route) filters.
How about str.preceding(';') and str.following(';')?
Perhaps str.prior_to(';') and str.behind(';')?
Possibility of opposite reading direction can make these problematic.
str.left_from(';'), str.right_from(';')? Sounds a bit clunky.
Head and tail could be the unixy choice and more versatile for other use cases.
class String
  def head(count = 10, separator = "\n")
    ...
  end

  def tail(count = 10, separator = "\n")
    ...
  end
end
For my example use case, it would become:
str = "application/json; charset=utf-8"
mime = str.head(1, ';')
labels = str.tail(1, ';')
And to emulate something like $ curl xttp://x.example.com | head you would use response.body.head
Updated by kke (Kimmo Lehto) over 5 years ago
How about first and last?
'hello world'.first(2)
=> 'he'
'hello world'.last(2)
=> 'ld'
'hello world'.first
=> 'h'
'hello world'.last
=> 'd'
'hello world'.first(1, ' ')
=> 'hello'
'hello world'.last(1, ' ')
=> 'world'
'application/json; charset=utf-8'.first(1, ';')
=> 'application/json'
Updated by marcandre (Marc-Andre Lafortune) over 5 years ago
sawa is right. Just use partition and rpartition.
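For reference, a short illustration of how partition and rpartition behave, including the case where the separator is missing. The sample strings are made up for this example.

```ruby
str = "a=b=c"

# partition splits at the first occurrence of the separator.
first_split = str.partition("=")    # ["a", "=", "b=c"]

# rpartition splits at the last occurrence.
last_split = str.rpartition("=")    # ["a=b", "=", "c"]

# When the separator is absent, the whole string ends up in the first
# slot for partition and in the last slot for rpartition.
no_match_left  = "abc".partition("=")   # ["abc", "", ""]
no_match_right = "abc".rpartition("=")  # ["", "", "abc"]

# Both methods also accept a Regexp.
regex_split = "application/json; charset=utf-8".partition(/;\s*/)
# ["application/json", "; ", "charset=utf-8"]
```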
Updated by edd314159 (Edd Morgan) over 5 years ago
- File test_mem.rb added
- File test.rb added
- File 2269.diff added
I'd like to add my +1 to this idea. Splitting a string by a substring (and only caring about the first result) is a use case I run into all the time. In fact, the example given by @kke (Kimmo Lehto) of splitting a Content-Type HTTP header by the semicolon is the one I needed it for most recently.
It's true, partition and rpartition can absolutely achieve the same thing. But they have the side effect of returning (and, of course, allocating) extra String objects that are frequently discarded. This not only negatively impacts performance, but results in less readable code: we have to resort to the convention of prefixing the throwaway variable name with an underscore. This underscore is a convention agreed upon, informally, by humans to indicate the irrelevance of the variable, and I'm sure many Ruby programmers are unaware of the convention, or simply forget about it.
I have suggested an implementation in PR #2269 on GitHub: https://github.com/ruby/ruby/pull/2269
I also attach the following benchmark to show that when these new methods are used for this use case, performance is ~30% improved for splitting by a String (and more so when splitting by Regex):
eddmorgan@eddbook ~/Projects/rubydev/build → make run
../ruby/revision.h unchanged
./miniruby -I../ruby/lib -I. -I.ext/common ../ruby/test.rb
user system total real
String#before 0.182367 0.000587 0.182954 ( 0.183625)
String#partition 0.303105 0.000877 0.303982 ( 0.304961)
user system total real
String#after 0.199295 0.000672 0.199967 ( 0.200794)
String#partition 0.302300 0.001409 0.303709 ( 0.305278)
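The allocation difference can be checked without the patch by comparing partition against an index-based slice, which is roughly what a dedicated method could do. This is a rough sketch and not the attached test_mem.rb; it assumes MRI, where GC.stat(:total_allocated_objects) is available, and the helper name allocations is made up for the example.

```ruby
# Frozen constants so that the inputs themselves are not re-allocated per call.
HEADER = 'application/json; charset=utf-8'.freeze
MARKER = ';'.freeze

# Hypothetical helper: number of objects allocated while the block runs (MRI only).
def allocations
  start = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - start
end

n = 10_000

# partition builds an array plus the substrings on every call.
via_partition = allocations do
  n.times { HEADER.partition(MARKER).first }
end

# index + slice allocates only the one substring that is actually wanted.
via_index = allocations do
  n.times do
    pos = HEADER.index(MARKER)
    pos ? HEADER[0, pos] : HEADER
  end
end

puts "partition: #{via_partition} objects, index + slice: #{via_index} objects"
```

Exact counts depend on the Ruby version, so only the relative difference is meaningful.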
Updated by jonathanhefner (Jonathan Hefner) about 5 years ago
I use monkey-patched versions of these in many of my Ruby scripts. They have a few benefits vs. the alternatives:
- vs. split + first / last
  - using split can cause an unintended result when the delimiter is not present, e.g. "abc".split("x", 2).last == "abc"
- vs. partition
  - before and after can be chained, and can result in fewer object allocations
- vs. regex + capture group
  - before and after are easier to read (and write)
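The split pitfall from the first point is easy to reproduce with core methods only. The inputs below are made up for illustration.

```ruby
# Delimiter present: split and partition agree on the tail.
split_tail     = "a;b".split(";", 2).last    # "b"
partition_tail = "a;b".partition(";").last   # "b"

# Delimiter absent: split hands back the whole string as the "last" part,
# while partition reports an empty tail.
split_missing     = "abc".split(";", 2).last    # "abc"
partition_missing = "abc".partition(";").last   # ""

# An empty input is another edge case: split returns an empty array,
# so first (or last) is nil rather than a string.
split_empty     = "".split(";").first       # nil
partition_empty = "".partition(";").first   # ""
```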
I've also found before_last and after_last helpful for similar reasons.
kke (Kimmo Lehto) wrote:
What should happen if the marker isn't found? In my opinion, before should return the full string and after an empty string.
Regarding before, I agree.
Regarding after, I originally wrote my monkey-patched after to return an empty string, but eventually changed it to return nil. I was hesitant because a nil result can be an unexpected "gotcha", but an empty string seems wrong because it throws away information. For example, if str.after("x") == "", it might be because the delimiter wasn't found, or because the delimiter was at the end of the string. (Compared to str.before("x") == str, which always means the delimiter wasn't found.)
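The ambiguity can be shown with a nil-returning variant. The sketch below is a standalone function rather than a String method, the name after_or_nil is made up for this example, and it assumes a non-empty String marker (a Regexp that matches the empty string would be misreported as "not found").

```ruby
# Hypothetical helper: text after the first occurrence of marker,
# or nil when the marker does not occur at all.
def after_or_nil(str, marker)
  _head, sep, rest = str.partition(marker)
  # partition yields an empty separator slot when nothing matched.
  sep.empty? ? nil : rest
end

at_end    = after_or_nil("abcx", "x")  # ""   (marker found at the very end)
not_found = after_or_nil("abc", "x")   # nil  (marker absent)
in_middle = after_or_nil("axb", "x")   # "b"
```

With an empty-string return value, the first two cases would be indistinguishable; returning nil keeps them apart at the cost of a nil check at the call site.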