Feature #5120
Closed
String#split needs to be logical
Description
Here are examples showing a surprising and inconsistent behavior of String#split method:
"aa".split('a') # => []
"aab".split('a') # => ["", "", "b"]
"aaa".split('aa') # => ["", "a"]
"aaaa".split('aa') # => []
"aaaaa".split('aa') # => ["", "", "a"]
"".split('') # => []
"a".split('') # => ["a"]
What is the definition of split? I would suggest something like this:
str1.split(str2) returns the maximal array of non-empty substrings of str1 that can be concatenated with copies of str2 to form str1.
An addition can be made to this definition to make clear what to expect as the result of "baaab".split("aa").
Updated by mame (Yusuke Endoh) over 13 years ago
Hello,
2011/7/31 Alexey Muranov muranov@math.univ-toulouse.fr:
 str1.split(str2) returns a maximal array of non-empty substrings of str1 which can be concatenated with copies of str2 to form str1.
So, what does "aab".split('a') return?
I think that only ["aab"] meets the condition.
But it is also surprising to me.
--
Yusuke Endoh mame@tsg.ne.jp
Updated by mame (Yusuke Endoh) over 13 years ago
2011/7/31 Yusuke ENDOH mame@tsg.ne.jp:
Hello,
2011/7/31 Alexey Muranov muranov@math.univ-toulouse.fr:
 str1.split(str2) returns a maximal array of non-empty substrings of str1 which can be concatenated with copies of str2 to form str1.
So, what does "aab".split('a') return?
I think that only ["aab"] meets the condition.
But it is also surprising to me.
Oops. It should be ["a", "b"]. But that is very difficult behavior (to me).
--
Yusuke Endoh mame@tsg.ne.jp
Updated by Anonymous over 13 years ago
On Jul 30, 2011, at 12:12 PM, Alexey Muranov wrote:
Here are examples showing a surprising and inconsistent behavior of String#split method:
With only one argument, split discards trailing empty fields.
"aa".split('a') # => []
<empty><a><empty><a><empty> => three empty fields all trailing and discarded
"aab".split('a') # => ["", "", "b"]
<empty><a><empty><a><b> => three fields with no empty trailing fields
"aaa".split('aa') # => ["", "a"]
<empty><separator><a> => two fields with no empty trailing field
"aaaa".split('aa') # => []
<empty><aa><empty><aa><empty> => all empty fields, all discarded
"aaaaa".split('aa') # => ["", "", "a"]
<empty><aa><empty><aa><a> => three fields
"".split('') # => []
"a".split('') # => ["a"]
A zero-length match results in all characters being returned in an array:
"ab".split('') #=> ['a', 'b']
What is the definition of split?
The String#split documentation clearly states that empty trailing fields are discarded and the special case of a zero-length match.
Gary Wright
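The rule described above can be checked directly. The sketch below is only illustrative: `split_both_ways` is a made-up helper name, not part of Ruby, and it simply pairs the default call with a call using a negative limit, which keeps trailing empty fields.

```ruby
# Hypothetical helper for illustration; not part of Ruby's String API.
# Returns the default split next to the split that keeps trailing empty fields.
def split_both_ways(str, sep)
  {
    default:  str.split(sep),     # trailing empty fields are discarded
    keep_all: str.split(sep, -1)  # a negative limit keeps every field
  }
end

p split_both_ways("aa", "a")     # default: [],            keep_all: ["", "", ""]
p split_both_ways("aab", "a")    # default: ["", "", "b"], keep_all: ["", "", "b"]
p split_both_ways("aaaa", "aa")  # default: [],            keep_all: ["", "", ""]
```

In each case the two results differ only by the empty strings at the end of the array, which is the whole of the "inconsistency" in the original examples.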
Updated by alexeymuranov (Alexey Muranov) over 13 years ago
Thank you for the explanation.
Yes, Yusuke, my definition was wrong, i thought about "a maximal array of non-empty strings whose members don't contain any matching substrings".
Gary, I didn't read the documentation carefully, so i didn't know about the discarding of the trailing empty fields, this is why the result looked illogical to me (why not the beginning empty fields?)
I think str1.split(str2, -1) produces what i would have expected.
I will think more about it, thanks.
Alexey.
Updated by alexeymuranov (Alexey Muranov) over 13 years ago
It is still not very clear why "a".split('',-1) # => ["a", ""] and not ["", "a", ""] or ["a"]
(and why "".split('',-1) # => []).
Alexey.
Updated by duerst (Martin Dürst) over 13 years ago
On 2011/07/31 1:30, Yusuke ENDOH wrote:
So, what does "aab".split('a') return?
I think that only ["aab"] meets the condition.
But it is also surprising to me.
It's much easier to think about if you replace 'a' with ',' (and if
necessary, think about a format such as CSV).
So what does ",,b".split(',') return?
=> ['', '', 'b']
Regards, Martin.
Updated by alexeymuranov (Alexey Muranov) over 13 years ago
I understand why
",,1".split(',') # => ["","","1"]
and why
".5".split('.') # => ["","5"]
But then ",1,".split(',') should return ["", "1", ""].
It is not clear why one needs to do it like that:
",1,".split(',',-1) # => ["", "1", ""]
The decision to discard trailing empty elements seems random (maybe targeted at processing particular programming languages or application input where optional parameters are placed at the end?).
However, splitting on the empty string does not make sense (cannot be made consistent with these examples).
In my opinion, splitting on the empty string should be forbidden.
To obtain the array of letters (in the given encoding) it would be more logical to introduce a #letters method or use #split without parameters.
Current implementation of split('') seems inconsistent with the rest: why
"ab".split('') # => ["a", "b"] and not ["", "a", "b"] or ["", "a", "", "b"] ?
"ab".split('',-1) # => ["a", "b", ""] and not ["", "a", "b", ""] ?
Does splitting on the empty string work this way because this is how the general implementation works if fed with the empty string, or is it implemented as a separate case?
In the latter case, it is not a good solution.
Alexey.
Updated by kirillrdy (Kirill Radzikhovskyy) over 13 years ago
Hi,
I also find this behavior confusing
mainly because:
ruby-1.9.2-p290 :001 > 'a,b,,'.split ','
=> ["a", "b"]
ruby-1.9.2-p290 :002 > ',,a,b'.split ','
=> ["", "", "a", "b"]
Updated by Anonymous over 13 years ago
On Aug 1, 2011, at 12:37 AM, Kirill Radzikhovskyy wrote:
I also find this behavior confusing
mainly because:
ruby-1.9.2-p290 :001 > 'a,b,,'.split ','
Updated by alexeymuranov (Alexey Muranov) about 13 years ago
I would like to summarize my feature request:
- trailing empty fields should not be discarded
(it would make sense, however, to have a similar method which splits and discards initial and trailing empty fields, and returns as the first element of the array the number of initial empty fields discarded, and possibly as the last element of the array the number of trailing empty fields discarded)
- a separate method for getting the array of letters (#letters?) should be implemented, split on the empty string should raise an error
(or otherwise it should always return the empty string as the first and the last elements: "a".split("") # => ["", "a", ""] by analogy with "a".split("a") # => ["", ""] in the proposed implementation, but this is not very practical).
Update: how about String#to_a?
This is my opinion, please comment.
Alexey.
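For concreteness, the "similar method" from the first point could look roughly like the sketch below. Everything here is an assumption made for illustration: the name `split_with_counts`, the exact return shape, and the choice to count all fields as leading when every field is empty. Nothing like it exists in Ruby.

```ruby
# Hypothetical method sketching the proposal above; the name and the exact
# return shape are assumptions, not anything Ruby provides.
def split_with_counts(str, sep)
  fields  = str.split(sep, -1)                 # keep every field, empty or not
  leading = fields.take_while(&:empty?).size   # empty fields at the start
  # Assumption: if every field is empty, count them all as leading.
  return [leading, 0] if leading == fields.size

  trailing = fields.reverse.take_while(&:empty?).size
  [leading, *fields[leading...(fields.size - trailing)], trailing]
end

p split_with_counts(",1,", ",")    # => [1, "1", 1]
p split_with_counts("a,b,,", ",")  # => [0, "a", "b", 2]
p split_with_counts(",,a,b", ",")  # => [2, "a", "b", 0]
```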
Updated by aprescott (Adam Prescott) about 13 years ago
On Sun, Sep 11, 2011 at 1:49 PM, Alexey Muranov
muranov@math.univ-toulouse.fr wrote:
- a separate method for getting the array of letters (#letters?) should be implemented, split on the empty string should raise an error
(or otherwise it should always return the empty string as the first and the last elements: "a".split("") # => ["", "a", ""] by analogy with "a".split("a") # => ["", ""] in the proposed implementation, but this is not very practical).
str.split("") already gets you the array of "letters" (as does
str.chars.to_a), but since you feel that str.split("") should raise an
error or have another return value, do you think str.split("") should
break existing code which uses split("") to get characters?
What's the reasoning behind str.split("") raising an error? I can't
see a good reason for it. Equally, I can see no good reason for
treating "a".split("") the same in return value as "a".split("a"). In
the latter, there is more to be considered because the receiver itself
contains "a". Why should "a".split("") return ["", "a", ""]?
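A small check of the equivalence mentioned above, written as a sketch: `chars_two_ways` is a hypothetical helper name used only for this illustration.

```ruby
# Hypothetical helper for illustration: returns the characters of a string
# obtained in the two ways mentioned above.
def chars_two_ways(str)
  [str.split(""), str.chars.to_a]
end

via_split, via_chars = chars_two_ways("ab")
p via_split            # => ["a", "b"]
p via_chars            # => ["a", "b"]
p chars_two_ways("")   # => [[], []]
```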
Updated by alexeymuranov (Alexey Muranov) about 13 years ago
Adam Prescott wrote:
str.split("") already gets you the array of "letters" (as does
str.chars.to_a), but since you feel that str.split("") should raise an
error or have another return value, do you think str.split("") should
break existing code which uses split("") to get characters?
Thanks for pointing out str.chars.to_a, but i think that it would be more convenient to have a single method that would do this. I understand that this would break existing code, i was discussing the issue from the point of view of "improving" the language, according to what would look like an improvement to me. As a person new to Ruby, i expressed my "astonishment" at the current behavior of #split, and tried to contribute to POLA.
What's the reasoning behind str.split("") raising an error? I can't
see a good reason for it. Equally, I can see no good reason for
treating "a".split("") the same in return value as "a".split("a"). In
the latter, there is more to be considered because the receiver itself
contains "a". Why should "a".split("") return ["", "a", ""]?
I think that #split should treat all strings equally, whether empty or not. Maybe i've missed something (then please point me to the explanation), but i do not see how the treatment of empty and non-empty strings can be particular cases of a general rule. What is the general rule, which gives such different results for empty and non-empty strings?
I think that "a".split("") should return ["", "a", ""], because this would be more logical than when "a".split("",-1) returns ["a", ""], as it does now.
I think that in most other cases #split(str) should behave as #split(str,-1) behaves now, because the decision to discard trailing empty elements seems arbitrary.
By analogy with ",".split(",",-1) currently returning ["",""], i think that:
",".split(",") should return ["",""],
"".split("") should return ["",""] (if not forbidden altogether),
",".split("") should return ["", ",", ""] (if not forbidden).
But, as i said, this is only a suggestion to preserve consistency: use the same rule and same code to split on empty and non-empty strings. (What is the rule now?) I think that if #split treats the empty string "" separately from all other string arguments, then #split("") should be made into a separate method, or be made clearly distinguished by the number or type of arguments, for example: #split() or #split(:by_chars), or #split(:how => :by_chars).
The easiest way to be consistent, in my opinion, is to forbid splitting on the empty string and to use a different method to get the array of letters. How about String#to_a to return the array of letters (in addition to .chars.to_a)? This would be consistent with the String#[] method in Ruby 1.9.
Last edited 2011-11-21
Updated by alexeymuranov (Alexey Muranov) over 12 years ago
I would like to add a use case which may be not very useful, but in my opinion illustrates the issue well.
I wanted to normalize some mistyped email addresses in a database and i did it like this (because i forgot about this issue):
email.split('@').map(&:strip).join('@')
This works for complete email addresses, but when i ran into
' sam @ '.split('@').map(&:strip).join('@') # => 'sam@'
but
'jim@'.split('@').map(&:strip).join('@') # => 'jim'
i decided to change to
email.split('@', -1).map(&:strip).join('@')
which does not make it more readable (sarcasm :-)).
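Wrapped up as a method, the corrected version might look like the sketch below. The name `normalize_email` is hypothetical, and the method only trims whitespace around each part; it makes no attempt to validate the address.

```ruby
# Hypothetical helper based on the one-liner above. The -1 limit keeps the
# empty field after a trailing '@', so the '@' survives the round trip.
def normalize_email(email)
  email.split('@', -1).map(&:strip).join('@')
end

p normalize_email(' sam @ ')           # => "sam@"
p normalize_email('jim@')              # => "jim@"
p normalize_email(' ann @ example ')   # => "ann@example"
```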
Updated by drbrain (Eric Hodel) over 12 years ago
Tell split you want to keep the @ if you want to keep the @:
[' sam @ ', 'jim@'].map { |e| e.split(/(@)/).map(&:strip).join } # => ["sam@", "jim@"]
Updated by alexeymuranov (Alexey Muranov) over 12 years ago
Well, i didn't really want to keep '@', splitting on it and then joining with it would be fine :).
Thanks anyway.
Updated by mame (Yusuke Endoh) over 12 years ago
- Status changed from Open to Assigned
- Assignee set to matz (Yukihiro Matsumoto)
- Target version set to 3.0
Updated by matz (Yukihiro Matsumoto) over 6 years ago
- Status changed from Assigned to Rejected
At least some of your points are rational. Those behaviors are inherited from Perl.
I don't think we can change the behavior. We are not going to break existing code for the sake of consistency.
Matz.