Feature #15771

Add `String#split` option to use split_type string when the separator is a single space

Added by 284km (kazuma furuhashi) about 1 month ago. Updated about 1 month ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:92301]

Description

In String#split, when the separator is a single space character, the split is executed as split_type: awk.
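
A short illustration of the awk-style behaviour described above, next to a Regexp that matches exactly one space. The sample strings are chosen only for illustration.

```ruby
s = " a  b   c    "

# A single-space String separator triggers awk-style splitting:
# leading whitespace is skipped and runs of whitespace act as one separator.
awk_result = s.split(" ")

# Tabs and newlines count as whitespace in this mode too.
mixed_result = "a\tb\nc".split(" ")

# A Regexp matching exactly one space splits at every space instead.
regexp_result = s.split(/ /)

p awk_result    # ["a", "b", "c"]
p mixed_result  # ["a", "b", "c"]
p regexp_result # ["", "a", "", "b", "", "", "c"]
```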

For example, the CSV library works around this as follows:
https://github.com/ruby/csv/blob/7ff57a50e81c368029fa9b664700bec4a456b81b/lib/csv/parser.rb#L508-L512

if @column_separator == " ".encode(@encoding)
  @split_column_separator = Regexp.new(@escaped_column_separator)
else
  @split_column_separator = @column_separator
end

Unfortunately, splitting with a regexp is slower than splitting with a string. For example,
in the following benchmark the regexp version is about 9 times slower.
https://github.com/284km/benchmarks_no_yatu#stringsplitstring-or-regexp

$ be benchmark-driver string_split_string-regexp.yml --rbenv '2.6.2'
Comparison:
              string:   3161117.6 i/s
              regexp:    344448.0 i/s - 9.18x  slower
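
For readers without benchmark-driver, a rough comparison can be sketched with plain Ruby timing. The input line, the "," separator, and the iteration count below are assumptions for illustration, not the contents of the linked YAML file, and the timings will vary by machine and Ruby version.

```ruby
# Hypothetical input and iteration count; not the data used in the linked benchmark.
line = "aaa,bbb,ccc,ddd,eee"
iterations = 100_000

# Runs the given block `iterations` times and returns the elapsed seconds.
def measure(iterations)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

string_time = measure(iterations) { line.split(",") }
regexp_time = measure(iterations) { line.split(/,/) }

printf("string: %.3fs\nregexp: %.3fs\n", string_time, regexp_time)
```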

So I would like to add a :literal option that makes split run as split_type: string even when the separator is a single space.

Implementation

This change will result in the following:

" a  b   c    ".split(" ")
=> ["a", "b", "c"]
" a  b   c    ".split(" ", -1)
=> ["a", "b", "c", ""]
" a  b   c    ".split(" ", literal: true)
=> ["", "a", "", "b", "", "", "c"]
" a  b   c    ".split(" ", -1, literal: true)
=> ["", "a", "", "b", "", "", "c", "", "", "", ""]
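
Until such an option exists, the literal results shown above can be approximated in plain Ruby. This is a minimal sketch: `split_literal` is a hypothetical helper, not part of Ruby, and it goes through the slower Regexp path that this proposal is meant to avoid.

```ruby
# Hypothetical helper, not part of Ruby: approximates the proposed
# `literal: true` behaviour by splitting on an escaped Regexp.
def split_literal(string, separator, limit = 0)
  string.split(Regexp.new(Regexp.escape(separator)), limit)
end

s = " a  b   c    "
p split_literal(s, " ")      # ["", "a", "", "b", "", "", "c"]
p split_literal(s, " ", -1)  # ["", "a", "", "b", "", "", "c", "", "", "", ""]
```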
