Feature #13780
Closed
String#each_grapheme
Description
Ruby's regex engine has support for graphemes via \X:
https://github.com/k-takata/Onigmo/blob/791140951eefcf17db4e762e789eb046ea8a114c/doc/RE#L117-L124
This is really useful when working with Unicode strings. However, code like string.scan(/\X/) is not very readable, which might lead people to use String#each_char when they really should split by graphemes.
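A minimal sketch of the difference, using only methods that already exist (String#each_char and String#scan). The sample string is the same one used later in the patch documentation; the variable names are just for illustration.

```ruby
# "a" followed by U+0300 COMBINING GRAVE ACCENT: one user-visible character,
# but two code points.
s = "a\u0300"

# each_char splits per character (code point), separating the accent.
by_char = s.each_char.to_a

# \X matches one extended grapheme cluster, keeping base and accent together.
by_cluster = s.scan(/\X/)

puts by_char.size     # 2
puts by_cluster.size  # 1
```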
What I propose is two new methods:
- String#each_grapheme, which returns an Enumerator of graphemes (in the same way as \X), and
- String#graphemes, which returns an Array of graphemes (in the same way as \X)
What do you think?
Resources
- Unicode® Standard Annex #29: Unicode Text Segmentation: http://unicode.org/reports/tr29/
- Related issue: https://bugs.ruby-lang.org/issues/12831
Updated by shevegen (Robert A. Heiler) over 7 years ago
My only concern is about the name "grapheme".
I don't know how it is for others but ... this is the first time that I even heard the
term.
Updated by shan (Shannon Skipper) over 7 years ago
shevegen (Robert A. Heiler) wrote:
My only concern is about the name "grapheme".
I don't know how it is for others but ... this is the first time that I even heard the
term.
I think the term is correct and it complements #codepoints and #each_codepoint. In Elixir for example:
"🇺🇸🇦🇫" |> String.codepoints #=> ["🇺", "🇸", "🇦", "🇫"]
"🇺🇸🇦🇫" |> String.graphemes #=> ["🇺🇸", "🇦🇫"]
Updated by naruse (Yui NARUSE) over 7 years ago
- Status changed from Open to Assigned
- Assignee set to naruse (Yui NARUSE)
- Target version set to 2.5
Accepted.
I'll introduce this in Ruby 2.5.
Updated by naruse (Yui NARUSE) over 7 years ago
shan (Shannon Skipper) wrote:
shevegen (Robert A. Heiler) wrote:
My only concern is about the name "grapheme".
I don't know how it is for others but ... this is the first time that I even heard the
term.
I think the term is correct and it complements #codepoints and #each_codepoint. In Elixir for example:
Elixir's grapheme and Swift's Character refer to Unicode® Standard Annex #29's "Grapheme Cluster".
http://unicode.org/reports/tr29/
The document says grapheme clusters are “user-perceived characters”.
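To make "user-perceived characters" concrete, the Elixir example quoted above can be approximated in Ruby with methods that already exist. This is only a sketch: the string is the two flag emoji (US and AF) written as escapes, and it assumes a Ruby whose \X pairs regional indicator symbols, which current releases do. Note that String#codepoints returns Integers in Ruby, not one-character strings as in Elixir.

```ruby
# Four regional indicator symbols: U S A F, rendered as two flags.
flags = "\u{1F1FA 1F1F8 1F1E6 1F1EB}"

# Code points: four Integers.
flag_codepoints = flags.codepoints

# Grapheme clusters: each pair of regional indicators forms one flag.
flag_clusters = flags.scan(/\X/)

puts flag_codepoints.size  # 4
puts flag_clusters.size    # 2
```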
Updated by naruse (Yui NARUSE) over 7 years ago
diff --git a/NEWS b/NEWS
index 4bfca9240c..1e66e94879 100644
--- a/NEWS
+++ b/NEWS
@@ -94,6 +94,7 @@ with all sufficient information, see the ChangeLog file or Redmine
* String#delete_prefix! is added to remove prefix destructively [Feature #12694]
* String#delete_suffix is added to remove suffix [Feature #13665]
* String#delete_suffix! is added to remove suffix destructively [Feature #13665]
+ * String#graphemes is added to enumerate grapheme clusters [Feature #13780]
* Thread
diff --git a/string.c b/string.c
index daef497b3d..dd0daa27e9 100644
--- a/string.c
+++ b/string.c
@@ -8066,6 +8066,117 @@ rb_str_codepoints(VALUE str)
return rb_str_enumerate_codepoints(str, 1);
}
+static VALUE
+rb_str_enumerate_graphemes(VALUE str, int wantarray)
+{
+ regex_t *reg_grapheme = NULL;
+ static regex_t *reg_grapheme_utf8 = NULL;
+ int encidx = ENCODING_GET(str);
+ rb_encoding *enc = rb_enc_from_index(encidx);
+ int unicode_p = rb_enc_unicode_p(enc);
+ const char *ptr, *end;
+ VALUE ary;
+
+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
+
+ /* synchronize */
+ if (encidx == rb_utf8_encindex() && reg_grapheme_utf8) {
+ reg_grapheme = reg_grapheme_utf8;
+ }
+ if (!reg_grapheme) {
+ const OnigUChar source[] = "\\X";
+ int r = onig_new(&reg_grapheme, source, source + sizeof(source) - 1,
+ ONIG_OPTION_DEFAULT, enc, OnigDefaultSyntax, NULL);
+ if (r) {
+ rb_bug("cannot compile grapheme cluster regexp");
+ }
+ if (encidx == rb_utf8_encindex()) {
+ reg_grapheme_utf8 = reg_grapheme;
+ }
+ }
+
+ ptr = RSTRING_PTR(str);
+ end = RSTRING_END(str);
+
+ if (rb_block_given_p()) {
+ if (wantarray) {
+#if STRING_ENUMERATORS_WANTARRAY
+ rb_warn("given block not used");
+ ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+#else
+ rb_warning("passing a block to String#graphemes is deprecated");
+ wantarray = 0;
+#endif
+ }
+ }
+ else {
+ if (wantarray)
+ ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+ else
+ return SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
+ }
+
+ while (ptr < end) {
+ VALUE grapheme;
+ OnigPosition len = onig_match(reg_grapheme,
+ (const OnigUChar *)ptr, (const OnigUChar *)end,
+ (const OnigUChar *)ptr, NULL, 0);
+ if (len == 0) break;
+ if (len < 0) {
+ break;
+ }
+ grapheme = rb_enc_str_new(ptr, len, enc);
+ if (wantarray)
+ rb_ary_push(ary, grapheme);
+ else
+ rb_yield(grapheme);
+ ptr += len;
+ }
+ if (wantarray)
+ return ary;
+ else
+ return str;
+}
+
+/*
+ * call-seq:
+ * str.each_grapheme {|cstr| block } -> str
+ * str.each_grapheme -> an_enumerator
+ *
+ * Passes each grapheme cluster in <i>str</i> to the given block, or returns
+ * an enumerator if no block is given.
+ * Unlike String#each_char, this enumerates by grapheme clusters defined by
+ * Unicode Standard Annex #29 http://unicode.org/reports/tr29/
+ *
+ * "a\u0300".each_chars.to_a.size #=> 2
+ * "a\u0300".each_grapheme.to_a.size #=> 1
+ *
+ */
+
+static VALUE
+rb_str_each_grapheme(VALUE str)
+{
+ return rb_str_enumerate_graphemes(str, 0);
+}
+
+/*
+ * call-seq:
+ * str.graphemes -> an_array
+ *
+ * Returns an array of grapheme clusters in <i>str</i>. This is a shorthand
+ * for <code>str.each_grapheme.to_a</code>.
+ *
+ * If a block is given, which is a deprecated form, works the same as
+ * <code>each_grapheme</code>.
+ */
+
+static VALUE
+rb_str_graphemes(VALUE str)
+{
+ return rb_str_enumerate_graphemes(str, 1);
+}
static long
chopped_length(VALUE str)
@@ -10477,6 +10588,7 @@ Init_String(void)
rb_define_method(rb_cString, "bytes", rb_str_bytes, 0);
rb_define_method(rb_cString, "chars", rb_str_chars, 0);
rb_define_method(rb_cString, "codepoints", rb_str_codepoints, 0);
+ rb_define_method(rb_cString, "graphemes", rb_str_graphemes, 0);
rb_define_method(rb_cString, "reverse", rb_str_reverse, 0);
rb_define_method(rb_cString, "reverse!", rb_str_reverse_bang, 0);
rb_define_method(rb_cString, "concat", rb_str_concat_multi, -1);
@@ -10532,6 +10644,7 @@ Init_String(void)
rb_define_method(rb_cString, "each_byte", rb_str_each_byte, 0);
rb_define_method(rb_cString, "each_char", rb_str_each_char, 0);
rb_define_method(rb_cString, "each_codepoint", rb_str_each_codepoint, 0);
+ rb_define_method(rb_cString, "each_grapheme", rb_str_each_grapheme, 0);
rb_define_method(rb_cString, "sum", rb_str_sum, -1);
diff --git a/test/ruby/test_string.rb b/test/ruby/test_string.rb
index e88d749123..e3b44725df 100644
--- a/test/ruby/test_string.rb
+++ b/test/ruby/test_string.rb
@@ -885,6 +885,46 @@ def test_chars
end
end
+ def test_each_grapheme
+ [
+ "\u{20 200d}",
+ "\u{600 600}",
+ "\u{600 20}",
+ "\u{261d 1F3FB}",
+ "\u{1f600}",
+ "\u{20 308}",
+ "\u{1F477 1F3FF 200D 2640 FE0F}",
+ "\u{1F468 200D 1F393}",
+ "\u{1F46F 200D 2642 FE0F}",
+ "\u{1f469 200d 2764 fe0f 200d 1f469}",
+ ].each do |g|
+ assert_equal [g], g.each_grapheme.to_a
+ end
+
+ assert_equal ["\u000A", "\u0308"], "\u{a 308}".each_grapheme.to_a
+ assert_equal ["\u000D", "\u0308"], "\u{d 308}".each_grapheme.to_a
+ end
+
+ def test_graphemes
+ [
+ "\u{20 200d}",
+ "\u{600 600}",
+ "\u{600 20}",
+ "\u{261d 1F3FB}",
+ "\u{1f600}",
+ "\u{20 308}",
+ "\u{1F477 1F3FF 200D 2640 FE0F}",
+ "\u{1F468 200D 1F393}",
+ "\u{1F46F 200D 2642 FE0F}",
+ "\u{1f469 200d 2764 fe0f 200d 1f469}",
+ ].each do |g|
+ assert_equal [g], g.graphemes
+ end
+
+ assert_equal ["\u000A", "\u0308"], "\u{a 308}".graphemes
+ assert_equal ["\u000D", "\u0308"], "\u{d 308}".graphemes
+ end
+
def test_each_line
save = $/
$/ = "\n"
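The matching loop in the string.c part of the patch (match \X at the current position, collect the match, advance by its length, stop on an empty or failed match) can be sketched in plain Ruby. This is not the patch itself, only an illustration: clusters_sketch is a hypothetical helper name, and StringScanner from the standard library stands in for the direct onig_match calls.

```ruby
require "strscan"

# Hypothetical helper mirroring the C loop: returns an Array of clusters.
def clusters_sketch(str)
  scanner = StringScanner.new(str)
  clusters = []
  until scanner.eos?
    # scan matches only at the current position, like onig_match in the patch.
    cluster = scanner.scan(/\X/)
    # Stop on a failed or empty match so the loop cannot spin forever.
    break if cluster.nil? || cluster.empty?
    clusters << cluster
  end
  clusters
end

p clusters_sketch("a\u0300b")
```

Returning an Array corresponds to the wantarray branch of the patch; the block form would yield each cluster instead of pushing it.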
Updated by nobu (Nobuyoshi Nakada) over 7 years ago
naruse (Yui NARUSE) wrote:
+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
Why codepoints?
Updated by naruse (Yui NARUSE) over 7 years ago
nobu (Nobuyoshi Nakada) wrote:
naruse (Yui NARUSE) wrote:
+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
Why codepoints?
Ah, it should be chars; thanks!
Updated by rbjl (Jan Lelis) over 7 years ago
Great to see this implemented!
One tiny thing I've noticed:
- For non-Unicode strings, \X will still match "\r\n" as a single grapheme. This should probably also be the case with String#each_grapheme, or the difference should be clearly documented.
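A small sketch of the behaviour described here, using only String#scan and String#each_char. EUC-JP is picked arbitrarily as a non-Unicode encoding; the expectation for \X follows the Onigmo documentation linked in the description, which defines the non-Unicode case as CR LF or any single character.

```ruby
# CR LF in a non-Unicode encoding (EUC-JP chosen only as an example).
crlf = "\r\n".encode("EUC-JP")

# \X still treats CR LF as one unit, even without Unicode rules.
crlf_by_regex = crlf.scan(/\X/)

# A per-character fallback, as in the posted patch, splits it in two.
crlf_by_char = crlf.each_char.to_a

puts crlf_by_regex.size  # 1
puts crlf_by_char.size   # 2
```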
Updated by rbjl (Jan Lelis) over 7 years ago
And a typo: "a\u0300".each_chars.to_a.size #=> 2
should be "a\u0300".each_char.to_a.size #=> 2
Updated by matz (Yukihiro Matsumoto) about 7 years ago
grapheme sounds like an element in the grapheme cluster. How about each_grapheme_cluster?
If everyone gets used to the grapheme as an alias of grapheme cluster, we'd love to add an alias each_grapheme.
Matz.
Updated by naruse (Yui NARUSE) about 7 years ago
- Status changed from Assigned to Closed
Applied in changeset trunk|r59698.
String#each_grapheme_cluster and String#grapheme_clusters
added to enumerate grapheme clusters [Feature #13780]
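For reference, a short usage sketch of the two methods under the names they were committed with. It assumes Ruby 2.5 or later; the sample strings are taken from the documentation and tests in the patch above.

```ruby
# Requires Ruby 2.5+, where these methods were introduced.
accented = "a\u0300"

# Array form.
accented_clusters = accented.grapheme_clusters

# Enumerator form (no block given).
accented_enum = accented.each_grapheme_cluster

# A ZWJ emoji sequence from the patch's tests: one cluster, three code points.
zwj_clusters = "\u{1F468 200D 1F393}".grapheme_clusters

# A combining mark does not attach to a preceding line feed.
lf_clusters = "\u{a 308}".grapheme_clusters

puts accented_clusters.size   # 1
puts accented.each_char.count # 2
```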
Updated by mame (Yusuke Endoh) almost 3 years ago
- Related to Feature #18563: Add "graphemes" and "each_grapheme" aliases added