Ruby master - Feature #13780: String#each_grapheme</h1>
<article>
<p>2017-08-03T19:09:01Z</p>
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time that I even heard the<br> term.</p>
</article>
<article>
<p>2017-08-03T21:11:31Z</p>
<p>shevegen (Robert A. Heiler) wrote:</p>
<blockquote>
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time that I even heard the<br> term.</p>
</blockquote>
<p>I think the term is correct and it complements #codepoints and #each_codepoint. In Elixir for example:</p>
<pre><code>"🇺🇸🇦🇫" |> String.codepoints #=> ["🇺", "🇸", "🇦", "🇫"]
"🇺🇸🇦🇫" |> String.graphemes #=> ["🇺🇸", "🇦🇫"]
</code></pre>
</article>
<article>
<p>2017-08-04T13:57:10Z</p>
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li><li><strong>Assignee</strong> set to <i>naruse (Yui NARUSE)</i></li><li><strong>Target version</strong> set to <i>2.5</i></li></ul>
<p>Accepted.<br> I'll introduce this in Ruby 2.5.</p>
</article>
<article>
<p>2017-08-05T10:50:16Z</p>
<p>shan (Shannon Skipper) wrote:</p>
<blockquote>
<p>shevegen (Robert A. Heiler) wrote:</p>
<blockquote>
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time that I even heard the<br> term.</p>
</blockquote>
<p>I think the term is correct and it complements #codepoints and #each_codepoint.
In Elixir for example:</p>
</blockquote>
<p>Elixir's <code>grapheme</code> and Swift's <code>Character</code> refer to Unicode® Standard Annex #29's "Grapheme Cluster".<br> <a href="http://unicode.org/reports/tr29/" class="external">http://unicode.org/reports/tr29/</a><br> The document says grapheme clusters are “user-perceived characters”.</p>
</article>
<article>
<p>2017-08-12T17:40:11Z</p>
<pre><code class="diff syntaxhl" data-language="diff"><span class="gh">diff --git a/NEWS b/NEWS
index 4bfca9240c..1e66e94879 100644
</span><span class="gd">--- a/NEWS
</span><span class="gi">+++ b/NEWS
</span><span class="p">@@ -94,6 +94,7 @@</span> with all sufficient information, see the ChangeLog file or Redmine
   * String#delete_prefix! is added to remove prefix destructively [Feature #12694]
   * String#delete_suffix is added to remove suffix [Feature #13665]
   * String#delete_suffix! is added to remove suffix destructively [Feature #13665]
<span class="gi">+  * String#graphemes is added to enumerate grapheme clusters [Feature #13780]
</span> 
 * Thread
 
<span class="gh">diff --git a/string.c b/string.c
index daef497b3d..dd0daa27e9 100644
</span><span class="gd">--- a/string.c
</span><span class="gi">+++ b/string.c
</span><span class="p">@@ -8066,6 +8066,117 @@</span> rb_str_codepoints(VALUE str)
     return rb_str_enumerate_codepoints(str, 1);
 }
 
<span class="gi">+static VALUE
+rb_str_enumerate_graphemes(VALUE str, int wantarray)
+{
+    regex_t *reg_grapheme = NULL;
+    static regex_t *reg_grapheme_utf8 = NULL;
+    int encidx = ENCODING_GET(str);
+    rb_encoding *enc = rb_enc_from_index(encidx);
+    int unicode_p = rb_enc_unicode_p(enc);
+    const char *ptr, *end;
+    VALUE ary;
+
+    if (!unicode_p) {
+        return rb_str_enumerate_codepoints(str, wantarray);
+    }
+
+    /* synchronize */
+    if (encidx == rb_utf8_encindex() && reg_grapheme_utf8) {
+        reg_grapheme = reg_grapheme_utf8;
+    }
+    if (!reg_grapheme) {
+        const OnigUChar source[] = "\\X";
+        int r = onig_new(&reg_grapheme, source, source + sizeof(source) - 1,
+                         ONIG_OPTION_DEFAULT, enc, OnigDefaultSyntax, NULL);
+        if (r) {
+            rb_bug("cannot compile grapheme cluster regexp");
+        }
+        if (encidx == rb_utf8_encindex()) {
+            reg_grapheme_utf8 = reg_grapheme;
+        }
+    }
+
+    ptr = RSTRING_PTR(str);
+    end = RSTRING_END(str);
+
+    if (rb_block_given_p()) {
+        if (wantarray) {
+#if STRING_ENUMERATORS_WANTARRAY
+            rb_warn("given block not used");
+            ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+#else
+            rb_warning("passing a block to String#codepoints is deprecated");
+            wantarray = 0;
+#endif
+        }
+    }
+    else {
+        if (wantarray)
+            ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+        else
+            return SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
+    }
+
+    while (ptr < end) {
+        VALUE grapheme;
+        OnigPosition len = onig_match(reg_grapheme,
+                                      (const OnigUChar *)ptr, (const OnigUChar *)end,
+                                      (const OnigUChar *)ptr, NULL, 0);
+        if (len == 0) break;
+        if (len < 0) {
+            break;
+        }
+        grapheme = rb_enc_str_new(ptr, len, enc);
+        if (wantarray)
+            rb_ary_push(ary, grapheme);
+        else
+            rb_yield(grapheme);
+        ptr += len;
+    }
+    if (wantarray)
+        return ary;
+    else
+        return str;
+}
+
+/*
+ *  call-seq:
+ *     str.each_grapheme {|cstr| block }    -> str
+ *     str.each_grapheme                    -> an_enumerator
+ *
+ *  Passes each grapheme cluster in <i>str</i> to the given block, or returns
+ *  an enumerator if no block is given.
+ *  Unlike String#each_char, this enumerates by grapheme clusters defined by
+ *  Unicode Standard Annex #29 http://unicode.org/reports/tr29/
+ *
+ *     "a\u0300".each_chars.to_a.size #=> 2
+ *     "a\u0300".each_grapheme.to_a.size #=> 1
+ *
+ */
+
+static VALUE
+rb_str_each_grapheme(VALUE str)
+{
+    return rb_str_enumerate_graphemes(str, 0);
+}
+
+/*
+ *  call-seq:
+ *     str.graphemes   -> an_array
+ *
+ *  Returns an array of grapheme clusters in <i>str</i>. This is a shorthand
+ *  for <code>str.each_grapheme.to_a</code>.
+ *
+ *  If a block is given, which is a deprecated form, works the same as
+ *  <code>each_grapheme</code>.
+ */
+
+static VALUE
+rb_str_graphemes(VALUE str)
+{
+    return rb_str_enumerate_graphemes(str, 1);
+}
</span> 
 static long
 chopped_length(VALUE str)
<span class="p">@@ -10477,6 +10588,7 @@</span> Init_String(void)
     rb_define_method(rb_cString, "bytes", rb_str_bytes, 0);
     rb_define_method(rb_cString, "chars", rb_str_chars, 0);
     rb_define_method(rb_cString, "codepoints", rb_str_codepoints, 0);
<span class="gi">+    rb_define_method(rb_cString, "graphemes", rb_str_graphemes, 0);
</span>     rb_define_method(rb_cString, "reverse", rb_str_reverse, 0);
     rb_define_method(rb_cString, "reverse!", rb_str_reverse_bang, 0);
     rb_define_method(rb_cString, "concat", rb_str_concat_multi, -1);
<span class="p">@@ -10532,6 +10644,7 @@</span> Init_String(void)
     rb_define_method(rb_cString, "each_byte", rb_str_each_byte, 0);
     rb_define_method(rb_cString, "each_char", rb_str_each_char, 0);
     rb_define_method(rb_cString, "each_codepoint", rb_str_each_codepoint, 0);
<span class="gi">+    rb_define_method(rb_cString, "each_grapheme", rb_str_each_grapheme, 0);
</span> 
     rb_define_method(rb_cString, "sum", rb_str_sum, -1);
 
<span class="gh">diff --git a/test/ruby/test_string.rb b/test/ruby/test_string.rb
index e88d749123..e3b44725df 100644
</span><span class="gd">--- a/test/ruby/test_string.rb
</span><span class="gi">+++ b/test/ruby/test_string.rb
</span><span class="p">@@ -885,6 +885,46 @@</span> def test_chars
     end
   end
 
<span class="gi">+  def test_each_grapheme
+    [
+      "\u{20 200d}",
+      "\u{600 600}",
+      "\u{600 20}",
+      "\u{261d 1F3FB}",
+      "\u{1f600}",
+      "\u{20 308}",
+      "\u{1F477 1F3FF 200D 2640 FE0F}",
+      "\u{1F468 200D 1F393}",
+      "\u{1F46F 200D 2642 FE0F}",
+      "\u{1f469 200d 2764 fe0f 200d 1f469}",
+    ].each do |g|
+      assert_equal [g], g.each_grapheme.to_a
+    end
+
+    assert_equal ["\u000A", "\u0308"], "\u{a 308}".each_grapheme.to_a
+    assert_equal ["\u000D", "\u0308"], "\u{d 308}".each_grapheme.to_a
+  end
+
+  def test_graphemes
+    [
+      "\u{20 200d}",
+      "\u{600 600}",
+      "\u{600 20}",
+      "\u{261d 1F3FB}",
+      "\u{1f600}",
+      "\u{20 308}",
+      "\u{1F477 1F3FF 200D 2640 FE0F}",
+      "\u{1F468 200D 1F393}",
+      "\u{1F46F 200D 2642 FE0F}",
+      "\u{1f469 200d 2764 fe0f 200d 1f469}",
+    ].each do |g|
+      assert_equal [g], g.graphemes
+    end
+
+    assert_equal ["\u000A", "\u0308"], "\u{a 308}".graphemes
+    assert_equal ["\u000D", "\u0308"], "\u{d 308}".graphemes
+  end
+
</span>   def test_each_line
     save = $/
     $/ = "\n"
<span class="err"> </span></code></pre>
</article>
<article>
<p>2017-08-13T00:15:40Z</p>
<p>naruse (Yui NARUSE) wrote:</p>
<blockquote>
<pre><code class="diff syntaxhl" data-language="diff"><span class="gi">+    if (!unicode_p) {
+        return rb_str_enumerate_codepoints(str, wantarray);
+    }
</span></code></pre>
</blockquote>
<p>Why codepoints?</p>
</article>
<article>
<p>2017-08-14T10:22:02Z</p>
<p>nobu (Nobuyoshi Nakada) wrote:</p>
<blockquote>
<p>naruse (Yui NARUSE) wrote:</p>
<blockquote>
<pre><code class="diff syntaxhl" data-language="diff"><span class="gi">+    if (!unicode_p) {
+        return rb_str_enumerate_codepoints(str, wantarray);
+    }
</span></code></pre>
</blockquote>
<p>Why codepoints?</p>
</blockquote>
<p>Ah, it should be chars; thanks!</p>
</article>
<article>
<p>2017-08-14T10:57:58Z</p>
<p>Great to see this implemented!</p>
<p>One tiny thing I've noticed:</p>
<ul>
<li>For non-Unicode strings, <code>\X</code> will still match "\r\n" as a single grapheme.
This should probably also be the case with <code>String#each_grapheme</code> - or the difference should be clearly documented</li>
</ul>
</article>
<article>
<p>2017-08-14T11:00:10Z</p>
<p>And a typo in <code>"a\u0300".each_chars.to_a.size #=> 2</code>,<br> should be <code>"a\u0300".each_char.to_a.size #=> 2</code></p>
</article>
<article>
<p>2017-08-31T05:32:32Z</p>
<p><code>grapheme</code> sounds like an element in the grapheme cluster. How about <code>each_grapheme_cluster</code>?<br> If everyone gets used to the <code>grapheme</code> as an alias of <code>grapheme cluster</code>, we'd love to add an alias <code>each_grapheme</code>.</p>
<p>Matz.</p>
</article>
<article>
<p>2017-08-31T06:35:33Z</p>
<ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Closed</i></li></ul>
<p>Applied in changeset trunk|r59698.</p>
<hr>
<p>String#each_grapheme_cluster and String#grapheme_clusters</p>
<p>added to enumerate grapheme clusters [Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: String#each_grapheme (Closed)" href="https://bugs.ruby-lang.org/issues/13780">#13780</a>]</p>
</article>
<article>
<p>2022-02-01T20:26:09Z</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/18563">Feature #18563</a>: Add "graphemes" and "each_grapheme" aliases</i> added</li></ul>
</article>
</main></body></html>