https://bugs.ruby-lang.org/
2017-08-03T19:09:01Z
Ruby Issue Tracking System
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66015
2017-08-03T19:09:01Z
shevegen (Robert A. Heiler)
shevegen@gmail.com
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time I have even heard the term.</p>
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66016
2017-08-03T21:11:31Z
shan (Shannon Skipper)
<p>shevegen (Robert A. Heiler) wrote:</p>
<blockquote>
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time I have even heard the term.</p>
</blockquote>
<p>I think the term is correct and it complements #codepoints and #each_codepoint. In Elixir for example:</p>
<pre><code>"🇺🇸🇦🇫" |> String.codepoints #=> ["🇺", "🇸", "🇦", "🇫"]
"🇺🇸🇦🇫" |> String.graphemes #=> ["🇺🇸", "🇦🇫"]
</code></pre>
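<p>For comparison, a rough Ruby counterpart of the Elixir snippet might look like the sketch below. It assumes Ruby 2.5 or later and uses <code>grapheme_clusters</code>, the name eventually adopted at the end of this thread; the variable names are only illustrative. Note that Ruby's <code>codepoints</code> returns Integers, so <code>chars</code> is the closer match for Elixir's <code>String.codepoints</code>.</p>

```ruby
# Two flag emoji written as escapes: U+1F1FA U+1F1F8 and U+1F1E6 U+1F1EB.
flags = "\u{1F1FA 1F1F8 1F1E6 1F1EB}"

# One string per codepoint (closest to Elixir's String.codepoints).
codepoint_strings = flags.chars

# One Integer per codepoint.
codepoint_numbers = flags.codepoints

# One string per grapheme cluster (assumes Ruby 2.5+).
clusters = flags.grapheme_clusters

puts codepoint_strings.size # => 4
puts codepoint_numbers.map { |cp| format("U+%04X", cp) }.join(" ")
puts clusters.size          # => 2
```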
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66020
2017-08-04T13:57:10Z
naruse (Yui NARUSE)
naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li><li><strong>Assignee</strong> set to <i>naruse (Yui NARUSE)</i></li><li><strong>Target version</strong> set to <i>2.5</i></li></ul><p>Accepted.<br>
I'll introduce this in Ruby 2.5.</p>
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66041
2017-08-05T10:50:16Z
naruse (Yui NARUSE)
naruse@airemix.jp
<p>shan (Shannon Skipper) wrote:</p>
<blockquote>
<p>shevegen (Robert A. Heiler) wrote:</p>
<blockquote>
<p>My only concern is about the name "grapheme".</p>
<p>I don't know how it is for others but ... this is the first time I have even heard the term.</p>
</blockquote>
<p>I think the term is correct and it complements #codepoints and #each_codepoint. In Elixir for example:</p>
</blockquote>
<p>Elixir's <code>grapheme</code> and Swift's <code>Character</code> refer to the "Grapheme Cluster" defined in Unicode® Standard Annex #29.<br>
<a href="http://unicode.org/reports/tr29/" class="external">http://unicode.org/reports/tr29/</a><br>
The document says grapheme clusters are “user-perceived characters”.</p>
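<p>To illustrate what "user-perceived characters" means in practice, here is a minimal sketch that counts them with the regexp escape <code>\X</code>, which matches one extended grapheme cluster. The helper name <code>perceived_length</code> is hypothetical and not part of Ruby.</p>

```ruby
# Hypothetical helper: counts user-perceived characters by matching \X,
# the regexp escape for one extended grapheme cluster.
def perceived_length(str)
  str.scan(/\X/).size
end

accented = "e\u0301"               # "e" followed by COMBINING ACUTE ACCENT
graduate = "\u{1F468 200D 1F393}"  # MAN, ZERO WIDTH JOINER, GRADUATION CAP

puts accented.length            # => 2 (codepoints)
puts perceived_length(accented) # => 1 (grapheme cluster)
puts graduate.length            # => 3 (codepoints)
puts perceived_length(graduate) # => 1 (grapheme cluster)
```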
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66163
2017-08-12T17:40:11Z
naruse (Yui NARUSE)
naruse@airemix.jp
<ul></ul><pre><code class="diff syntaxhl" data-language="diff"><span class="gh">diff --git a/NEWS b/NEWS
index 4bfca9240c..1e66e94879 100644
</span><span class="gd">--- a/NEWS
</span><span class="gi">+++ b/NEWS
</span><span class="p">@@ -94,6 +94,7 @@</span> with all sufficient information, see the ChangeLog file or Redmine
* String#delete_prefix! is added to remove prefix destructively [Feature #12694]
* String#delete_suffix is added to remove suffix [Feature #13665]
* String#delete_suffix! is added to remove suffix destructively [Feature #13665]
<span class="gi">+ * String#graphemes is added to enumerate grapheme clusters [Feature #13780]
</span>
* Thread
<span class="gh">diff --git a/string.c b/string.c
index daef497b3d..dd0daa27e9 100644
</span><span class="gd">--- a/string.c
</span><span class="gi">+++ b/string.c
</span><span class="p">@@ -8066,6 +8066,117 @@</span> rb_str_codepoints(VALUE str)
return rb_str_enumerate_codepoints(str, 1);
}
<span class="gi">+static VALUE
+rb_str_enumerate_graphemes(VALUE str, int wantarray)
+{
+ regex_t *reg_grapheme = NULL;
+ static regex_t *reg_grapheme_utf8 = NULL;
+ int encidx = ENCODING_GET(str);
+ rb_encoding *enc = rb_enc_from_index(encidx);
+ int unicode_p = rb_enc_unicode_p(enc);
+ const char *ptr, *end;
+ VALUE ary;
+
+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
+
+ /* synchronize */
+ if (encidx == rb_utf8_encindex() && reg_grapheme_utf8) {
+ reg_grapheme = reg_grapheme_utf8;
+ }
+ if (!reg_grapheme) {
+ const OnigUChar source[] = "\\X";
+ int r = onig_new(&reg_grapheme, source, source + sizeof(source) - 1,
+ ONIG_OPTION_DEFAULT, enc, OnigDefaultSyntax, NULL);
+ if (r) {
+ rb_bug("cannot compile grapheme cluster regexp");
+ }
+ if (encidx == rb_utf8_encindex()) {
+ reg_grapheme_utf8 = reg_grapheme;
+ }
+ }
+
+ ptr = RSTRING_PTR(str);
+ end = RSTRING_END(str);
+
+ if (rb_block_given_p()) {
+ if (wantarray) {
+#if STRING_ENUMERATORS_WANTARRAY
+ rb_warn("given block not used");
+ ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+#else
+ rb_warning("passing a block to String#graphemes is deprecated");
+ wantarray = 0;
+#endif
+ }
+ }
+ else {
+ if (wantarray)
+ ary = rb_ary_new_capa(str_strlen(str, enc)); /* str's enc*/
+ else
+ return SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
+ }
+
+ while (ptr < end) {
+ VALUE grapheme;
+ OnigPosition len = onig_match(reg_grapheme,
+ (const OnigUChar *)ptr, (const OnigUChar *)end,
+ (const OnigUChar *)ptr, NULL, 0);
+ if (len == 0) break;
+ if (len < 0) {
+ break;
+ }
+ grapheme = rb_enc_str_new(ptr, len, enc);
+ if (wantarray)
+ rb_ary_push(ary, grapheme);
+ else
+ rb_yield(grapheme);
+ ptr += len;
+ }
+ if (wantarray)
+ return ary;
+ else
+ return str;
+}
+
+/*
+ * call-seq:
+ * str.each_grapheme {|cstr| block } -> str
+ * str.each_grapheme -> an_enumerator
+ *
+ * Passes each grapheme cluster in <i>str</i> to the given block, or returns
+ * an enumerator if no block is given.
+ * Unlike String#each_char, this enumerates by grapheme clusters defined by
+ * Unicode Standard Annex #29 http://unicode.org/reports/tr29/
+ *
+ * "a\u0300".each_chars.to_a.size #=> 2
+ * "a\u0300".each_grapheme.to_a.size #=> 1
+ *
+ */
+
+static VALUE
+rb_str_each_grapheme(VALUE str)
+{
+ return rb_str_enumerate_graphemes(str, 0);
+}
+
+/*
+ * call-seq:
+ * str.graphemes -> an_array
+ *
+ * Returns an array of grapheme clusters in <i>str</i>. This is a shorthand
+ * for <code>str.each_grapheme.to_a</code>.
+ *
+ * If a block is given, which is a deprecated form, works the same as
+ * <code>each_grapheme</code>.
+ */
+
+static VALUE
+rb_str_graphemes(VALUE str)
+{
+ return rb_str_enumerate_graphemes(str, 1);
+}
</span>
static long
chopped_length(VALUE str)
<span class="p">@@ -10477,6 +10588,7 @@</span> Init_String(void)
rb_define_method(rb_cString, "bytes", rb_str_bytes, 0);
rb_define_method(rb_cString, "chars", rb_str_chars, 0);
rb_define_method(rb_cString, "codepoints", rb_str_codepoints, 0);
<span class="gi">+ rb_define_method(rb_cString, "graphemes", rb_str_graphemes, 0);
</span> rb_define_method(rb_cString, "reverse", rb_str_reverse, 0);
rb_define_method(rb_cString, "reverse!", rb_str_reverse_bang, 0);
rb_define_method(rb_cString, "concat", rb_str_concat_multi, -1);
<span class="p">@@ -10532,6 +10644,7 @@</span> Init_String(void)
rb_define_method(rb_cString, "each_byte", rb_str_each_byte, 0);
rb_define_method(rb_cString, "each_char", rb_str_each_char, 0);
rb_define_method(rb_cString, "each_codepoint", rb_str_each_codepoint, 0);
<span class="gi">+ rb_define_method(rb_cString, "each_grapheme", rb_str_each_grapheme, 0);
</span>
rb_define_method(rb_cString, "sum", rb_str_sum, -1);
<span class="gh">diff --git a/test/ruby/test_string.rb b/test/ruby/test_string.rb
index e88d749123..e3b44725df 100644
</span><span class="gd">--- a/test/ruby/test_string.rb
</span><span class="gi">+++ b/test/ruby/test_string.rb
</span><span class="p">@@ -885,6 +885,46 @@</span> def test_chars
end
end
<span class="gi">+ def test_each_grapheme
+ [
+ "\u{20 200d}",
+ "\u{600 600}",
+ "\u{600 20}",
+ "\u{261d 1F3FB}",
+ "\u{1f600}",
+ "\u{20 308}",
+ "\u{1F477 1F3FF 200D 2640 FE0F}",
+ "\u{1F468 200D 1F393}",
+ "\u{1F46F 200D 2642 FE0F}",
+ "\u{1f469 200d 2764 fe0f 200d 1f469}",
+ ].each do |g|
+ assert_equal [g], g.each_grapheme.to_a
+ end
+
+ assert_equal ["\u000A", "\u0308"], "\u{a 308}".each_grapheme.to_a
+ assert_equal ["\u000D", "\u0308"], "\u{d 308}".each_grapheme.to_a
+ end
+
+ def test_graphemes
+ [
+ "\u{20 200d}",
+ "\u{600 600}",
+ "\u{600 20}",
+ "\u{261d 1F3FB}",
+ "\u{1f600}",
+ "\u{20 308}",
+ "\u{1F477 1F3FF 200D 2640 FE0F}",
+ "\u{1F468 200D 1F393}",
+ "\u{1F46F 200D 2642 FE0F}",
+ "\u{1f469 200d 2764 fe0f 200d 1f469}",
+ ].each do |g|
+ assert_equal [g], g.graphemes
+ end
+
+ assert_equal ["\u000A", "\u0308"], "\u{a 308}".graphemes
+ assert_equal ["\u000D", "\u0308"], "\u{d 308}".graphemes
+ end
+
</span> def test_each_line
save = $/
$/ = "\n"
</code></pre>
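<p>The core loop of the patch (match <code>\X</code> at the current position, emit the match, advance by its length) can be approximated in plain Ruby. The following is only a sketch under simplifying assumptions: <code>graphemes_sketch</code> is a hypothetical name, it handles UTF-8 only rather than every Unicode encoding, and it falls back to <code>chars</code> for everything else.</p>

```ruby
require "strscan"

# Hypothetical Ruby approximation of the C loop in the patch above.
# Simplification: only UTF-8 is treated as Unicode here; any other
# encoding falls back to one element per character.
def graphemes_sketch(str)
  return str.chars unless str.encoding == Encoding::UTF_8

  scanner = StringScanner.new(str)
  clusters = []
  until scanner.eos?
    # StringScanner#scan only matches at the current position,
    # which mirrors calling onig_match at ptr.
    cluster = scanner.scan(/\X/)
    # Stop on a failed or empty match, as the C code does.
    break if cluster.nil? || cluster.empty?
    clusters << cluster
  end
  clusters
end

p graphemes_sketch("a\u0300").size   # => 1
p graphemes_sketch("\u{a 308}").size # => 2
```

<p>The two calls mirror cases from the patch: a base letter plus combining mark forms one cluster, while a line feed followed by a combining diaeresis stays as two.</p>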
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66164
2017-08-13T00:15:40Z
nobu (Nobuyoshi Nakada)
nobu@ruby-lang.org
<p>naruse (Yui NARUSE) wrote:</p>
<blockquote>
<pre><code class="diff syntaxhl" data-language="diff"><span class="gi">+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
</span></code></pre>
</blockquote>
<p>Why codepoints?</p>
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66169
2017-08-14T10:22:02Z
naruse (Yui NARUSE)
naruse@airemix.jp
<p>nobu (Nobuyoshi Nakada) wrote:</p>
<blockquote>
<p>naruse (Yui NARUSE) wrote:</p>
<blockquote>
<pre><code class="diff syntaxhl" data-language="diff"><span class="gi">+ if (!unicode_p) {
+ return rb_str_enumerate_codepoints(str, wantarray);
+ }
</span></code></pre>
</blockquote>
<p>Why codepoints?</p>
</blockquote>
<p>Ah, it should be chars; thanks!</p>
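<p>The difference matters because <code>codepoints</code> yields Integers while <code>chars</code> yields Strings, so a codepoint fallback would change the element type for non-Unicode strings. A small illustration, using ISO-8859-1 purely as an example of a non-Unicode encoding:</p>

```ruby
# ISO-8859-1 is chosen only as an example of a non-Unicode encoding.
latin1 = "caf\u00E9".encode("ISO-8859-1")

numbers = latin1.codepoints # Integers, one per character
strings = latin1.chars      # one-character Strings, still ISO-8859-1

p numbers                              # => [99, 97, 102, 233]
p strings.map(&:class).uniq            # => [String]
p strings.all? { |c| c.encoding == Encoding::ISO_8859_1 } # => true
```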
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66171
2017-08-14T10:57:58Z
rbjl (Jan Lelis)
hi@ruby.consulting
<p>Great to see this implemented!</p>
<p>One tiny thing I've noticed:</p>
<ul>
<li>For non-Unicode strings, <code>\X</code> will still match "\r\n" as a single grapheme. This should probably also be the case with <code>String#each_grapheme</code>, or the difference should be clearly documented.</li>
</ul>
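<p>A quick way to observe the point about "\r\n" is to compare <code>\X</code> with a per-character split. This is a sketch only: ISO-8859-1 stands in for "non-Unicode", and the result of the non-Unicode <code>\X</code> scan is printed rather than asserted, since it depends on the regexp engine behaviour described above.</p>

```ruby
utf8_crlf   = "\r\n"
latin1_crlf = "\r\n".encode("ISO-8859-1") # example non-Unicode encoding

# In a Unicode string, CR LF is one grapheme cluster under UAX #29.
utf8_clusters = utf8_crlf.scan(/\X/)
p utf8_clusters # => ["\r\n"]

# A per-character split always yields two elements.
latin1_chars = latin1_crlf.chars
p latin1_chars  # => ["\r", "\n"]

# Per the comment above, \X is expected to keep CR LF together here too;
# print the result to check it on your Ruby version.
p latin1_crlf.scan(/\X/)
```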
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66172
2017-08-14T11:00:10Z
rbjl (Jan Lelis)
hi@ruby.consulting
<p>There is also a typo: <code>"a\u0300".each_chars.to_a.size #=> 2</code><br>
should be <code>"a\u0300".each_char.to_a.size #=> 2</code>.</p>
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66354
2017-08-31T05:32:32Z
matz (Yukihiro Matsumoto)
matz@ruby.or.jp
<p><code>grapheme</code> sounds like an element in the grapheme cluster. How about <code>each_grapheme_cluster</code>?<br>
Once everyone is used to <code>grapheme</code> as shorthand for <code>grapheme cluster</code>, we'd be happy to add <code>each_grapheme</code> as an alias.</p>
<p>Matz.</p>
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=66367
2017-08-31T06:35:33Z
naruse (Yui NARUSE)
naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Closed</i></li></ul><p>Applied in changeset trunk|r59698.</p>
<hr>
<p>String#each_grapheme_cluster and String#grapheme_clusters</p>
<p>added to enumerate grapheme clusters [Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: String#each_grapheme (Closed)" href="https://bugs.ruby-lang.org/issues/13780">#13780</a>]</p>
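<p>With the names as committed, usage looks roughly like the sketch below (assuming Ruby 2.5 or later). The sample strings are taken from the documentation comment and tests in the patch earlier in this thread; the variable names are illustrative.</p>

```ruby
combined = "a\u0300" # "a" followed by COMBINING GRAVE ACCENT

# Array form.
clusters = combined.grapheme_clusters
p clusters.size # => 1

# Enumerator form (no block given).
enum = combined.each_grapheme_cluster
p enum.to_a.size # => 1

# Block form: yields each cluster in turn.
seen = []
combined.each_grapheme_cluster { |cluster| seen << cluster }
p seen == clusters # => true

# For contrast, each_char splits the base letter from the combining mark.
p combined.each_char.to_a.size # => 2
```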
Ruby master - Feature #13780: String#each_grapheme
https://bugs.ruby-lang.org/issues/13780?journal_id=96317
2022-02-01T20:26:09Z
mame (Yusuke Endoh)
mame@ruby-lang.org
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/18563">Feature #18563</a>: Add "graphemes" and "each_grapheme" aliases</i> added</li></ul>