Project

General

Profile

Bug #15343

String#each_grapheme_cluster wrongly splits some emoji (genie, zombie, wrestling)

Added by duerst (Martin Dürst) over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Target version:
ruby -v:
ruby 2.6.0dev (2018-11-26 trunk 65989) [x86_64-cygwin]
[ruby-core:90073]

Description

All the codepoint combinations that turn up in the various emoji files provided by Unicode (currently we use those at https://www.unicode.org/Public/emoji/5.0/) are recognized as grapheme clusters by String#each_grapheme_cluster, except those relating to genies, zombies, and wrestling (THIS IS NOT A JOKE!).

Taking an example from https://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt, line 396:

$ ./ruby -e '"\u{1F9DE 200D 2640 FE0F}".each_grapheme_cluster.to_a.length.display'
2

The correct result is 1, not 2. The sequence of codepoints represents a woman genie.

I will commit the file test/ruby/enc/test_emoji_breaks.rb, which excludes genie, zombie, and wrestling emoji to make sure the tests pass.

I would like to make sure that this is correct for Unicode 10.0.0 before moving to Unicode 11.0.0. I will try to find out how to fix this by myself, but would definitely appreciate help.


Files

debug_X_genie.txt (30.2 KB) debug_X_genie.txt duerst (Martin Dürst), 11/30/2018 05:14 AM
debug_X_elf.txt (29.9 KB) debug_X_elf.txt duerst (Martin Dürst), 11/30/2018 05:14 AM

Related issues

Blocks Ruby master - Feature #15182: Update extended grapheme cluster implementation for Unicode 11Closedduerst (Martin Dürst)Actions
#1

Updated by duerst (Martin Dürst) over 1 year ago

  • Blocks Feature #15182: Update extended grapheme cluster implementation for Unicode 11 added

Updated by shevegen (Robert A. Heiler) over 1 year ago

This issue is epic due to its title alone! (I don't quite know whether
there are indeed genie and zombie emojis yet but it makes me curious.)

except those relating to genies, zombies, and wrestling (THIS IS NOT
A JOKE!).

Awww .... :)

Updated by duerst (Martin Dürst) over 1 year ago

Some data points from a discussion between naruse (Yui NARUSE) and myself:

  • Up to elf (U+1F9DD) is Emoji_Modifier_Base, but genie (U+1F9DE) isn't.

  • Emoji_Modifier only includes skin tones (U+1F3FB-1F3FF, light skin tone..dark skin tone)

  • For experts, that seems to make sense, because there are apparently light and dark elves, but all the zombies have the same half-dead skin color.

  • For 'wrestling' again, it doesn't allow skin colors.

  • So the error seems to appear when an emoji takes male/female specifiers, but isn't allowed to take skin tones.

  • As we are going to rewrite the underlying implementation (function node_extended_grapheme_cluster in regparse.c), we may not care to fix this bug anymore. But if somebody finds a fix, they may want to apply it to older versions of Ruby (2.5 and 2.4).

Updated by duerst (Martin Dürst) over 1 year ago

I had my computer spend about 10h to compile Ruby with regexp debug flags activated. It took that long because while Ruby is building, it starts running Ruby scripts with lots of regexp debug output. (I probably should have deactivated document building and used 2>/dev/null for a bit of speedup.)

Then I was able to try out the above example, attached as debug_X_genie.txt (the exact command was: ./ruby --disable-gems -e '"\u{1F9DE 200D 2640 FE0F}" =~ /\X/' 2>debug_X_genie.txt).

I also did the same for the 'elf' emoji: ./ruby --disable-gems -e '"\u{1F9DD 200D 2640 FE0F}" =~ /\X/' 2>debug_X_elf.txt. File also attached.

The files only differ at the end, when the actual match happens.

Updated by duerst (Martin Dürst) over 1 year ago

  • Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN to 2.4: UNKNOWN, 2.5: REQUIRED
  • Assignee changed from naruse (Yui NARUSE) to duerst (Martin Dürst)
  • Status changed from Open to Closed

Working through Unicode Standard Annex #29 (version 31, for Unicode 10.0.0), I'm not sure all of the code in node_extended_grapheme_cluster() (in regparse.c) is perfect. But this solves an obvious bug, and we'll leave it at that for Unicode 10.0.0.

Also available in: Atom PDF