Bug #15343

String#each_grapheme_cluster wrongly splits some emoji (genie, zombie, wrestling)

Added by duerst (Martin Dürst) over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Target version:
ruby -v: ruby 2.6.0dev (2018-11-26 trunk 65989) [x86_64-cygwin]
[ruby-core:90073]

Description

Every codepoint sequence that appears in the emoji data files provided by Unicode (we currently use those at https://www.unicode.org/Public/emoji/5.0/) is recognized as a single grapheme cluster by String#each_grapheme_cluster, except those relating to genies, zombies, and wrestling (THIS IS NOT A JOKE!).

Taking an example from https://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt, line 396:

$ ./ruby -e '"\u{1F9DE 200D 2640 FE0F}".each_grapheme_cluster.to_a.length.display'
2

The correct result is 1, not 2. The sequence of codepoints represents a woman genie.
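The same check can be scripted for a list of sequences. The following is only a minimal sketch: the helper names (string_from_hex, cluster_count, wrongly_split) are hypothetical and not part of Ruby or of the test suite, and the input is assumed to be space-separated hex codepoints as written in the Unicode emoji files. What it prints depends on the Ruby build being tested.

```ruby
# Hypothetical helper: build a UTF-8 string from space-separated hex
# codepoints, e.g. "1F9DE 200D 2640 FE0F".
def string_from_hex(hex_codepoints)
  hex_codepoints.split.map(&:hex).pack("U*")
end

# Hypothetical helper: number of extended grapheme clusters in a string.
def cluster_count(str)
  str.each_grapheme_cluster.to_a.length
end

# Hypothetical helper: return the sequences that are NOT seen as exactly
# one grapheme cluster by the running Ruby.
def wrongly_split(hex_sequences)
  hex_sequences.reject { |seq| cluster_count(string_from_hex(seq)) == 1 }
end

samples = [
  "1F9DE 200D 2640 FE0F", # woman genie, the sequence from this report
]

wrongly_split(samples).each do |seq|
  puts "split into several clusters: #{seq}"
end
```

On a build without the bug, the loop prints nothing for this sample.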

I will commit the file test/ruby/enc/test_emoji_breaks.rb, which excludes genie, zombie, and wrestling emoji to make sure the tests pass.
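As a rough illustration of that kind of exclusion (not the actual contents of test/ruby/enc/test_emoji_breaks.rb), the sketch below filters sequences by their description before checking them. EXCLUDED_WORDS and EmojiSequence are hypothetical names, and the two sample entries are written out by hand rather than parsed from the Unicode data files.

```ruby
# Hypothetical exclusion list, mirroring the three groups named in this report.
EXCLUDED_WORDS = %w[genie zombie wrestling].freeze

# Hypothetical record: a description plus space-separated hex codepoints.
EmojiSequence = Struct.new(:description, :hex_codepoints) do
  # The sequence as a UTF-8 string.
  def string
    hex_codepoints.split.map(&:hex).pack("U*")
  end

  # True if the description mentions one of the excluded groups.
  def excluded?
    EXCLUDED_WORDS.any? { |word| description.include?(word) }
  end
end

sequences = [
  EmojiSequence.new("woman genie", "1F9DE 200D 2640 FE0F"),
  EmojiSequence.new("family: man, woman, girl", "1F468 200D 1F469 200D 1F467"),
]

# Check only the sequences that are not excluded; report the rest as skipped.
skipped, checked = sequences.partition(&:excluded?)
checked.each do |seq|
  count = seq.string.each_grapheme_cluster.to_a.length
  puts "#{seq.description}: #{count} cluster(s)"
end
puts "skipped: #{skipped.map(&:description).join(', ')}"
```

Matching on the description keeps the exclusion easy to remove once the underlying bug is fixed.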

I would like to make sure that this is correct for Unicode 10.0.0 before moving to Unicode 11.0.0. I will try to find out how to fix this by myself, but would definitely appreciate help.


Files

debug_X_genie.txt (30.2 KB), duerst (Martin Dürst), 11/30/2018 05:14 AM
debug_X_elf.txt (29.9 KB), duerst (Martin Dürst), 11/30/2018 05:14 AM

Related issues

Blocks Ruby master - Feature #15182: Update extended grapheme cluster implementation for Unicode 11 (Closed, duerst (Martin Dürst))
