Feature #19908 (Open)
Update to Unicode 15.1
Description
Unicode 15.1 has been released.
The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break properties, which carry values.
I'm not sure how these properties should be handled.
Should they be written as /\p{InCB_Linker}/,
or as /\p{InCB=Linker}/,
as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 is the former.
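To see which spelling a given Ruby build accepts, one option is to probe the regexp engine directly. This is only a sketch: the helper name property_supported? is hypothetical, and the results depend on the Unicode version the interpreter was built against, so no particular output is implied.

```ruby
# Hypothetical helper: returns true if this Ruby build's regexp engine
# accepts \p{name}, false if it rejects the property name.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

# Candidate spellings discussed in this issue, plus one existing
# long-form property for comparison.
candidates = %w[
  InCB_Linker
  InCB=Linker
  Indic_Conjunct_Break=Linker
  Grapheme_Cluster_Break=Extend
]

candidates.each do |name|
  puts format("%-32s %s", name, property_supported?(name))
end
```

Running this on builds before and after the update shows which of the names were actually registered.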
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
- Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added
Updated by duerst (Martin Dürst) about 1 year ago
There is a more serious issue than just whether to use an '_' or an '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.
Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.
From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
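A quick way to observe the grapheme cluster behaviour of a given build is to split a few strings with \X. This is a minimal sketch; the helper name clusters is hypothetical. The Devanagari sample (ka + virama + ssa) is the kind of sequence the 15.1 conjunct rule is meant to keep together, so it should come out as one cluster under the 15.1 rules and as two under the older rules; the count printed depends on the Ruby build.

```ruby
# Hypothetical helper: split a string into extended grapheme clusters
# using the regexp engine's \X.
def clusters(str)
  str.scan(/\X/)
end

samples = {
  "e + combining acute" => "e\u0301",
  "CR LF"               => "\r\n",
  "ka + virama + ssa"   => "\u0915\u094D\u0937", # count varies by Unicode version
}

samples.each do |label, s|
  puts "#{label}: #{clusters(s).size} cluster(s)"
end
```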
Updated by duerst (Martin Dürst) about 1 year ago
@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?
Updated by duerst (Martin Dürst) about 1 year ago
- Related to Bug #20150: Memory leak in grapheme clusters added
Updated by janosch-x (Janosch Müller) about 1 year ago
Isn't this the updated regular expression?
ccs-base := [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
ccs-extend := [\p{M}\p{Join_Control}]
extended_base := ccs-base
| hangul-syllable
-crlf := CR LF
+crlf := CR LF | CR | LF
legacy-core := hangul-syllable
| ri-sequence
| xpicto-sequence
legacy-postcore := [Extend ZWJ]
core := hangul-syllable
| ri-sequence
| xpicto-sequence
+| conjunctCluster
| [^Control CR LF]
postcore := [Extend ZWJ SpacingMark]
precore := Prepend
hangul-syllable := L* (V+ | LV V* | LVT) T*
| L+
| T+
xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
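The shape of the conjunctCluster rule can be tried out without \p{InCB=...} support by substituting explicit code points. The sketch below is an assumption-laden stand-in, not the real property data: it covers only a Devanagari subset (consonants U+0915..U+0939, the virama U+094D as Linker, nukta U+093C and ZWJ U+200D as Extend), and the names CONJUNCT_CLUSTER and conjunct_cluster? are hypothetical.

```ruby
# Devanagari-only approximation of the conjunctCluster rule.
# Assumed stand-ins (subsets of the real InCB values):
#   Consonant: U+0915..U+0939
#   Linker:    U+094D (virama)
#   Extend:    U+093C (nukta), U+200D (ZWJ)
CONJUNCT_CLUSTER = /
  [\u0915-\u0939]                 # InCB=Consonant
  (?:
    [\u093C\u200D\u094D]*         # any run of Extend or Linker
    \u094D                        # at least one Linker
    [\u093C\u200D\u094D]*         # any run of Extend or Linker
    [\u0915-\u0939]               # InCB=Consonant
  )+
/x

# Hypothetical helper: true if the whole string is one conjunct cluster
# under the approximation above.
def conjunct_cluster?(str)
  str.match?(/\A#{CONJUNCT_CLUSTER}\z/)
end

puts conjunct_cluster?("\u0915\u094D\u0937")  # ka + virama + ssa
puts conjunct_cluster?("\u0915\u0937")        # ka + ssa, no linker
```

A real implementation would take the three classes from the generated property tables rather than hard-coding ranges.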
Updated by duerst (Martin Dürst) about 1 year ago
@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!
Updated by hsbt (Hiroshi SHIBATA) 4 months ago
Unicode 16.0 has been released.
https://www.unicode.org/versions/Unicode16.0.0/
Should we move to this instead of 15.1?
Updated by duerst (Martin Dürst) 4 months ago
- Precedes Feature #20724: Update to Unicode 16.0 added
Updated by duerst (Martin Dürst) 4 months ago
hsbt (Hiroshi SHIBATA) wrote in #note-8:
Unicode 16.0 has been released.
Should we move to this instead of 15.1?
I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.
Updated by hsbt (Hiroshi SHIBATA) 4 months ago
I think it's more prudent to do 15.1 first, then 16.0.
Agreed, thanks!
Updated by hsbt (Hiroshi SHIBATA) 4 months ago
- Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added
Updated by ima1zumi (Mari Imaizumi) 21 days ago
I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.