Feature #19908
closed
Update to Unicode 15.1
Description
Unicode 15.1 has been released.
The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break
properties, which carry values.
I'm not sure how these properties should be handled:
/\p{InCB_Linker}/
or /\p{InCB=Linker}/
as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
- Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added
Updated by duerst (Martin Dürst) about 1 year ago
There is a more serious issue than whether to use an '_' or an '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.
Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.
From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
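For readers who want to see where this surfaces in Ruby code, here is a minimal sketch of grapheme cluster segmentation through `String#grapheme_clusters` and `/\X/`. The sample strings are illustrative choices, not taken from this issue. The count for the Devanagari conjunct is deliberately not stated, since it depends on which Unicode version the running Ruby implements.

```ruby
# Grapheme cluster segmentation as exposed to Ruby code.
# All sample strings are illustrative assumptions.

COMBINING = "e\u0301"             # "e" + COMBINING ACUTE ACCENT (an Extend character)
CRLF      = "\r\n"                # CR LF forms a single cluster
CONJUNCT  = "\u0915\u094D\u0937"  # DEVANAGARI KA + VIRAMA + SSA

p COMBINING.grapheme_clusters.size  # base letter plus combining mark: 1
p CRLF.scan(/\X/).size              # 1

# The result here depends on the grapheme cluster rules in use:
# the pre-15.1 rules break after the virama, the 15.1 rules do not.
p CONJUNCT.grapheme_clusters.size
```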
Updated by duerst (Martin Dürst) about 1 year ago
@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?
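As a point of comparison, this is roughly how the existing long-name `=` form is used in a Ruby regular expression. It is a small sketch that assumes the running Ruby exposes the `Grapheme_Cluster_Break=...` property names to user patterns. The sample characters are my own choices.

```ruby
# Existing "Name=Value" property syntax, shown with Grapheme_Cluster_Break.
# Assumption: these property names are available in the running Ruby.

GCB_EXTEND       = /\p{Grapheme_Cluster_Break=Extend}/
GCB_SPACING_MARK = /\p{Grapheme_Cluster_Break=SpacingMark}/

p GCB_EXTEND.match?("\u0301")        # COMBINING ACUTE ACCENT
p GCB_SPACING_MARK.match?("\u0903")  # DEVANAGARI SIGN VISARGA
p GCB_EXTEND.match?("a")             # a plain letter has neither value
```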
Updated by duerst (Martin Dürst) about 1 year ago
- Related to Bug #20150: Memory leak in grapheme clusters added
Updated by janosch-x (Janosch Müller) about 1 year ago
Isn't this the updated regular expression?
ccs-base := [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
ccs-extend := [\p{M}\p{Join_Control}]
extended_base := ccs-base
| hangul-syllable
-crlf := CR LF
+crlf := CR LF | CR | LF
legacy-core := hangul-syllable
| ri-sequence
| xpicto-sequence
legacy-postcore := [Extend ZWJ]
core := hangul-syllable
| ri-sequence
| xpicto-sequence
+| conjunctCluster
| [^Control CR LF]
postcore := [Extend ZWJ SpacingMark]
precore := Prepend
hangul-syllable := L* (V+ | LV V* | LVT) T*
| L+
| T+
xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
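To make the shape of the new conjunctCluster rule concrete, here is a rough sketch that transcribes it into a Ruby pattern. Because it is not assumed here that `\p{InCB=...}` is available, the three classes are replaced by small hand-picked Devanagari stand-ins: U+0915..U+0939 for Consonant, U+094D (virama) for Linker, and U+093C (nukta) plus U+200D (ZWJ) for Extend. These sets are illustrative subsets only; the real property values cover more characters and several scripts.

```ruby
# Sketch of the conjunctCluster rule with hypothetical stand-in sets
# (Devanagari only) instead of the real InCB property values.
CONJUNCT_CLUSTER = /
  [\u0915-\u0939]             # stand-in for InCB=Consonant
  (?:
    [\u093C\u200D\u094D]*     # any mix of Extend and Linker
    \u094D                    # at least one Linker (virama)
    [\u093C\u200D\u094D]*     # any mix of Extend and Linker
    [\u0915-\u0939]           # next consonant
  )+
/x

KSSA       = "\u0915\u094D\u0937"        # KA + VIRAMA + SSA
KSSA_NUKTA = "\u0915\u093C\u094D\u0937"  # KA + NUKTA + VIRAMA + SSA
NO_LINKER  = "\u0915\u0937"              # KA + SSA, no virama between them

p CONJUNCT_CLUSTER.match(KSSA)[0].length        # whole string: 3 code points
p CONJUNCT_CLUSTER.match(KSSA_NUKTA)[0].length  # whole string: 4 code points
p CONJUNCT_CLUSTER.match(NO_LINKER)             # nil, a Linker is required
```

The pattern only covers the conjunctCluster alternative of core; a full grapheme cluster matcher would also need the precore, postcore, and other core alternatives from the table above.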
Updated by duerst (Martin Dürst) about 1 year ago
@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!
Updated by hsbt (Hiroshi SHIBATA) 7 months ago
Unicode 16.0 has been released.
https://www.unicode.org/versions/Unicode16.0.0/
Should we move to this instead of 15.1?
Updated by duerst (Martin Dürst) 7 months ago
- Precedes Feature #20724: Update to Unicode 16.0 added
Updated by duerst (Martin Dürst) 7 months ago
hsbt (Hiroshi SHIBATA) wrote in #note-8:
Unicode 16.0 has been released.
Should we move to this instead of 15.1?
I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.
Updated by hsbt (Hiroshi SHIBATA) 7 months ago
I think it's more prudent to do 15.1 first, then 16.0.
Agreed, thanks!
Updated by hsbt (Hiroshi SHIBATA) 6 months ago
- Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added
Updated by ima1zumi (Mari Imaizumi) 3 months ago
I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.
Updated by mame (Yusuke Endoh) 15 days ago
@duerst (Martin Dürst) What do you think?
Updated by ima1zumi (Mari Imaizumi) 15 days ago
I have created a PR to update it.
Updated by naruse (Yui NARUSE) 12 days ago
The change looks good to me.
Since you have already contributed to reline and shown your engineering skill, and now you also want to contribute to ruby/ruby, I think you should have commit rights for ruby/ruby and commit this change yourself.
@matz (Yukihiro Matsumoto) What do you think?
Updated by ima1zumi (Mari Imaizumi) 12 days ago
@naruse (Yui NARUSE)
Thank you so much for your review and recommending me. I’d be happy to take on commit rights and commit this change myself.
Updated by mame (Yusuke Endoh) 12 days ago
I'd also like to introduce ima1zumi-san as a candidate for committer. She has been actively working on irb and reline, has deep knowledge and a strong interest in character encoding, and is highly recognized, as she was endorsed by @naruse (Yui NARUSE), the maintainer of Ruby's encoding system. With her contributions extending towards Ruby itself, I support her nomination.
Updated by matz (Yukihiro Matsumoto) 9 days ago
#note-16 Approved.
Matz.
Updated by hsbt (Hiroshi SHIBATA) 9 days ago
@ima1zumi (Mari Imaizumi) Can you provide the required information to me? See https://github.com/ruby/ruby/wiki/Committer-How-To#how-to-register-you-as-a-committer for details.
Updated by ima1zumi (Mari Imaizumi) 9 days ago
@hsbt (Hiroshi SHIBATA)
I've sent an email to cvs-admin and opened https://github.com/ruby/git.ruby-lang.org/pull/91
Updated by hsbt (Hiroshi SHIBATA) 8 days ago
Thanks, I've finished preparing your account.
Updated by ima1zumi (Mari Imaizumi) 8 days ago
- Status changed from Assigned to Closed
Applied in changeset git|e63c516046b6dbf2f684454b68013b4eea12e94a.
[Feature #19908] Update Unicode headers to 15.1.0