Feature #19908

closed

Update to Unicode 15.1

Added by nobu (Nobuyoshi Nakada) over 1 year ago. Updated 8 days ago.

Status:
Closed
Target version:
-
[ruby-core:114936]

Description

Unicode 15.1 has been released.

The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break properties, which carry values.

I'm not sure how best to handle these properties.
Should it be /\p{InCB_Linker}/ or /\p{InCB=Linker}/, as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.


Related issues 4 (2 open, 2 closed)

Related to Ruby - Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want [Assigned, nobu (Nobuyoshi Nakada)]
Related to Ruby - Bug #20150: Memory leak in grapheme clusters [Closed]
Has duplicate Ruby - Feature #19171: Update Unicode data to Unicode Version 15.1 [Closed, duerst (Martin Dürst)]
Precedes Ruby - Feature #20724: Update to Unicode 16.0 [Assigned, duerst (Martin Dürst)]

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added

Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

  • Target version deleted (3.3)

Updated by duerst (Martin Dürst) about 1 year ago

There is a more serious issue than just whether to use an '_' or an '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.

Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.

From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
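
For readers who want to see where this surfaces in Ruby: extended grapheme clusters are what `String#grapheme_clusters` and the regexp escape `\X` produce. The following is a minimal sketch that only uses cases whose segmentation should not depend on the Unicode version in use; the helper name `cluster_count` is hypothetical.

```ruby
# Count user-perceived characters (extended grapheme clusters) in a string.
# `cluster_count` is a hypothetical helper name, not part of Ruby.
def cluster_count(str)
  str.grapheme_clusters.length
end

decomposed_e = "e\u0301"          # "e" followed by COMBINING ACUTE ACCENT
puts decomposed_e.length          # code points: 2
puts cluster_count(decomposed_e)  # grapheme clusters: 1
puts cluster_count("\r\n")        # CR LF stays together: 1

# \X in a regexp matches one extended grapheme cluster at a time.
p "#{decomposed_e}x".scan(/\X/).length  # 2
```

Indic sequences of the form consonant, virama, consonant are deliberately left out here, since they are exactly the cases whose result may change with the 15.1 rules.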

Updated by duerst (Martin Dürst) about 1 year ago

@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?
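
One way to check which of these spellings a given Ruby build accepts is to try compiling each one. This is only a sketch: `property_supported?` is a hypothetical helper, the list of names simply repeats the candidates from this thread, and the output depends on the Ruby version, so none is shown.

```ruby
# Return true if the running Ruby accepts \p{name} in a regexp.
# `property_supported?` is a hypothetical helper, not part of Ruby.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Regexp.new raises RegexpError for an unknown property name.
  false
end

# Candidate spellings discussed in this issue (results vary by Ruby version).
%w[
  Grapheme_Cluster_Break=Extend
  InCB_Linker
  InCB=Linker
  Indic_Conjunct_Break=Linker
].each do |name|
  puts format("%-32s %s", name, property_supported?(name))
end
```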

Updated by duerst (Martin Dürst) about 1 year ago

  • Related to Bug #20150: Memory leak in grapheme clusters added

Updated by janosch-x (Janosch Müller) about 1 year ago

Isn't this the updated regular expression?

 ccs-base :=     [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend :=  [\p{M}\p{Join_Control}]
 extended_base :=       ccs-base
 | hangul-syllable
-crlf :=        CR LF
+crlf :=        CR LF | CR | LF
 legacy-core := hangul-syllable
 | ri-sequence
 | xpicto-sequence
 legacy-postcore :=    [Extend ZWJ]
 core :=        hangul-syllable
 | ri-sequence
 | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore :=    [Extend ZWJ SpacingMark]
 precore :=     Prepend
 hangul-syllable :=    L* (V+ | LV V* | LVT) T*
 | L+
 | T+
 xpicto-sequence :=     \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster :=     \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
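
To make the new conjunctCluster rule concrete, here is a rough Ruby sketch of it restricted to Devanagari. It does not use `\p{InCB=...}`; instead it assumes small hand-picked stand-in ranges for the three InCB classes, which are illustrative subsets and not the full Unicode 15.1 property sets. The names `CONJUNCT_CLUSTER` and `conjunct_clusters` are hypothetical.

```ruby
# Simplified Devanagari stand-ins for the Unicode 15.1 InCB classes.
# Assumption: these ranges are illustrative subsets, not the full property sets.
#   Consonant: U+0915..U+0939 (KA..HA)
#   Linker:    U+094D (VIRAMA)
#   Extend:    U+093C (NUKTA), U+200D (ZWJ)
CONJUNCT_CLUSTER = /
  [\u0915-\u0939]            # InCB=Consonant
  (?:
    [\u093C\u200D\u094D]*    # any run of Extend or Linker
    \u094D                   # at least one Linker
    [\u093C\u200D\u094D]*    # any run of Extend or Linker
    [\u0915-\u0939]          # InCB=Consonant
  )+
/x

# Return every conjunct cluster found in str (hypothetical helper).
def conjunct_clusters(str)
  str.scan(CONJUNCT_CLUSTER)
end

ksha = "\u0915\u094D\u0937"                 # KA + VIRAMA + SSA
p conjunct_clusters(ksha).length            # 1
p conjunct_clusters("\u0915\u0937").length  # 0, no linker between the consonants
```

The middle part mirrors the rule's requirement that at least one Linker appears between two consonants, with optional Extend or Linker characters on either side of it.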

Updated by duerst (Martin Dürst) about 1 year ago

@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!

Updated by hsbt (Hiroshi SHIBATA) 7 months ago

Unicode 16.0 has been released.

https://www.unicode.org/versions/Unicode16.0.0/

Should we move to this instead of 15.1?

Updated by duerst (Martin Dürst) 7 months ago

hsbt (Hiroshi SHIBATA) wrote in #note-8:

Unicode 16.0 has been released.

Should we move to this instead of 15.1?

I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.

Updated by hsbt (Hiroshi SHIBATA) 7 months ago

I think it's more prudent to do 15.1 first, then 16.0.

Agreed, thanks!

Updated by hsbt (Hiroshi SHIBATA) 6 months ago

  • Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added

Updated by ima1zumi (Mari Imaizumi) 3 months ago

@duerst (Martin Dürst)

I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.

Updated by naruse (Yui NARUSE) 12 days ago

The change looks good to me.
Since you have already contributed to reline and shown your engineering skill, and now also want to contribute to ruby/ruby, I think you should have commit rights for ruby/ruby and commit this change yourself.

@matz (Yukihiro Matsumoto) What do you think?

Updated by ima1zumi (Mari Imaizumi) 12 days ago

@naruse (Yui NARUSE)
Thank you so much for your review and recommending me. I’d be happy to take on commit rights and commit this change myself.

Updated by mame (Yusuke Endoh) 12 days ago

I'd also like to introduce ima1zumi-san as a candidate for committer. She has been actively working on irb and reline, has deep knowledge and a strong interest in character encoding, and is highly recognized, as she was endorsed by @naruse (Yui NARUSE), the maintainer of Ruby's encoding system. With her contributions extending towards Ruby itself, I support her nomination.

Updated by hsbt (Hiroshi SHIBATA) 8 days ago

Thanks, I've finished preparing your account now.

Updated by ima1zumi (Mari Imaizumi) 8 days ago

  • Status changed from Assigned to Closed

Applied in changeset git|e63c516046b6dbf2f684454b68013b4eea12e94a.


[Feature #19908] Update Unicode headers to 15.1.0
