Feature #19908: Update to Unicode 15.1 - Ruby - Ruby Issue Tracking System

Feature #19908

closed

Update to Unicode 15.1

Added by nobu (Nobuyoshi Nakada) almost 2 years ago. Updated 5 months ago.

Status:

Closed

Assignee:

duerst (Martin Dürst)

Target version:

[ruby-core:114936]

Description

Unicode 15.1 has been released.

The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break (InCB) properties, which carry values.

I'm not sure how best to handle these properties.
Should it be /\p{InCB_Linker}/, or /\p{InCB=Linker}/ as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
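
To make the naming question concrete, here is a minimal sketch of reading property lines that carry a third "value" field. It is not the code in enc-unicode.rb; `parse_properties`, the `separator:` parameter, and the sample lines (written in the style of DerivedCoreProperties.txt) are assumptions made for illustration only.

```ruby
# Sample lines in the style of DerivedCoreProperties.txt (layout assumed).
SAMPLE = <<~DATA
  0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
  094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
  0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
DATA

# Hypothetical helper: maps a property name to its code point ranges.
# A line with a value field gets the name "<property><separator><value>".
def parse_properties(text, separator: "_")
  props = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    data = line.sub(/#.*/, "").strip        # drop the trailing comment
    next if data.empty?
    range, name, value = data.split(/\s*;\s*/)
    first, last = range.split("..").map { |hex| hex.to_i(16) }
    last ||= first                          # single code point, not a range
    key = value ? "#{name}#{separator}#{value}" : name
    props[key] << (first..last)
  end
  props
end

puts parse_properties(SAMPLE).keys.inspect
puts parse_properties(SAMPLE, separator: "=").keys.inspect
```

The only difference between the two spellings in this sketch is the separator used when building the key, so the choice is about naming consistency rather than parsing difficulty.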

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago

Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added

Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

Target version deleted (~~3.3~~)

#3 [ruby-core:115899]

Updated by duerst (Martin Dürst) over 1 year ago

There is a more serious issue than whether to use an '_' or an '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.

Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.

From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
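
As a rough way to observe the change described above, the following sketch segments a Devanagari conjunct with `String#grapheme_clusters` and with `/\X/`. The sample string is my own choice, not taken from this issue. Under the older rules the string is expected to split after the virama, while the 15.1 rules keep the conjunct together, so the result depends on the Unicode version the running Ruby was built with.

```ruby
require "rbconfig"

# KA (U+0915) + VIRAMA (U+094D) + SSA (U+0937): one conjunct in Devanagari.
conjunct = "\u0915\u094D\u0937"

clusters   = conjunct.grapheme_clusters  # segmentation via the String API
via_regexp = conjunct.scan(/\X/)         # segmentation via the regexp engine

unicode_version = RbConfig::CONFIG.fetch("UNICODE_VERSION", "unknown")
puts "Unicode data version: #{unicode_version}"
puts "Cluster count: #{clusters.size}"
puts "Code points per cluster: #{clusters.map { |c| c.codepoints.size }.inspect}"
```

Printing the count rather than asserting it keeps the snippet usable both before and after the update.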

#4 [ruby-core:115906]

Updated by duerst (Martin Dürst) over 1 year ago

@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?

Updated by duerst (Martin Dürst) over 1 year ago

Related to Bug #20150: Memory leak in grapheme clusters added

#6 [ruby-core:116056]

Updated by janosch-x (Janosch Müller) over 1 year ago

Isn't this the updated regular expression?

 ccs-base :=     [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend :=  [\p{M}\p{Join_Control}]
 extended_base :=       ccs-base
 | hangul-syllable
-crlf :=        CR LF
+crlf :=        CR LF | CR | LF
 legacy-core := hangul-syllable
 | ri-sequence
 | xpicto-sequence
 legacy-postcore :=    [Extend ZWJ]
 core :=        hangul-syllable
 | ri-sequence
 | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore :=    [Extend ZWJ SpacingMark]
 precore :=     Prepend
 hangul-syllable :=    L* (V+ | LV V* | LVT) T*
 | L+
 | T+
 xpicto-sequence :=     \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster :=     \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
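
The added conjunctCluster rule can be tried out in Ruby with explicit character sets standing in for the \p{InCB=...} properties. This is only a sketch under simplifying assumptions: the three sets below cover a handful of Devanagari code points chosen for illustration, not the real property data, and `conjunct_cluster?` is a hypothetical helper.

```ruby
# Simplified stand-ins for the InCB property values (Devanagari only, assumed).
consonant        = /[\u0915-\u0939]/        # KA..HA
linker           = /\u094D/                 # VIRAMA
extend_or_linker = /[\u093C\u200D\u094D]/   # NUKTA, ZWJ, VIRAMA

# Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER =
  /#{consonant}(?:#{extend_or_linker}*#{linker}#{extend_or_linker}*#{consonant})+/

# True when the whole string is a single conjunct cluster under the rule above.
def conjunct_cluster?(str)
  str.match?(/\A#{CONJUNCT_CLUSTER}\z/)
end

puts conjunct_cluster?("\u0915\u094D\u0937")        # KA VIRAMA SSA
puts conjunct_cluster?("\u0915\u094D\u200D\u0937")  # KA VIRAMA ZWJ SSA
puts conjunct_cluster?("\u0915\u0937")              # KA SSA, no linker
```

At least one Linker must sit between two Consonants, so a plain sequence of consonants does not form a conjunct, and neither does a consonant followed only by a virama.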

#7 [ruby-core:116099]

Updated by duerst (Martin Dürst) over 1 year ago

@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!

#8 [ruby-core:119128]

Updated by hsbt (Hiroshi SHIBATA) 11 months ago

Unicode 16.0 has been released.

https://www.unicode.org/versions/Unicode16.0.0/

Should we move to this instead of 15.1?

Updated by duerst (Martin Dürst) 11 months ago

Precedes Feature #20724: Update to Unicode 16.0 added

#10 [ruby-core:119130]

Updated by duerst (Martin Dürst) 11 months ago

hsbt (Hiroshi SHIBATA) wrote in #note-8:

Unicode 16.0 has been released.

Should we move to this instead of 15.1?

I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.

#11 [ruby-core:119131]

Updated by hsbt (Hiroshi SHIBATA) 11 months ago

I think it's more prudent to do 15.1 first, then 16.0.

Agreed, thanks!

#12

Updated by hsbt (Hiroshi SHIBATA) 11 months ago

Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added

#13 [ruby-core:120460]

Updated by ima1zumi (Mari Imaizumi) 7 months ago

@duerst (Martin Dürst)

I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.

#14 [ruby-core:121281]

Updated by mame (Yusuke Endoh) 5 months ago

@duerst (Martin Dürst) What do you think?

#15 [ruby-core:121291]

Updated by ima1zumi (Mari Imaizumi) 5 months ago

I have created a PR to update it.

https://github.com/ruby/ruby/pull/12798

#16 [ruby-core:121364]

Updated by naruse (Yui NARUSE) 5 months ago

The change looks good to me.
Since you have already contributed to reline and shown your engineering skill, and now you also want to contribute to ruby/ruby, I think you should have commit rights for ruby/ruby and commit this change yourself.

@matz (Yukihiro Matsumoto) What do you think?

#17 [ruby-core:121365]

Updated by ima1zumi (Mari Imaizumi) 5 months ago

@naruse (Yui NARUSE)
Thank you so much for your review and for recommending me. I’d be happy to take on commit rights and commit this change myself.

#18 [ruby-core:121366]

Updated by mame (Yusuke Endoh) 5 months ago

I'd also like to introduce ima1zumi-san as a candidate for committer. She has been actively working on irb and reline, has deep knowledge and a strong interest in character encoding, and is highly recognized, as she was endorsed by @naruse (Yui NARUSE), the maintainer of Ruby's encoding system. With her contributions extending towards Ruby itself, I support her nomination.

#19 [ruby-core:121367]