Feature #1889

Teach Oniguruma Unicode 5.0 Character Properties

Added by Run Paint Run Run over 4 years ago. Updated almost 3 years ago.

[ruby-core:24775]
Status:Closed
Priority:Low
Assignee:Yui NARUSE
Category:M17N
Target version:1.9.2

Description

=begin
Oniguruma understands named category properties, such that

0x012c.chr('utf-8')
=> "Ĭ"
0x012c.chr('utf-8') =~ /\p{Lu}/
=> 0

By my reckoning there are about 3,000 characters in the current UnicodeData.txt that it doesn't have property mappings for. For example: U+AA59 (CHAM DIGIT NINE) is in the Nd category (http://unicode.org/cldr/utility/character.jsp?a=AA59) yet:

puts 0xaa59.chr('utf-8')

=> nil
0xaa59.chr('utf-8') =~ /\p{Nd}/
=> nil

I've attached two patches for the two categories I've updated in the hope that somebody familiar with the code can either tell me I'm on the right track, or explain a better approach. :-) If they look OK I'll try adding the remainder.

(The diffs are a bit noisy because I tried to retain the original ordering and layout of the code).
=end
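The check in the description can be repeated on any Ruby build with a few lines. This is only a sketch: `matches_category?` is a made-up helper name, not part of Ruby or Oniguruma, and the result for U+AA59 depends on which Unicode version the running Ruby's tables cover.

```ruby
# Hypothetical helper (not part of Ruby or Oniguruma): true when the
# character at +codepoint+ matches the \p{<category>} property.
def matches_category?(codepoint, category)
  char = [codepoint].pack('U') # encode the codepoint as a UTF-8 string
  !!(char =~ /\A\p{#{category}}\z/)
end

puts matches_category?(0x012c, 'Lu') # LATIN CAPITAL LETTER I WITH BREVE
puts matches_category?(0xaa59, 'Nd') # CHAM DIGIT NINE, the case reported here
```

A `false` on the second line would indicate the stale-table problem this ticket describes.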

0001-unicode.c-Add-U-2064-INVISIBLE-PLUS-to-Cf-categor.patch (1.2 KB) Run Paint Run Run, 08/05/2009 03:34 PM

0002-enc-unicode.c-Add-U-A788-MODIFIER-LETTER-LOW-CIRCU.patch (710 Bytes) Run Paint Run Run, 08/05/2009 03:34 PM

History

#1 Updated by Yui NARUSE over 4 years ago

  • Status changed from Open to Assigned
  • Assignee set to Yukihiro Matsumoto

=begin
First, we should decide which Unicode version to support.
After that, we can discuss whether to update it or not.
=end

#2 Updated by Yukihiro Matsumoto over 4 years ago

=begin
Hi,

In message "Re: Feature #1889 Teach Oniguruma Unicode 5.0 Character Properties"
on Sat, 15 Aug 2009 16:13:02 +0900, Yui NARUSE redmine@ruby-lang.org writes:

|First, we should decide which Unicode version to support.
|After that, we can discuss whether to update it or not.

\p category information should be updated to the latest as long as our
resource allows.

                        matz.

=end

#3 Updated by Run Paint Run Run over 4 years ago

=begin

|First, we should decide which Unicode version to support.
|After that, we can discuss whether to update it or not.

\p category information should be updated to the latest as long as our
resource allows.

I have a trivial script that parses UnicodeData.txt and looks for
properties unrecognized by Oniguruma. If there's an automated process
to update unicode.c, I'll provide the raw data. If, as I suspect, we
need to update unicode.c by hand, adding in each codepoint
individually, I guess I volunteer... In that case, could I get
confirmation that my original patches are along the right lines? If
they are I'll handle this the week after next.

=end

#4 Updated by Run Paint Run Run over 4 years ago

=begin

I have a trivial script that parses UnicodeData.txt and looks for
properties unrecognized by Oniguruma. If there's an automated process
to update unicode.c, I'll provide the raw data. If, as I suspect, we
need to update unicode.c by hand, adding in each codepoint
individually, I guess I volunteer...

If that’s truly the case, wouldn’t you rather volunteer to make it so
that unicode.c doesn’t need to be updated manually?

I would, but I doubt I'm smart enough. ;-) I looked around a while
back at how various languages implemented access to Unicode character
metadata, and the results were universally ugly. I saw one, an older
version of the Python approach, IIRC, which consisted of a switch
statement in excess of a thousand lines... It's a classic case of
trading clarity for performance. UnicodeData.txt is 1.1MB, yet the
general use case requires lightning fast lookup. Every approach I've
seen came to the conclusion: the best they could do was automatically
generate array or bit vector literals of the entire database.

Likewise, all I can suggest is that we move the codepoint lists to a
separate file, so we can at least generate that automatically. Again,
though, I don't feel qualified to undertake that by myself. It would
require benchmarking the internals of a regexp library and
understanding the source in sufficient detail to reason about
optimizations...

I haven't looked at the code in a while, but from what I recall each
property is represented by a constant array whose values are the
codepoints in hex, ordered by ordinal value. I suspect I'll start by
writing a test suite in the Mini Unit style that simply iterates over
a local copy of UnicodeData.txt and checks that the given property
matches with Oniguruma. I'll take a note of the failures (i.e. the new
characters that aren't yet enumerated in unicode.c). Then, for each
property I'll generate a list, adhering to the current format in
unicode.c, of all codepoints with that property, that I can paste in
over the top of the current list. I can then re-run the test suite to
check I haven't made any mistakes. If that works, it shouldn't be too
difficult.

=end
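The test plan above (walk UnicodeData.txt and check that each character matches its own category) can be sketched roughly as follows. This is an illustration, not the script that was eventually written: `unmatched_codepoints` is a hypothetical name, the two sample lines stand in for a local copy of the file, and range entries (the `<..., First>` / `<..., Last>` pairs) are only checked at their endpoints.

```ruby
# Hypothetical helper: returns [codepoint, category] pairs for every
# UnicodeData.txt line whose character is NOT matched by \p{<category>}.
def unmatched_codepoints(lines)
  lines.each_with_object([]) do |line, failures|
    fields = line.chomp.split(';')
    next if fields.size < 3

    codepoint = fields[0].hex # field 0: codepoint in hex
    category  = fields[2]     # field 2: General_Category, e.g. "Nd"
    next if category == 'Cs'  # surrogates cannot be encoded as UTF-8

    char = [codepoint].pack('U')
    failures << [codepoint, category] unless char =~ /\A\p{#{category}}\z/
  end
end

# Two lines in UnicodeData.txt format, used here instead of the real file.
sample = [
  '012C;LATIN CAPITAL LETTER I WITH BREVE;Lu;0;L;0049 0306;;;;N;LATIN CAPITAL LETTER I BREVE;;;012D;',
  'AA59;CHAM DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;'
]
p unmatched_codepoints(sample)
```

An empty list means the regexp engine's tables agree with the data for every line checked; anything else is a candidate for the new tables.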

#5 Updated by Yui NARUSE over 4 years ago

  • Assignee changed from Yukihiro Matsumoto to Yui NARUSE

=begin

In that case, could I get confirmation that my original patches are along the right lines?
If they are I'll handle this the week after next.

CR_* are structured as {length, [from, to]*}.
So your patch should look like the following.

 0x309d, 0x309e,
 0x30fc, 0x30fe,
 0xa015, 0xa015,
+0xa788, 0xa788,
 0xff70, 0xff70,
 0xff9e, 0xff9f
 }; /* CR_Lm */
=end
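To make the {length, [from, to]*} layout concrete, here is a small sketch that turns a list of codepoints into that shape. `to_code_ranges` is a hypothetical name and the input is just a handful of the Lm values quoted above, not the full table.

```ruby
# Hypothetical helper: collapse codepoints into the layout described above,
# i.e. [number_of_ranges, from1, to1, from2, to2, ...].
def to_code_ranges(codepoints)
  # Split the sorted list wherever two neighbours are not consecutive.
  runs  = codepoints.sort.uniq.slice_when { |a, b| b != a + 1 }
  pairs = runs.map { |run| [run.first, run.last] }
  [pairs.size, *pairs.flatten]
end

table = to_code_ranges([0x309d, 0x309e, 0xa788, 0xff9e, 0xff9f])
puts table.map { |v| format('0x%04x', v) }.join(', ')
```

A single codepoint such as U+A788 becomes a range whose `from` and `to` are equal, which is why the corrected patch lists `0xa788` twice.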

#6 Updated by Yui NARUSE over 4 years ago

=begin

I looked around a while back at how various languages implemented access to
Unicode character metadata, and the results were universally ugly.

Ruby 1.9's implementation is in onigenc_unicode_is_code_ctype in unicode.c and related routines.
The core routine is onig_is_in_code_range in regcomp.c (this will be fast once CPUs and compilers are clever enough to do the work in parallel).

Likewise, all I can suggest is that we move the codepoint lists to a
separate file, so we can at least generate that automatically.

It seems nice.
If the script which generates a list from UnicodeData.txt runs with miniruby or Ruby 1.8 (baseruby),
we can merge it.
=end
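For readers who have not seen the C code, a lookup over a {length, [from, to]*} table can be done with a binary search over the ranges. The Ruby below is only a sketch of that idea under the assumption that the ranges are sorted and non-overlapping; `in_code_range?` is a hypothetical name and this is not a transcription of the routine in regcomp.c.

```ruby
# Hypothetical sketch: binary search over a table laid out as
# [n, from1, to1, ..., fromN, toN] with sorted, non-overlapping ranges.
def in_code_range?(table, code)
  n = table[0]
  low = 0
  high = n
  while low < high
    mid = (low + high) / 2
    if code > table[2 * mid + 2] # code lies above the 'to' of range mid
      low = mid + 1
    else
      high = mid
    end
  end
  # low is now the first range whose 'to' is >= code (if any).
  low < n && code >= table[2 * low + 1]
end

lm_excerpt = [3, 0x309d, 0x309e, 0xa788, 0xa788, 0xff9e, 0xff9f]
puts in_code_range?(lm_excerpt, 0xa788)
puts in_code_range?(lm_excerpt, 0xa789)
```

Because each property is a sorted list of ranges rather than a list of single codepoints, the search cost grows with the number of ranges, not the number of characters.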

#7 Updated by Run Paint Run Run over 4 years ago

=begin
Yui,

Thanks for your help. :-) I've written a script (http://github.com/runpaint/onig/tree/master) which parses UnicodeData.txt to create the consts for the property mappings. It runs on 1.8.7 and miniruby. The output is at http://gist.github.com/169862 .

IIRC, static consts only have the scope of the file they're declared in. If we're going to move the property consts to a new file, how do you want them declared? If we're to keep the data and logic in the same file then I could produce one patch per property table so as to keep the diffs somewhat readable. Your call. :-)

I haven't tried to update the non-property consts because they mostly rely on other data files. Once we decide how to arrange the consts I'll expand the script to generate the remainder.
=end

#8 Updated by Yui NARUSE over 4 years ago

=begin
My intention is like following:

diff --git a/enc/unicode.c b/enc/unicode.c
index 2dfcbba..f8fd25e 100644
--- a/enc/unicode.c
+++ b/enc/unicode.c
@@ -3863,3820 +3863,7 @@ static const OnigCodePoint CR_Assigned[] = {
0x100000, 0x10fffd
}; /* CR_Assigned */

-/* 'C': Major Category */
-static const OnigCodePoint CR_C[] = {
- 422,
- 0x0000, 0x001f,

- 0x3000, 0x3000
-}; /* CR_Zs */
+#include "unicode.h"
+#include "unicode.h"

/* 'Arabic': Script */
static const OnigCodePoint CR_Arabic[] = {

diff --git a/unicode.h b/unicode.h
new file mode 100644
index 0000000..3758c22
--- /dev/null
+++ b/unicode.h
@@ -0,0 +1,4002 @@
+/* 'C': Major Category */
+static const OnigCodePoint CR_C[] = {
+ 25,
+ 000000, 0x001f,

+ 0x3000, 0x3000,
+}; /* CR_Zs */

unicode.h is the same as your http://gist.github.com/169862.

And I confirmed /\p{Nd}/=~"\uAA59" works with this.
Great!
=end

#9 Updated by Yukihiro Matsumoto over 4 years ago

=begin
Hi,

In message "Re: [Feature #1889] Teach Oniguruma Unicode 5.0 Character Properties"
on Wed, 19 Aug 2009 11:57:12 +0900, Yui NARUSE redmine@ruby-lang.org writes:

|My intention is like following:

|--- a/enc/unicode.c
|+++ b/enc/unicode.c

Can you check in?

                        matz.

=end

#10 Updated by Yui NARUSE over 4 years ago

=begin
Yukihiro Matsumoto wrote:

In message "Re: [Feature #1889] Teach Oniguruma Unicode 5.0 Character Properties"
on Wed, 19 Aug 2009 11:57:12 +0900, Yui NARUSE redmine@ruby-lang.org writes:

|My intention is like following:

|--- a/enc/unicode.c
|+++ b/enc/unicode.c

Can you check in?

I'll check in when all CR_* are updated.

--
NARUSE, Yui naruse@airemix.jp

=end

#11 Updated by Run Paint Run Run over 4 years ago

=begin
How does http://gist.github.com/170542 look? That's the categories from UnicodeData.txt, the scripts from Scripts.txt, and the POSIX character classes. (The new parser script is still at http://github.com/runpaint/onig/tree/master).

I have used http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt for definitions of the POSIX classes. One that stands out as wrong is :Cntrl:. By the definition in RE.txt it encompasses all members of the C category, but the current CR_C const is markedly different from the CR_Cntrl one. This is what I have ATM:

# TODO: Double check this definition. It appears to encompass the entire C
# category, but currently the CR blocks for C and Cntrl are markedly different
# cntrl    Control | Format | Unassigned | Private_Use | Surrogate
data['Cntrl'] = data['Cc'] + data['Cf'] + data['Cn'] + data['Co'] +
                data['Cs']

I'm defining Cn as any character in the Unicode range that does not appear in UnicodeData.txt. Any insights into how this class is defined?

There are 15 new scripts there, e.g. 'Vai'. These will need to be added to the '#ifdef USE_UNICODE_PROPERTIES' section, starting on line 10632, and the similar section starting on line 10507. For the former, what does the final digit in the row signify? For example, in the following what does 8 mean?

{ (UChar* )"Ethiopic", 69, 8 },
=end
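The working definition of Cn used above (every codepoint in the Unicode range that UnicodeData.txt does not list) amounts to taking the gaps between the assigned ranges. A rough sketch, assuming the assigned codepoints have already been collapsed into sorted [from, to] pairs; `unassigned_ranges` is a hypothetical name and the sample input is a toy, not real data.

```ruby
# Hypothetical helper: given sorted [from, to] pairs of assigned codepoints,
# return the gaps between them (up to +max+) as [from, to] pairs.
def unassigned_ranges(assigned, max = 0x10FFFF)
  gaps = []
  next_free = 0 # lowest codepoint not yet known to be assigned
  assigned.each do |from, to|
    gaps << [next_free, from - 1] if from > next_free
    next_free = [next_free, to + 1].max
  end
  gaps << [next_free, max] if next_free <= max
  gaps
end

# Toy example with a tiny "code space" of 0x00..0x40.
p unassigned_ranges([[0x00, 0x1f], [0x30, 0x39]], 0x40)
```

A real generator would also have to expand the `First`/`Last` range markers in UnicodeData.txt before computing the gaps, otherwise large assigned blocks would be reported as unassigned.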

#12 Updated by Nobuyoshi Nakada over 4 years ago

=begin
Hi,

At Thu, 20 Aug 2009 03:58:22 +0900,
Run Paint Run Run wrote in :

There are 15 new scripts there, e.g. 'Vai'. These will need to be added to the '#ifdef USE_UNICODE_PROPERTIES' section, starting on line 10632, and the similar section starting on line 10507. For the former, what does the final digit in the row signify? For example, in the following what does 8 mean?

{ (UChar* )"Ethiopic", 69, 8 },

It is the length of the name. But I'd like to make them a
perfect hash.

--
Nobu Nakada

=end
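Since the last field is just the length of the name, rows for the new scripts could be emitted by a generator rather than counted by hand. A minimal sketch: `property_row` is a hypothetical name, and the middle number (69 in the quoted row) is passed in as-is because the thread does not say how it is assigned.

```ruby
# Hypothetical generator for rows shaped like the one quoted above:
# { (UChar* )"<name>", <ctype number>, <length of name> },
def property_row(name, ctype)
  format('{ (UChar* )"%s", %d, %d },', name, ctype, name.bytesize)
end

puts property_row('Ethiopic', 69)
```

For 'Ethiopic' this reproduces the row from the question, with 8 as the computed name length.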

#13 Updated by Yui NARUSE over 4 years ago

  • Target version set to 1.9.2

=begin

One that stands out as wrong is :Cntrl:.

I think CR_Cntrl is correct.
http://unicode.org/reports/tr18/

What do you think, matz?

I'm defining Cn as any character in the Unicode range that does not appear in UnicodeData.txt.
Any insights into how this class is defined?

Scripts.txt says the same thing.

# All code points not explicitly listed for Script
# have the value Unknown (Zzzz).

# @missing: 0000..10FFFF; Unknown

http://www.unicode.org/Public/UNIDATA/Scripts.txt
=end

#14 Updated by Yui NARUSE over 4 years ago

=begin
I updated your script and uploaded it to http://github.com/nurse/onig/tree/master

I also published my fork of Ruby, which has this change applied.
http://github.com/nurse/ruby/tree/onig-unicode
=end

#15 Updated by Run Paint Run Run over 4 years ago

=begin

I updated your script and uploaded it to http://github.com/nurse/onig/tree/master

I also published my fork of Ruby, which has this change applied.
http://github.com/nurse/ruby/tree/onig-unicode

Thanks! :-) I'm traveling at the moment, but should be able to have a
look tomorrow at the Ruby changes. Is there anything left to be done
here?

=end

#16 Updated by Martin Dürst over 4 years ago

=begin
I fully agree. One could even go as far as having a policy to use the
Unicode beta versions (5.2 at this time;
http://www.unicode.org/versions/Unicode5.2.0/) for trunk and the latest
Unicode stable version (currently 5.1;
http://www.unicode.org/versions/Unicode5.1.0/) for the stable branches.

Regards, Martin.

On 2009/08/17 7:20, Yukihiro Matsumoto wrote:

Hi,

In message "Re: Feature #1889 Teach Oniguruma Unicode 5.0 Character Properties"
on Sat, 15 Aug 2009 16:13:02 +0900, Yui NARUSE redmine@ruby-lang.org writes:

|First, we should decide which Unicode version to support.
|After that, we can discuss whether to update it or not.

\p category information should be updated to the latest as long as our
resource allows.

                     matz.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#17 Updated by Yui NARUSE over 4 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

=begin
Applied in changeset r24651.
=end

#18 Updated by Yui NARUSE over 4 years ago

=begin
I applied this change, thanks.

I'll apply the latest stable Unicode data, because tracking beta versions needs more resources.
=end

#19 Updated by Martin Dürst over 4 years ago

=begin
On 2009/08/26 2:06, Yui NARUSE wrote:

Issue #1889 has been updated by Yui NARUSE.

I applied this change, thanks.

I'll apply the latest stable Unicode data, because tracking beta versions needs more resources.

Understood. I propose to wait for the maintainers of more stable
releases to integrate this patch, and then I will take over the
responsibility of upgrading trunk to 5.2 beta and keep it reasonably
updated.

Regards, Martin.


http://redmine.ruby-lang.org/issues/show/1889

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#20 Updated by Yui NARUSE over 4 years ago

=begin
I see.
The ruby_1_9_2 release branch will be created soon.
=end

#21 Updated by Martin Dürst over 4 years ago

=begin
Hello Yui, others,

[I'd really like to hear from Yugui, because she is responsible for
1.9.1 and 1.9.2.]

On 2009/08/26 18:46, Yui NARUSE wrote:

Issue #1889 has been updated by Yui NARUSE.

I see.
The ruby_1_9_2 release branch will be created soon.

Good point. According to , there is a feature freeze on
Sept. 25. The release of Unicode 5.2 (final!) is planned for October
2009 (see http://www.unicode.org/versions/beta.html).

[My personal guess is that this might happen in the week of October 12,
you can guess the reason for why I guess this date at
http://www.unicodeconference.org/. This would be before release
candidate 1 of 1.9.2.]

Last year, additions of transcodings (in essence just more data) were
allowed even after the feature freeze. In my view, moving to the latest
stable Unicode data version is very similar. Another way to think about
it is that it's possible to include Unicode 5.2 beta in 1.9.2 while
1.9.2 is not yet final. This runs the risk that we have to move back
from Unicode 5.2 to Unicode 5.1 if Unicode 5.2 doesn't go final before
December or so, but I consider this risk very low (the Unicode
consortium has an extremely well established release process). On the
other hand, I consider the fact that a final Ruby release contains the
latest stable Unicode data a big plus, both for usage and for
'marketing'. Also, if I were the maintainer of one of the 'earlier'
branches, I would try to follow stable Unicode versions, too.

So my proposal would be:
- Stay with Unicode 5.1 to allow maintainers of 1.9.1 (and below) to
update to latest stable Unicode version.
- Move to Unicode 5.2 (beta) for trunk and 1.9.2.
- Update trunk (and 1.9.2) whenever Unicode 5.2 (beta) gets updated.
- Update trunk (and 1.9.2, 1.9.1 (and below)) to Unicode 5.2 when
Unicode 5.2 goes final.

My main concern currently would be that, as far as I understand, not all
properties are currently automatically updated. But I think that can be
fixed by September 25th.

Regards, Martin.



--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#22 Updated by Martin Dürst over 4 years ago

=begin
In this context, please also see
http://www.unicode.org/mail-arch/unicode-ml/y2009-m08/0207.html, which
says (From: announcements@unicode.org; Date: Wed Aug 26 2009 - 19:48:45
CDT):

The data files in the Unicode Character Database for Unicode 5.2 have
been revised to include all of the authorized changes from the last UTC
meeting. If you use any of the Unicode data in your implementations,
please update a test version of your implementation to use those files
and run your tests. If there are any showstopper bugs, please report
them (using http://www.unicode.org/reporting.html) as soon as possible.

From this point, the only adjustments that will be made to the data
will be on the basis of showstopper bugs, including bugs uncovered in
the process of updating the Unicode Collation data files for UCA 5.2.

This means:
a) Unicode 5.2 is close to being ready for release in October.
b) Implementations (such as Ruby) should test the data.

Regards, Martin.

=end

#23 Updated by Yui NARUSE over 4 years ago

=begin
Thank you for the information.
I tested the data, and found and fixed a bug in Oniguruma. (r24677)

This bug comes from the original Oniguruma.
Oniguruma limited the maximum length of a property name to 20.
This causes a bug with Unicode 5.2.

If you update data for Ruby, update enc/unicode/UnicodeData.txt and Scripts.txt, and run
ruby tool/enc-unicode.rb enc/unicode/UnicodeData.txt enc/unicode/Scripts.txt > enc/unicode/name2ctype.kwd

So my proposal would be:
- Stay with Unicode 5.1 to allow maintainers of 1.9.1 (and below) to
update to latest stable Unicode version.

This depends on Yugui's policy, but 1.9.1 seems likely to be left as is.

- Move to Unicode 5.2 (beta) for trunk and 1.9.2.

I tried /UnicodeData-5.2.0d12.txt and Scripts-5.2.0d13.txt, and it works.
So if Unicode 5.2 is released in October, this is an available option.

- Update trunk (and 1.9.2) whenever Unicode 5.2 (beta) gets updated.

If Ruby 1.9.2 is on Unicode 5.2, I agree with this.

- Update trunk (and 1.9.2, 1.9.1 (and below)) to Unicode 5.2 when
Unicode 5.2 goes final.

If Ruby 1.9.2 is on Unicode 5.2, I agree with this, except for 1.9.1.
=end

#24 Updated by Run Paint Run Run over 4 years ago

=begin
It appears that http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt did not accurately reflect the behaviour of Oniguruma before this patch. For example, :word: used to match No and Nl characters, but now it doesn't. :print:, :graph:, and :cntrl: used to match private-use and format characters; now they don't. It's an easy fix either way, but it would be nice to have the specs agree with reality.
=end

#25 Updated by Yui NARUSE over 4 years ago

=begin
RE.txt is for original Oniguruma, not for Ruby 1.9's regexp.
We may need our own document.

I think, POSIX properties should follow UTS18.
If you have a patch, I'll merge it.
=end

#26 Updated by Run Paint Run Run over 4 years ago

=begin

RE.txt is for original Oniguruma, not for Ruby 1.9's regexp.
We may need our own document.

Absolutely. :-) Can I open a ticket?

I think, POSIX properties should follow UTS18.

UTS18 defines :word: as: \p{alpha}, \p{gc=Mark}, \p{digit}, \p{gc=Connector_Punctuation}. Where 'digit' is defined elsewhere as Nd. So :word: shouldn't match No and Nl, which means that the current version is right, and the old wrong.

:print: is defined as \p{graph} \p{blank} -- \p{cntrl}, where '--' means 'set difference'.
:graph: is defined as [\p{space} \p{gc=Control} \p{gc=Surrogate} \p{gc=Unassigned}]. Private use characters are in the Co category, so neither :graph: nor :print: encompasses them. Format characters are in category Cf, so neither :graph: nor :print: includes them.

:cntrl: is defined as \p{gc=Control}, i.e. members of Cc (not C, as you may expect). This again excludes format and private-use characters.

So it appears that the patch was correct, and the previous Oniguruma was wrong. I'll leave it to you to decide whether this needs backporting. Sorry for the trouble.
=end

#27 Updated by Yui NARUSE over 4 years ago

=begin

RE.txt is for original Oniguruma, not for Ruby 1.9's regexp.
We may need our own document.

Absolutely. :-) Can I open a ticket?

Yes, please.

I don't think this change is a bugfix.
But whether to backport a patch or not depends on the stable maintainer.
If you want it backported, open a backport ticket for 1.9.1 and persuade yugui.
=end

#28 Updated by Yukihiro Matsumoto over 4 years ago

=begin
Hi,

In message "Re: [Feature #1889] Teach Oniguruma Unicode 5.0 Character Properties"
on Sun, 6 Sep 2009 01:46:46 +0900, Run Paint Run Run redmine@ruby-lang.org writes:

|It appears that http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt did not accurately reflect the behaviour of Oniguruma before this patch.

Our Oniguruma is a forked one. The original Oniguruma found at
geocities.jp has not been changed.

                        matz.

=end
