Feature #1889
Teach Onigurma Unicode 5.0 Character Properties
| Status: | Closed | Start date: | 08/05/2009 |
|---|---|---|---|
| Priority: | Low | Due date: | |
| Assignee: | | % Done: | 100% |
| Category: | M17N | | |
| Target version: | 1.9.2 | | |
Description
Oniguruma understands named category properties, such that:
>> 0x012c.chr('utf-8')
=> "Ĭ"
>> 0x012c.chr('utf-8') =~ /\p{Lu}/
=> 0
By my reckoning there are about 3,000 characters in the current UnicodeData.txt that it doesn't have property mappings for. For example: U+AA59 (CHAM DIGIT NINE) is in the Nd category (http://unicode.org/cldr/utility/character.jsp?a=AA59) yet:
>> puts 0xaa59.chr('utf-8')
꩙
=> nil
>> 0xaa59.chr('utf-8') =~ /\p{Nd}/
=> nil
I've attached two patches for the two categories I've updated in the hope that somebody familiar with the code can either tell me I'm on the right track, or explain a better approach. :-) If they look OK I'll try adding the remainder.
(The diffs are a bit noisy because I tried to retain the original ordering and layout of the code).
Associated revisions
Update Oniguruma's UnicodeData to 5.1.
* tool/enc-unicode.rb: added for generate name2ctype.kwd.
contributed by Run Paint Run Run [ruby-core:24775]
use like following:
ruby19 tool/enc-unicode.rb enc/unicode/UnicodeData.txt \
enc/unicode/Scripts.txt > enc/unicode/name2ctype.kwd
* enc/unicode.c (CodeRanges): move definitions to name2ctype.h.
* enc/unicode/name2ctype.h.blt, enc/unicode/name2ctype.kwd,
enc/unicode/name2ctype.src: updated to v5.1.
* enc/unicode/UnicodeData.txt, enc/unicode/Scripts.txt: added v5.1.
* Makefile.in: add rule to generate name2ctype.kwd from
UnicodeData.txt and Scripts.txt.
History
Updated by Yui NARUSE over 2 years ago
- Status changed from Open to Assigned
- Assignee set to Yukihiro Matsumoto
First, we should decide which Unicode version to support. After that, we can discuss whether to update it or not.
Updated by Yukihiro Matsumoto over 2 years ago
Hi,

In message "Re: [ruby-core:24928] [Feature #1889](Assigned) Teach Onigurma Unicode 5.0 Character Properties"
on Sat, 15 Aug 2009 16:13:02 +0900, Yui NARUSE <redmine@ruby-lang.org> writes:

|First, we should decide supporting Unicode version.
|After that, we can discuss about whether update it or not.

\p category information should be updated to the latest as long as our
resource allows.

matz.
Updated by Run Paint Run Run over 2 years ago
> |First, we should decide supporting Unicode version.
> |After that, we can discuss about whether update it or not.
>
> \p category information should be updated to the latest as long as our
> resource allows.

I have a trivial script that parses UnicodeData.txt and looks for properties unrecognized by Oniguruma. If there's an automated process to update unicode.c, I'll provide the raw data. If, as I suspect, we need to update unicode.c by hand, adding in each codepoint individually, I guess I volunteer...

In that case, could I get confirmation that my original patches are along the right lines? If they are, I'll handle this the week after next.
Updated by Run Paint Run Run over 2 years ago
>> I have a trivial script that parses UnicodeData.txt and looks for
>> properties unrecognized by Onigurma. If there's an automated process
>> to update unicode.c, I'll provide the raw data. If, as I suspect, we
>> need to update unicode.c by hand, adding in each codepoint
>> individually, I guess I volunteer...
>
> If that's truly the case, wouldn't you rather volunteer to make it so
> that unicode.c doesn't need to be updated manually?

I would, but I doubt I'm smart enough. ;-) I looked around a while back at how various languages implemented access to Unicode character metadata, and the results were universally ugly. I saw one, an older version of the Python approach, IIRC, which consisted of a switch statement in excess of a thousand lines...

It's a classic case of trading clarity for performance. UnicodeData.txt is 1.1MB, yet the general use case requires lightning-fast lookup. Every approach I've seen came to the same conclusion: the best they could do was automatically generate array or bit vector literals of the entire database.

Likewise, all I can suggest is that we move the codepoint lists to a separate file, so we can at least generate that automatically. Again, though, I don't feel qualified to undertake that by myself. It would require benchmarking the internals of a regexp library and understanding the source in sufficient detail to reason about optimizations...

I haven't looked at the code in a while, but from what I recall each property is represented by a constant array whose values are the codepoints in hex, ordered by ordinal value. I suspect I'll start by writing a test suite in the Mini Unit style that simply iterates over a local copy of UnicodeData.txt and checks that the given property matches with Oniguruma. I'll take a note of the failures (i.e. the new characters that aren't yet enumerated in unicode.c).

Then, for each property I'll generate a list, adhering to the current format in unicode.c, of _all_ codepoints with that property, which I can paste in over the top of the current list. I can then re-run the test suite to check I haven't made any mistakes. If that works, it shouldn't be too difficult.
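The test-suite idea described in this comment could be sketched roughly as follows. This is only an illustration, not the script the author later published: the method name `property_mismatches` and the constant `PROPERTY_REGEXPS` are made up here, and range entries in UnicodeData.txt (the "First"/"Last" pairs) are only checked at their end points.

```ruby
# Cache one compiled regexp per General_Category value, e.g. /\p{Nd}/.
PROPERTY_REGEXPS = Hash.new { |cache, cat| cache[cat] = Regexp.new("\\p{#{cat}}") }

# Return [codepoint, category] pairs that the regexp engine fails to match.
# `lines` is any enumerable of UnicodeData.txt lines; field 0 is the code
# point in hex and field 2 is the General_Category.
def property_mismatches(lines)
  lines.each_with_object([]) do |line, failures|
    fields = line.chomp.split(';')
    next if fields.size < 3
    codepoint = fields[0].hex
    category = fields[2]
    next if category == 'Cs' # surrogates are not valid in a UTF-8 string
    char = [codepoint].pack('U')
    failures << [codepoint, category] unless PROPERTY_REGEXPS[category] =~ char
  end
end

# Two sample lines in UnicodeData.txt format; a real run would pass
# File.foreach('UnicodeData.txt') instead.
sample = [
  '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
  'AA59;CHAM DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;'
]
property_mismatches(sample).each do |codepoint, category|
  puts format('U+%04X not matched by \p{%s}', codepoint, category)
end
```

Any line it prints names a character whose category the running Ruby's regexp engine does not yet know about.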
Updated by Yui NARUSE over 2 years ago
- Assignee changed from Yukihiro Matsumoto to Yui NARUSE
> In that case, could I get confirmation that my original patches are along the right lines?
> If they are I'll handle this the week after next.
The CR_* arrays are structured as {length, [from, to]*}.
So your patch should look like the following:
0x309d, 0x309e,
0x30fc, 0x30fe,
0xa015, 0xa015,
+ 0xa788, 0xa788,
0xff70, 0xff70,
0xff9e, 0xff9f
}; /* CR_Lm */
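To make the {length, [from, to]*} layout concrete, here is a small sketch that collapses a list of code points into inclusive ranges and prints them in that shape. The helper names `code_ranges` and `render_table` are hypothetical, and the sketch assumes the leading count is the number of [from, to] pairs rather than the number of array elements.

```ruby
# Collapse code points into sorted, inclusive [from, to] pairs,
# merging runs of consecutive values into a single pair.
def code_ranges(codepoints)
  codepoints.sort.uniq.each_with_object([]) do |cp, ranges|
    if ranges.last && ranges.last[1] + 1 == cp
      ranges.last[1] = cp # extend the current run
    else
      ranges << [cp, cp]  # start a new run
    end
  end
end

# Render the pairs as a C array literal in the layout shown above.
# Assumption: the first element is the number of pairs.
def render_table(name, codepoints)
  ranges = code_ranges(codepoints)
  rows = ranges.map { |from, to| format('  0x%04x, 0x%04x', from, to) }
  "static const OnigCodePoint CR_#{name}[] = {\n" \
    "  #{ranges.size},\n#{rows.join(",\n")}\n}; /* CR_#{name} */\n"
end

puts render_table('Example', [0x309d, 0x309e, 0xa788, 0xff9e, 0xff9f])
```

A single character such as U+A788 therefore appears as a pair with the same value twice, which is what the corrected patch line above does.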
Updated by Yui NARUSE over 2 years ago
> I looked around a while back at how various languages implemented access to
> Unicode character metadata, and the results were universally ugly.

Ruby 1.9's implementation is in onigenc_unicode_is_code_ctype in unicode.c and related routines. The core routine is onig_is_in_code_range in regcomp.c. (This will be fast once CPUs and compilers are clever enough to do the work in parallel.)

> Likewise, all I can suggest is that we move the codepoint lists to a
> separate file, so we can at least generate that automatically.

That seems nice. If the script which generates a list from UnicodeData.txt runs with miniruby or Ruby 1.8 (baseruby), we can merge it.
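For readers unfamiliar with range tables, the lookup can be pictured as a binary search over the sorted [from, to] pairs. The Ruby below is a sketch of that general technique under the {length, [from, to]*} layout described earlier in the thread; it is not a transcription of the C routine, and `in_code_range?` is a name chosen for this example.

```ruby
# table: [n, from0, to0, from1, to1, ...] with pairs sorted and non-overlapping.
# Returns true when `code` falls inside one of the inclusive ranges.
def in_code_range?(table, code)
  n = table[0]
  data = table[1..-1]
  low = 0
  high = n
  # Find the first range whose upper bound is >= code.
  while low < high
    mid = (low + high) / 2
    if code > data[mid * 2 + 1]
      low = mid + 1
    else
      high = mid
    end
  end
  # That range contains `code` only if its lower bound is <= code.
  low < n && code >= data[low * 2]
end

table = [3, 0x309d, 0x309e, 0xa788, 0xa788, 0xff9e, 0xff9f]
puts in_code_range?(table, 0xa788)
puts in_code_range?(table, 0xa789)
```

The cost is logarithmic in the number of ranges, which is why keeping the ranges sorted in the generated file matters.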
Updated by Run Paint Run Run over 2 years ago
Yui,

Thanks for your help. :-)

I've written a script (http://github.com/runpaint/onig/tree/master) which parses UnicodeData.txt to create the consts for the property mappings. It runs on 1.8.7 and `miniruby`. The output is at http://gist.github.com/169862 .

IIRC, static consts only have the scope of the file they're declared in. If we're going to move the property consts to a new file, how do you want them declared? If we're to keep the data and logic in the same file then I could produce one patch per property table so as to keep the diffs somewhat readable. Your call. :-)

I haven't tried to update the non-property consts because they mostly rely on other data files. Once we decide how to arrange the consts I'll expand the script to generate the remainder.
Updated by Yui NARUSE over 2 years ago
My intention is as follows:
diff --git a/enc/unicode.c b/enc/unicode.c
index 2dfcbba..f8fd25e 100644
--- a/enc/unicode.c
+++ b/enc/unicode.c
@@ -3863,3820 +3863,7 @@ static const OnigCodePoint CR_Assigned[] = {
0x100000, 0x10fffd
}; /* CR_Assigned */
-/* 'C': Major Category */
-static const OnigCodePoint CR_C[] = {
- 422,
- 0x0000, 0x001f,
<snip>
- 0x3000, 0x3000
-}; /* CR_Zs */
+#include "unicode.h"
/* 'Arabic': Script */
static const OnigCodePoint CR_Arabic[] = {
diff --git a/unicode.h b/unicode.h
new file mode 100644
index 0000000..3758c22
--- /dev/null
+++ b/unicode.h
@@ -0,0 +1,4002 @@
+/* 'C': Major Category */
+static const OnigCodePoint CR_C[] = {
+ 25,
+ 000000, 0x001f,
<snip>
+ 0x3000, 0x3000,
+}; /* CR_Zs */
unicode.h is the same as your http://gist.github.com/169862.
And I confirmed that /\p{Nd}/=~"\uAA59" works with this.
Great!
Updated by Yukihiro Matsumoto over 2 years ago
Hi,

In message "Re: [ruby-core:24975] [Feature #1889] Teach Onigurma Unicode 5.0 Character Properties"
on Wed, 19 Aug 2009 11:57:12 +0900, Yui NARUSE <redmine@ruby-lang.org> writes:

|My intention is like following:
|--- a/enc/unicode.c
|+++ b/enc/unicode.c

Can you check in?

matz.
Updated by Yui NARUSE over 2 years ago
Yukihiro Matsumoto wrote:
> In message "Re: [ruby-core:24975] [Feature #1889] Teach Onigurma Unicode 5.0 Character Properties"
> on Wed, 19 Aug 2009 11:57:12 +0900, Yui NARUSE <redmine@ruby-lang.org> writes:
>
> |My intention is like following:
>
> |--- a/enc/unicode.c
> |+++ b/enc/unicode.c
>
> Can you check in?

I'll check in when all CR_* are updated.

--
NARUSE, Yui <naruse@airemix.jp>
Updated by Run Paint Run Run over 2 years ago
How does http://gist.github.com/170542 look? That's the categories from UnicodeData.txt, the scripts from Scripts.txt, and the POSIX character classes. (The new parser script is still at http://github.com/runpaint/onig/tree/master).
I have used http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt for definitions of the POSIX classes. One that stands out as wrong is [[:Cntrl:]]. By the definition in RE.txt it encompasses all members of the C category, but the current CR_C const is markedly different from the CR_Cntrl one. This is what I have ATM:
# TODO: Double check this definition. It appears to encompass the entire C
# category, but currently the CR blocks for C and Cntrl are markedly different
# cntrl Control | Format | Unassigned | Private_Use | Surrogate
data['Cntrl'] = data['Cc'] + data['Cf'] + data['Cn'] + data['Co'] +
                data['Cs']
I'm defining Cn as any character in the Unicode range that does not appear in UnicodeData.txt. Any insights into how this class is defined?
There are 15 new scripts there, e.g. 'Vai'. These will need to be added to the '#ifdef USE_UNICODE_PROPERTIES' section, starting on line 10632, and the similar section starting on line 10507. For the former, what does the final digit in the row signify? For example, in the following what does 8 mean?
{ (UChar* )"Ethiopic", 69, 8 },
Updated by Nobuyoshi Nakada over 2 years ago
Hi,

At Thu, 20 Aug 2009 03:58:22 +0900, Run Paint Run Run wrote in [ruby-core:24984]:
> There are 15 new scripts there, e.g. 'Vai'. These will need to be added to the '#ifdef USE_UNICODE_PROPERTIES' section, starting on line 10632, and the similar section starting on line 10507. For the former, what does the final digit in the row signify? For example, in the following what does 8 mean?
>
> { (UChar* )"Ethiopic", 69, 8 },

It is the length of the name. But I'd like to make them a perfect hash.

--
Nobu Nakada
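Since the third field is just the length of the name, rows of this kind can be generated rather than typed by hand. A minimal sketch follows; `name_table_rows` is a hypothetical helper, and the ctype number used for 'Vai' is a placeholder rather than a value taken from unicode.c.

```ruby
# Build rows of the form { (UChar* )"Name", ctype, name_length },
# from a hash of property name => ctype index.
def name_table_rows(names_to_ctype)
  names_to_ctype.map do |name, ctype|
    format('{ (UChar* )"%s", %d, %d },', name, ctype, name.bytesize)
  end
end

# 69 for Ethiopic is the value quoted above; 999 for Vai is made up.
puts name_table_rows('Ethiopic' => 69, 'Vai' => 999)
```

Computing the length in the generator removes one way for the table to drift out of sync with the names.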
Updated by Yui NARUSE over 2 years ago
- Target version set to 1.9.2
> One that stands out as wrong is [[:Cntrl:]].

I think CR_Cntrl is correct.
http://unicode.org/reports/tr18/
What do you think, matz?

> I'm defining Cn as any character in the Unicode range that does not appear in UnicodeData.txt.
> Any insights into how this class is defined?

Scripts.txt says the same thing.

> # All code points not explicitly listed for Script
> # have the value Unknown (Zzzz).
> # @missing: 0000..10FFFF; Unknown

http://www.unicode.org/Public/UNIDATA/Scripts.txt
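Defining Cn as "everything not listed" amounts to taking the complement of the assigned ranges within 0..0x10FFFF. The following is a sketch of that computation, assuming the caller has already expanded UnicodeData.txt (including its First/Last range entries) into inclusive [from, to] pairs; `unassigned_ranges` and `MAX_CODEPOINT` are names invented for this example.

```ruby
MAX_CODEPOINT = 0x10FFFF

# Given inclusive [from, to] pairs of listed code points, return the
# inclusive pairs that are NOT covered, up to `max`.
def unassigned_ranges(assigned_ranges, max = MAX_CODEPOINT)
  gaps = []
  next_free = 0 # lowest code point not yet known to be assigned
  assigned_ranges.sort.each do |from, to|
    gaps << [next_free, from - 1] if from > next_free
    next_free = [next_free, to + 1].max
  end
  gaps << [next_free, max] if next_free <= max
  gaps
end

# Toy input, not real UnicodeData.txt contents.
toy = [[0x0000, 0x0377], [0x037a, 0x037e]]
unassigned_ranges(toy, 0x0400).each do |from, to|
  puts format('0x%04x, 0x%04x', from, to)
end
```

The same complement logic would apply to the Unknown script value described in the Scripts.txt header quoted above.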
Updated by Yui NARUSE over 2 years ago
I updated your script and uploaded it to http://github.com/nurse/onig/tree/master

I also published my fork of Ruby with this change applied:
http://github.com/nurse/ruby/tree/onig-unicode
Updated by Run Paint Run Run over 2 years ago
> I updated your script and uploaded on http://github.com/nurse/onig/tree/master
>
> And I published my fork of Ruby which is applied this change.
> http://github.com/nurse/ruby/tree/onig-unicode

Thanks! :-) I'm traveling at the moment, but should be able to have a look tomorrow at the Ruby changes. Is there anything left to be done here?
Updated by Martin Dürst over 2 years ago
I fully agree. One could even go as far as having a policy to use the Unicode beta versions (5.2 at this time; http://www.unicode.org/versions/Unicode5.2.0/) for trunk and the latest Unicode stable version (currently 5.1; http://www.unicode.org/versions/Unicode5.1.0/) for the stable branches.

Regards, Martin.

On 2009/08/17 7:20, Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: [ruby-core:24928] [Feature #1889](Assigned) Teach Onigurma Unicode 5.0 Character Properties"
> on Sat, 15 Aug 2009 16:13:02 +0900, Yui NARUSE<redmine@ruby-lang.org> writes:
>
> |First, we should decide supporting Unicode version.
> |After that, we can discuss about whether update it or not.
>
> \p category information should be updated to the latest as long as our
> resource allows.
>
> matz.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Updated by Yui NARUSE over 2 years ago
- Status changed from Assigned to Closed
- % Done changed from 0 to 100
Applied in changeset r24651.
Updated by Yui NARUSE over 2 years ago
I applied this change, thanks.

I'll apply the latest stable Unicode data, because tracking beta versions needs more resources.
Updated by Martin Dürst over 2 years ago
On 2009/08/26 2:06, Yui NARUSE wrote:
> Issue #1889 has been updated by Yui NARUSE.
>
> I applied this change, thanks.
>
> I'll apply latest stable Unicode Data because tracking Beta version needs more resource.

Understood. I propose to wait for the maintainers of more stable releases to integrate this patch, and then I will take over the responsibility of upgrading trunk to 5.2 beta and keep it reasonably updated.

Regards, Martin.

> ----------------------------------------
> http://redmine.ruby-lang.org/issues/show/1889

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Updated by Yui NARUSE over 2 years ago
I see. ruby_1_9_2 release branch will be created sooner.
Updated by Martin Dürst over 2 years ago
Hello Yui, others,

[I'd really like to hear from Yugui, because she is responsible for 1.9.1 and 1.9.2.]

On 2009/08/26 18:46, Yui NARUSE wrote:
> Issue #1889 has been updated by Yui NARUSE.
>
> I see.
> ruby_1_9_2 release branch will be created sooner.

Good point. According to [ruby-core:23977], there is a feature freeze on Sept. 25. The release of Unicode 5.2 (final!) is planned for October 2009 (see http://www.unicode.org/versions/beta.html).

[My personal guess is that this might happen in the week of October 12; you can guess my reason for this date at http://www.unicodeconference.org/. This would be before release candidate 1 of 1.9.2.]

Last year, additions of transcodings (in essence just more data) were allowed even after the feature freeze. In my view, moving to the latest stable Unicode data version is very similar. Another way to think about it is that it's possible to include Unicode 5.2 beta in 1.9.2 while 1.9.2 is not yet final. This runs the risk that we have to move back from Unicode 5.2 to Unicode 5.1 if Unicode 5.2 doesn't go final before December or so, but I consider this risk very low (the Unicode consortium has an extremely well established release process). On the other hand, I consider the fact that a final Ruby release contains the latest stable Unicode data a big plus, both for usage and for 'marketing'. Also, if I were the maintainer of one of the 'earlier' branches, I would try to follow stable Unicode versions, too.

So my proposal would be:
- Stay with Unicode 5.1 to allow maintainers of 1.9.1 (and below) to update to latest stable Unicode version.
- Move to Unicode 5.2 (beta) for trunk and 1.9.2.
- Update trunk (and 1.9.2) whenever Unicode 5.2 (beta) gets updated.
- Update trunk (and 1.9.2, 1.9.1 (and below)) to Unicode 5.2 when Unicode 5.2 goes final.

My main concern currently would be that, as far as I understand, not all properties are currently automatically updated. But I think that can be fixed by September 25th.

Regards, Martin.

> ----------------------------------------
> http://redmine.ruby-lang.org/issues/show/1889
>
> ----------------------------------------
> http://redmine.ruby-lang.org

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Updated by Martin Dürst over 2 years ago
In this context, please also see http://www.unicode.org/mail-arch/unicode-ml/y2009-m08/0207.html, which says (From: announcements@unicode.org; Date: Wed Aug 26 2009 - 19:48:45 CDT):

>>>>
The data files in the Unicode Character Database for Unicode 5.2 have been revised to include all of the authorized changes from the last UTC meeting. If you use any of the Unicode data in your implementations, please update a test version of your implementation to use those files and run your tests. If there are any showstopper bugs, please report them (using http://www.unicode.org/reporting.html) as soon as possible.

From this point, the only adjustments that will be made to the data will be on the basis of showstopper bugs, including bugs uncovered in the process of updating the Unicode Collation data files for UCA 5.2.
>>>>

This means:
a) Unicode 5.2 is close to being ready for release in October.
b) Implementations (such as Ruby) should test the data.

Regards, Martin.

On 2009/08/26 19:39, Martin J. Dürst wrote:
> Hello Yui, others,
>
> [I'd really like to hear from Yugui, because she is responsible for
> 1.9.1 and 1.9.2.]
>
> On 2009/08/26 18:46, Yui NARUSE wrote:
>> Issue #1889 has been updated by Yui NARUSE.
>>
>> I see.
>> ruby_1_9_2 release branch will be created sooner.
>
> Good point. According to [ruby-core:23977], there is a feature freeze on
> Sept. 25. The release of Unicode 5.2 (final!) is planned for October
> 2009 (see to http://www.unicode.org/versions/beta.html).
>
> [My personal guess is that this might happen in the week of October 12,
> you can guess the reason for why I guess this date at
> http://www.unicodeconference.org/. This would be before release
> candidate 1 of 1.9.2.]
>
> Last year, additions of transcodings (in essence just more data) were
> allowed even after the feature freeze. In my view, moving to the latest
> stable Unicode data version is very similar. Another way to think about
> it is that it's possible to include Unicode 5.2 beta in 1.9.2 while
> 1.9.2 is not yet final. This runs the risk that we have to move back
> from Unicode 5.2 to Unicode 5.1 if Unicode 5.2 doesn't go final before
> December or so, but I consider this risk very low (the Unicode
> consortium has an extremely well established release process). On the
> other hand, I consider the fact that a final Ruby release contains the
> latest stable Unicode data a big plus, both for usage and for
> 'marketing'. Also, if I were the maintainer of one of the 'earlier'
> branches, I would try to follow stable Unicode versions, too.
>
> So my proposal would be:
> - Stay with Unicode 5.1 to allow maintainers of 1.9.1 (and below) to
>   update to latest stable Unicode version.
> - Move to Unicode 5.2 (beta) for trunk and 1.9.2.
> - Update trunk (and 1.9.2) whenever Unicode 5.2 (beta) gets updated.
> - Update trunk (and 1.9.2, 1.9.1 (and below)) to Unicode 5.2 when
>   Unicode 5.2 goes final.
>
> My main concern currently would be that, as far as I understand, not all
> properties are currently automatically updated. But I think that can be
> fixed by September 25th.
>
> Regards, Martin.
>
>> ----------------------------------------
>> http://redmine.ruby-lang.org/issues/show/1889
>>
>> ----------------------------------------
>> http://redmine.ruby-lang.org

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Updated by Yui NARUSE over 2 years ago
Thank you for the information.

I tested the data, and found and fixed a bug in Oniguruma (r24677). This bug comes from the original Oniguruma, which limited the maximum length of a property name to 20. That causes a failure with Unicode 5.2.

To update the data for Ruby, update enc/unicode/UnicodeData.txt and Scripts.txt, and run:

ruby tool/enc-unicode.rb enc/unicode/UnicodeData.txt enc/unicode/Scripts.txt > enc/unicode/name2ctype.kwd

> So my proposal would be:
> - Stay with Unicode 5.1 to allow maintainers of 1.9.1 (and below) to
>   update to latest stable Unicode version.

This depends on Yugui's policy, but 1.9.1 seems likely to be left as is.

> - Move to Unicode 5.2 (beta) for trunk and 1.9.2.

I tried /UnicodeData-5.2.0d12.txt and Scripts-5.2.0d13.txt, and it works. So if Unicode 5.2 is released in October, this is an available option.

> - Update trunk (and 1.9.2) whenever Unicode 5.2 (beta) gets updated.

If Ruby 1.9.2 uses Unicode 5.2, I agree with this.

> - Update trunk (and 1.9.2, 1.9.1 (and below)) to Unicode 5.2 when Unicode 5.2 goes final.

If Ruby 1.9.2 uses Unicode 5.2, I agree with this, except for 1.9.1.
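A length limit like the one mentioned in this comment can be checked for ahead of time by scanning the data file for long names. The sketch below is an assumption-laden illustration: `script_names` and `names_over_limit` are hypothetical helpers, the sample lines merely imitate the Scripts.txt layout, and whether the engine's limit counts underscores is not something the thread states.

```ruby
# Extract the distinct script names from Scripts.txt-style lines such as
#   "0000..001F    ; Common # Cc  [32] <control>..<control>"
def script_names(lines)
  lines.map { |line| line.sub(/#.*/, '') } # drop trailing comments
       .map { |line| line.split(';')[1] }  # second field is the script name
       .compact
       .map(&:strip)
       .reject(&:empty?)
       .uniq
end

# Names longer than `limit` characters.
def names_over_limit(names, limit)
  names.select { |name| name.length > limit }
end

# Illustrative lines only; a real run would pass File.foreach('Scripts.txt').
sample = [
  '# comment line',
  '0000..001F    ; Common # Cc  [32] <control>..<control>',
  '10B40..10B55  ; Inscriptional_Parthian # Lo  [22]'
]
puts names_over_limit(script_names(sample), 20)
```

Running a check of this kind as part of the generator would surface such limits before the regenerated tables reach the regexp engine.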
Updated by Run Paint Run Run over 2 years ago
It appears that http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt did not accurately reflect the behaviour of Oniguruma before this patch. For example, [[:word:]] used to match No and Nl characters, but now it doesn't. [[:print:]], [[:graph:]], and [[:cntrl:]] used to match private-use and format characters; now they don't. It's an easy fix either way, but it would be nice to have the specs agree with reality.
Updated by Yui NARUSE over 2 years ago
RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document. I think, POSIX properties should follow UTS18. If you have a patch, I'll merge it.
Updated by Run Paint Run Run over 2 years ago
> RE.txt is for original Oniguruma, not for Ruby 1.9's regexp.
> We may need our own document.
Absolutely. :-) Can I open a ticket?
> I think, POSIX properties should follow UTS18.
UTS18 defines [[:word:]] as: \p{alpha}, \p{gc=Mark}, \p{digit}, \p{gc=Connector_Punctuation}. Where 'digit' is defined elsewhere as Nd. So [[:word:]] shouldn't match No and Nl, which means that the current version is right, and the old wrong.
[[:print:]] is defined as \p{graph} \p{blank} -- \p{cntrl}, where '--' means 'set difference'.
[[:graph:]] is defined as [^\p{space} \p{gc=Control} \p{gc=Surrogate} \p{gc=Unassigned}]. Private use characters are in the Co category, so neither [[:graph:]] nor [[:print:]] encompasses them. Format characters are in category Cf, so neither [[:graph:]] nor [[:print:]] includes them.
[[:cntrl:]] is defined as \p{gc=Control}, i.e. members of Cc (not C, as you may expect). This again excludes format and private-use characters.
So it appears that the patch was correct, and the previous Oniguruma was wrong. I'll leave it to you to decide whether this needs backporting. Sorry for the trouble.
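The three definitions quoted in this comment are plain set operations, so they can be tried out on toy data. The sketch below uses Ruby's `Set`; `posix_classes` is a hypothetical helper, the `space` and `blank` sets are supplied by the caller rather than derived, and the five code points are a made-up universe, not a claim about what any Ruby version matches.

```ruby
require 'set'

# universe: Set of all code points under consideration
# gc:       Hash of General_Category name => Set of code points
# space, blank: Sets supplied by the caller
def posix_classes(universe, gc, space, blank)
  empty = Set.new
  cntrl = gc.fetch('Cc', empty)                      # \p{gc=Control}
  graph = universe - space - cntrl -
          gc.fetch('Cs', empty) - gc.fetch('Cn', empty)
  printable = (graph | blank) - cntrl                # \p{graph} \p{blank} -- \p{cntrl}
  { 'cntrl' => cntrl, 'graph' => graph, 'print' => printable }
end

universe = Set[0x09, 0x20, 0x41, 0xAD, 0xE000]
gc = { 'Cc' => Set[0x09], 'Cf' => Set[0xAD], 'Co' => Set[0xE000] }
whitespace = Set[0x09, 0x20]

posix_classes(universe, gc, whitespace, whitespace).each do |name, members|
  puts "#{name}: #{members.map { |cp| format('U+%04X', cp) }.join(' ')}"
end
```

Read literally, the graph definition removes only space, Cc, Cs and Cn, so in this toy run the Cf and Co members remain in graph and print; that is worth comparing against UTS #18 itself before relying on the summary above.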
Updated by Yui NARUSE over 2 years ago
> > RE.txt is for original Oniguruma, not for Ruby 1.9's regexp.
> > We may need our own document.
>
> Absolutely. :-) Can I open a ticket?

Yes, please.

I don't think this change is a bugfix, but whether to backport a patch or not depends on the stable maintainer. If you want it backported, open a backport ticket for 1.9.1 and persuade yugui.
Updated by Yukihiro Matsumoto over 2 years ago
Hi,

In message "Re: [ruby-core:25413] [Feature #1889] Teach Onigurma Unicode 5.0 Character Properties"
on Sun, 6 Sep 2009 01:46:46 +0900, Run Paint Run Run <redmine@ruby-lang.org> writes:

|It appears that http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt did not accurately reflect the behaviour on Onigurma before this patch.

Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed.

matz.