Ruby Issue Tracking System
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5562 (2009-09-04T01:30:29Z) runpaint (Run Paint Run Run) runrun@runpaint.org
<ul></ul><p>=begin</p>
<blockquote>
<blockquote>
<p>If ICU is unfeasible, I'd appreciate understanding why.</p>
</blockquote>
</blockquote>
<blockquote>
<p>It converts everything to UTF-16 internally.</p>
</blockquote>
<p>Thank you, Nikolai. I understand how that would make its conversion or transliteration APIs problematic, but for property lookup/resolution would it not be easier to use its functions that accept a codepoint than write our own? Or is it just a downside of the CSI model that String methods can't work correctly with Unicode? (As ever, I apologise for using the BTS to ask questions, but I don't know where else I can).<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5565 (2009-09-04T03:35:57Z) naruse (Yui NARUSE) naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li></ul><p>=begin<br>
The main reason is resources.<br>
So far we have worked on the encoding-independent areas.</p>
<p>Anyway, even if we use the ICU library, we still have some problems:</p>
<ul>
<li>dependency</li>
<li>non-Unicode</li>
<li>where to implement</li>
</ul>
<p>First, the dependency on ICU seems to be a problem.<br>
ICU is a very portable library, but Ruby is also portable.<br>
We may run into trouble in some environments.<br>
Moreover, supporting the same version of Ruby against different versions of ICU<br>
will be hard.<br>
So having the core String class use ICU seems difficult.<br>
A bundled library seems more acceptable.</p>
<p>Second, non-Unicode encodings may be a problem.<br>
But regexps already work differently between Unicode and non-Unicode encodings.<br>
So it is acceptable that the methods of an ICU wrapper can't work with non-Unicode strings.</p>
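<p>The point that string operations already differ by encoding can be seen directly with core methods alone; a small sketch:</p>

```ruby
# The same bytes behave differently under different encodings.
utf8 = "あいう"                               # three UTF-8 characters
bin  = utf8.dup.force_encoding("ASCII-8BIT")  # same bytes, treated as binary

utf8.length          # => 3 (characters)
bin.length           # => 9 (bytes)
utf8.scan(/./).size  # => 3: a regexp sees characters in UTF-8
```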
<p>Third, where to implement this is maybe the largest problem.<br>
As I stated, implementing it in the String class is hard to accept.<br>
So the wrapper will be another class or module,<br>
and naming problems around its APIs will occur.<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5611 (2009-09-05T05:26:32Z) runpaint (Run Paint Run Run) runrun@runpaint.org
<ul></ul><p>=begin<br>
Thank you, naruse.</p>
<blockquote>
<p>First, the dependency on ICU seems to be a problem.<br>
ICU is a very portable library, but Ruby is also portable.<br>
We may run into trouble in some environments.</p>
</blockquote>
<p>If ICU were not available, could we not fall back to the current behaviour?</p>
<blockquote>
<p>Moreover, supporting the same version of Ruby against different versions of ICU<br>
will be hard.</p>
</blockquote>
<p>If Ruby tracks the stable release of ICU, is this likely to be a problem? I imagine that the main ICU updates come when a new version of Unicode is released.</p>
<blockquote>
<p>Second, non-Unicode encodings may be a problem.<br>
But regexps already work differently between Unicode and non-Unicode encodings.<br>
So it is acceptable that the methods of an ICU wrapper can't work with non-Unicode strings.</p>
</blockquote>
<p>Indeed. The fallback to ASCII-only semantics is always available, and mirrors how we handle non-ASCII-compatible encodings now.</p>
<blockquote>
<p>Third, where to implement this is maybe the largest problem.<br>
As I stated, implementing it in the String class is hard to accept.<br>
So the wrapper will be another class or module,<br>
and naming problems around its APIs will occur.</p>
</blockquote>
<p>Well, String would be ideal. We would have the current function and a Unicode function, then choose between them when the method was invoked.</p>
<p>If that isn't possible, has the idea of String::Unicode been considered? I assume it would be too major of a change.</p>
<p>Another approach would be a 'unicode' library in stdlib that when loaded monkey-patched String. I don't favour this, however, because it seems hacky; String should work transparently with Unicode, IMO.</p>
<p>I suppose there are two distinct questions here:</p>
<ol>
<li>Should String methods work correctly with Unicode by default?</li>
<li>If so, do we hand-roll a solution or use ICU?</li>
</ol>
<p>Assuming we answer (1) in the affirmative, I'd be surprised if re-implementing parts of ICU took fewer resources than using it directly.</p>
<p>Perhaps the next stage is to investigate what methods in String need to be changed, and to what extent?<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5623 (2009-09-05T18:39:28Z) naruse (Yui NARUSE) naruse@airemix.jp
<ul></ul><p>=begin</p>
<blockquote>
<p>If ICU were not available, could we not fall back to the current behaviour?</p>
</blockquote>
<p>No; if so, we lose the portability of scripts.<br>
Code would depend both on Ruby's version and on whether ICU is installed.<br>
Code written with ICU would behave differently without ICU.<br>
This would be a source of trouble.</p>
<blockquote>
<p>Well, String would be ideal.<br>
We would have the current function and a Unicode function,<br>
then choose between them when the method was invoked.</p>
</blockquote>
<p>Adding Unicode-sensitive functions to String may be acceptable.<br>
For example, if the library is loaded, String#unicode_to_i becomes available.</p>
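<p>String#unicode_to_i is only being proposed here, not a real Ruby method; as a purely illustrative sketch, such a loadable library might monkey-patch String along these lines (handling just fullwidth digits, one of many Unicode decimal-digit ranges):</p>

```ruby
# Hypothetical sketch only: Ruby has no String#unicode_to_i.
# A loadable "unicode" library could define it roughly like this,
# here covering only fullwidth digits (U+FF10..U+FF19) as an example.
class String
  FULLWIDTH_DIGITS =
    (0..9).map { |i| [(0xFF10 + i).chr(Encoding::UTF_8), i.to_s] }.to_h.freeze

  def unicode_to_i
    # Map fullwidth digits to their ASCII equivalents, then reuse to_i.
    gsub(/[０-９]/, FULLWIDTH_DIGITS).to_i
  end
end

"１２３".unicode_to_i  # => 123
"42".unicode_to_i      # => 42
```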
<blockquote>
<p>If that isn't possible, has the idea of String::Unicode been considered?<br>
I assume it would be too major of a change.</p>
</blockquote>
<p>This seems hard to accept.<br>
Ruby prefers large classes: don't split classes that are similar.</p>
<blockquote>
<p>Another approach would be a 'unicode' library in stdlib that when loaded monkey-patched String.<br>
I don't favour this, however, because it seems hacky; String should work transparently with Unicode, IMO.</p>
</blockquote>
<p>This is the approach 1.8's jcode.rb used.<br>
Of course, it's not good.</p>
<blockquote>
<p>I suppose there are two distinct questions here:</p>
<ol>
<li>Should String methods work correctly with Unicode by default?</li>
</ol>
</blockquote>
<p>NO.<br>
Ruby's core methods treat ASCII as special.<br>
For example, the variable naming rules (<a class="issue tracker-1 status-6 priority-4 priority-default closed" title="Bug: Cannot make constants using upper-case extended characters? (Rejected)" href="https://bugs.ruby-lang.org/issues/1853">#1853</a>), \s, \w, and \d in regexps, and so on.</p>
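<p>The ASCII-special behaviour of the regexp shorthand classes is easy to demonstrate in current Ruby; a sketch:</p>

```ruby
# \d is ASCII-only; matching non-ASCII digits needs a Unicode property.
"１２３" =~ /\d/      # => nil: fullwidth digits don't match \d
"１２３" =~ /\p{Nd}/  # => 0: \p{Nd} is the Unicode decimal-digit property
"123" =~ /\d/         # => 0
```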
<blockquote>
<ol start="2">
<li>If so, do we hand-roll a solution or use ICU?</li>
</ol>
</blockquote>
<p>I won't write one from scratch, but if Martin does, I don't object.<br>
How about it, Martin?</p>
<blockquote>
<p>Perhaps the next stage is to investigate what methods in String need to be changed, and to what extent?</p>
</blockquote>
<p>Matz said Ruby's core methods are ASCII-sensitive, not Unicode-sensitive.<br>
So existing methods won't change.<br>
Unicode-sensitive methods will be added as separate methods.<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5624 (2009-09-06T00:05:09Z) runpaint (Run Paint Run Run) runrun@runpaint.org
<ul></ul><p>=begin</p>
<blockquote>
<p>Adding Unicode sensitive functions to String may be accepted.<br>
For example if the library is loaded, String#unicode_to_i is available.</p>
</blockquote>
<p>Do you like that API? It feels clumsy to me. Users will have to change their code just to handle Unicode strings, when Ruby could make the determination automatically.</p>
<blockquote>
<p>Matz said Ruby's core methods are ASCII-sensitive, not Unicode-sensitive.<br>
So existing methods won't change.</p>
</blockquote>
<p>:-(</p>
<p>FWIW, Python uses <a href="http://hg.python.org/cpython/file/a31c1b2f4ceb/Tools/unicode/makeunicodedata.py" class="external">http://hg.python.org/cpython/file/a31c1b2f4ceb/Tools/unicode/makeunicodedata.py</a> to generate the C header files, e.g. <a href="http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodedata_db.h" class="external">http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodedata_db.h</a> and <a href="http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodename_db.h" class="external">http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodename_db.h</a>, with <a href="http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodedata.c" class="external">http://hg.python.org/cpython/file/a31c1b2f4ceb/Modules/unicodedata.c</a> as the API. IOW, a generalisation of the approach taken by the Oniguruma patch.</p>
<p>It would take me a while, but if Martin doesn't have the time, I may be able to produce something along those lines, albeit without the clever optimisations initially.<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5625 (2009-09-06T00:13:58Z) runpaint (Run Paint Run Run) runrun@runpaint.org
<ul></ul><p>=begin<br>
Another data point: Perl 6 optionally links against ICU (<a href="http://github.com/rakudo/rakudo/tree" class="external">http://github.com/rakudo/rakudo/tree</a>).<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5627 (2009-09-06T00:33:56Z) hramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
I cannot speak for Ruby core team but I personally do not like ICU.</p>
<p>It's not that I am against Unicode, but Unicode is only part of what<br>
Ruby aims to support, yet ICU is huge and written in C++, which would<br>
drastically enlarge the Ruby core and reduce its portability.</p>
<p>Better support for character classes outside of the ASCII range is<br>
certainly desirable but it requires more design and planning than<br>
"Hey, I saw ICU, it's cool, let's use it".</p>
<p>First the questions</p>
<ul>
<li>
<p>What exactly we want supported for what purposes?</p>
</li>
<li>
<p>What the cost would be?</p>
</li>
<li>
<p>Do we really want to pay that cost?</p>
</li>
</ul>
<p>have to be answered.</p>
<p>There are clearly multiple options, and I haven't gathered enough data<br>
to even form an opinion on what would be a good option for Ruby.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=5650 (2009-09-07T01:38:29Z) naruse (Yui NARUSE) naruse@airemix.jp
<ul></ul><p>=begin</p>
<blockquote>
<blockquote>
<p>Adding Unicode sensitive functions to String may be accepted.<br>
For example if the library is loaded, String#unicode_to_i is available.</p>
</blockquote>
<p>Do you like that API? It feels clumsy to me.</p>
</blockquote>
<p>No; I think it is clumsy too.</p>
<blockquote>
<p>Users will have to change their code just to handle Unicode strings, when<br>
Ruby could make the determination automatically.</p>
</blockquote>
<p>Those Unicode-sensitive methods would stay as they are for compatibility,<br>
even if Ruby could make the determination automatically.</p>
<blockquote>
<blockquote>
<p>Matz said Ruby's core methods are ASCII sensitive, are't Unicode.<br>
So exist methods won't change.</p>
</blockquote>
<p>:-(</p>
</blockquote>
<p>Ruby's core methods are for programmers.<br>
So naming collisions are resolved on the programmers' side.</p>
<blockquote>
<p>FWIW, Python uses .. as the API. IOW, a generalised approach of the Onigurma patch.</p>
</blockquote>
<p>Thanks,</p>
<blockquote>
<p>It would take me a while, but if Martin doesn't have the time, I may be able to<br>
produce something along those lines, albeit without the clever optimisations initially.</p>
</blockquote>
<p>If you do, I'll discuss this and its API at the next developers' meeting.</p>
<blockquote>
<p>Another data point: Perl 6 is optionally linking against ICU (<a href="http://github.com/rakudo/rakudo/tree" class="external">http://github.com/rakudo/rakudo/tree</a>).</p>
</blockquote>
<p>Ruby avoids compile options that change behaviour at the Ruby layer.<br>
So such options are hard to merge.</p>
<blockquote>
<p>I cannot speak for Ruby core team but I personally do not like ICU.</p>
<p>It's not that I am against Unicode but Unicode is only part of what<br>
Ruby aims to support yet ICU is huge and written in C++ which would<br>
drastically enlarge Ruby core size and reduce portability.</p>
</blockquote>
<p>Yes, so if we use ICU, the library will be a bundled or external library, not core.</p>
<blockquote>
<p>Better support for character classes outside of the ASCII range is<br>
certainly desirable but it requires more design and planning than<br>
"Hey, I saw ICU, it's cool, let's use it".</p>
</blockquote>
<p>true.</p>
<blockquote>
<p>First the questions</p>
<ul>
<li>What exactly we want supported for what purposes?</li>
</ul>
</blockquote>
<p>Unicode normalization and other conversions, Unicode-sensitive case conversions and matching,<br>
and so on are the base framework of current internet specs.<br>
For example, if the uri library is to support IRIs, it needs those functions.</p>
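<p>Unicode normalization, the first function named here, later did land in core Ruby as String#unicode_normalize (Ruby 2.2+; at the time of this thread it required an external library). A sketch of what it provides:</p>

```ruby
# Normalization makes canonically-equivalent strings byte-identical.
composed   = "\u00E9"   # "é" as a single codepoint (NFC form)
decomposed = "e\u0301"  # "e" plus combining acute accent (NFD form)

composed == decomposed                          # => false: different bytes
composed == decomposed.unicode_normalize(:nfc)  # => true
composed.unicode_normalize(:nfd) == decomposed  # => true
```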
<blockquote>
<ul>
<li>What the cost would be?</li>
</ul>
</blockquote>
<p>Human resources and distribution size.</p>
<blockquote>
<ul>
<li>Do we really want to pay that cost?</li>
</ul>
</blockquote>
<p>YES, more and more specs depend on Unicode.<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=6295 (2009-10-19T01:03:57Z) pedz (Perry Smith) pedz@easesoftware.com
<ul></ul><p>=begin<br>
I discovered ICU and ICU4R back in 2007, and I have just now moved my code to<br>
Ruby 1.9. I'm a pretty big advocate of using ICU. Nothing<br>
supports as many encodings as ICU, to my knowledge. It is the only library<br>
that addresses many of the EBCDIC encodings (of which there are some 147).</p>
<p>The reason I came to use ICU is the application I'm working on needs<br>
to translate EBCDIC encoded Japanese characters to something a browser<br>
can use such as utf-8. ICU is the only portable library that I found<br>
and it is also the only library that had the encodings that I needed.</p>
<p>I'm assuming a few things here. One is that this:</p>
<p><a href="http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html" class="external">http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html</a></p>
<p>is accurate for the most part. In particular, this paper seems to say<br>
that there is choice between a UCS model and an CSI model and Ruby 1.9<br>
has choosen CSI. From my perspective, a CSI model should be an<br>
envelope around a UCS model.</p>
<p>My background is working inside IBM for 20+ years, and I've bumped into<br>
multi-byte language issues since 1989. I'm not an expert by any<br>
means, but I have seen IBM struggle with this for decades.</p>
<p>Perhaps only IBM and legacy IBM applications have these issues. I<br>
simply don't know, but I will say that all of the other open-source<br>
language encoding implementations support very few<br>
encodings compared to what you see when dealing with legacy<br>
international applications.</p>
<p>In the text below, I will use "aaa" to represent a string using an<br>
encoding of A, "bbb" will represent a string using an encoding of B,<br>
and so on. I will also simply put B to stand for encoding B.</p>
<p>I believe that the CSI model is a great choice: why translate<br>
everything all the time? If an application is going to read data and<br>
write it back out, translating it is both a waste of time and error<br>
prone.</p>
<p>I believe the implementors of a UCS model fall back and say that if<br>
the application is going to compare strings they must be in a common<br>
encoding -- Ruby agrees with this point. And, they also would argue<br>
that if you want to translate "aaa" into B, it is simply more<br>
practical to go to a common encoding C first. Then you have only 2N<br>
encoders instead of N^2 encoders. To me, that argument is very<br>
sound. If plausible, I would allow specific A to B translators to be<br>
plugged in.</p>
<p>The key place where I believe Ruby's choice of a CSI model wins is the<br>
fact that there are a lot of places that data can be used and<br>
manipulated without translation. Keeping and using the CSI model in<br>
all those places is a clear win. In all those places, the data is<br>
opaque; it is not interpreted or understood by the application.</p>
<p>Opaque data can be compared for equality as Ruby appears to be doing<br>
now -- the two strings must have the same encoding and byte for byte<br>
compare as equal.</p>
<p>Technically, opaque data can be concatenated and spliced as well.<br>
This is one place that Ruby's 1.9 implementation surprised me a bit.<br>
It could be that "aaa" + "bbb" yields a String that is a list of<br>
SubStrings. I'll write this as x = [ "aaa", "bbb" ]. That would have many<br>
useful concepts: length would be the sum of the length of all the<br>
SubStrings. x[1] would be "a". x[4] would be "b". x[2,2] would yield<br>
a String with two SubStrings (again, this is just how I'm representing<br>
it) [ "a", "b" ]. x.encoding would return Mixed in these cases.<br>
Encoding would be a concept attached to a SubString rather than<br>
String. x.each would return a sequence of "a", "a", "a", "b", "b",<br>
"b", each with an encoding of A for the "a"s and B for the "b"s. String<br>
would still be what most applications use. Rarely would they need to<br>
know about the SubStrings.</p>
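<p>This list-of-SubStrings idea can be sketched as a toy class (purely illustrative; MixedString is not a real Ruby class, and real Ruby raises on such mixed concatenation):</p>

```ruby
# Toy sketch of a string made of differently-encoded segments.
class MixedString
  attr_reader :segments

  def initialize(*segments)
    @segments = segments
  end

  # Total length is the sum of the segment lengths.
  def length
    @segments.sum(&:length)
  end

  # Character access walks the segments in order.
  def [](index)
    @segments.each do |seg|
      return seg[index] if index < seg.length
      index -= seg.length
    end
    nil
  end

  # Encoding is per-segment; a heterogeneous string reports :mixed.
  def encoding
    encs = @segments.map(&:encoding).uniq
    encs.size == 1 ? encs.first : :mixed
  end
end

x = MixedString.new("aaa".encode("UTF-8"), "bbb".encode("Shift_JIS"))
x.length    # => 6
x[1]        # => "a"
x[4]        # => "b"
x.encoding  # => :mixed
```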
<p>Many text manipulations can be done with opaque data because the<br>
characters themselves are still not being interpreted by the<br>
application. To the application they are just "doodads" that those human<br>
guys know about. I believe that if Ruby wants to hold strongly to the<br>
CSI model that encoding agnostic string manipulations should be<br>
implemented.</p>
<p>The places where the actual characters are "understood" by an<br>
application is for sorting (collation) and if, for some external<br>
reason, they need to be translated to a particular encoding.</p>
<p>Sorting not only depends upon the encoding but also the language.<br>
Sorting could be done with routines specific to an encoding plus<br>
language but I believe that is impractical to implement. Utopia would<br>
be the ability to plug (and grow) sort routines that would be specific<br>
to the encoding and language with a fall back going to a sort routine<br>
tailored for the language and a common encoding such as UTF-16 and if<br>
the language was not known (or implemented), fall back to sorting<br>
based upon just the encoding, and if that was not available, fall back<br>
to a sort based upon a common encoding.</p>
<p>As has been pointed out already, the String#to_i routine needs to be<br>
encoding savvy. There are probably a few more methods that need to be<br>
encoding savvy.</p>
<p>The translations, collations, and other places where characters must be<br>
understood by the application are where I believe using ICU is a huge<br>
win. ICU should not be used all the time, because most of the time no<br>
understanding of the characters is needed by the application. But if<br>
translation or collation is needed, ICU is a huge repository that is<br>
already implemented and available.</p>
<p>I have not seen arguments against ICU that I believe hold much weight.<br>
It is more portable than any iconv implementation (because iconv has<br>
been stuffed into the libc implementation, and pulling it back apart<br>
looked really hard to me). The fact that it is huge is just a<br>
reflection of the size of the problem.</p>
<p>=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=6297 (2009-10-19T04:42:15Z) naruse (Yui NARUSE) naruse@airemix.jp
<ul></ul><p>=begin</p>
<blockquote>
<p>needs to translate EBCDIC encoded Japanese characters</p>
</blockquote>
<p>What is the encoding, and do you think the converter for that encoding should be included?<br>
I guess the converter could convert the encoding to EUC-JP algorithmically.</p>
<blockquote>
<p>I'm assuming a few things here. One is that this:<br>
<a href="http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html" class="external">http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html</a><br>
is accurate for the most part.</p>
</blockquote>
<p>I wrote it.</p>
<blockquote>
<p>It could be that "aaa" + "bbb" yields String that is a list of<br>
SubStrings. I'll write as x = [ "aaa", "bbb" ]. That would have many<br>
useful concepts: length would be the sum of the length of all the<br>
SubStrings. x[1] would be "a". x[4] would be "b". x[2,2] would yield<br>
a String with two SubStrings (again, this is just how I'm representing<br>
it) [ "a", "b" ]. x.encoding would return Mixed in these cases.<br>
Encoding would be a concept attached to a SubString rather than<br>
String. x.each would return a sequence of "a", "a", "a", "b", "b",<br>
"b", each with an encoding of A for the "a"s and B for the "b"s. String<br>
would still be what most applications use. Rarely would they need to<br>
know about the SubStrings.</p>
</blockquote>
<p>That concept is sometimes introduced as a "rope".</p>
<p><a href="http://jp.rubyist.net/magazine/?0022-RubyConf2007Report#l13" class="external">http://jp.rubyist.net/magazine/?0022-RubyConf2007Report#l13</a><br>
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.9450" class="external">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.9450</a><br>
<a href="http://www.kmonos.net/wlog/39.php#_1841040529" class="external">http://www.kmonos.net/wlog/39.php#_1841040529</a> in Japanese<br>
<a href="http://d.hatena.ne.jp/ku-ma-me/20070730/p1" class="external">http://d.hatena.ne.jp/ku-ma-me/20070730/p1</a> in Japanese</p>
<p>A rope offers:</p>
<ul>
<li>fast string concatenation</li>
<li>fast substring extraction</li>
<li>no in-place modification of substrings</li>
<li>slow indexed access to a character</li>
</ul>
<p>But Ruby's strings are mutable.<br>
This seems a critical issue for ropes.<br>
Moreover, Ruby users often match regexps against strings.<br>
I don't think ropes have enough merit to be worth implementing in such a tough environment.</p>
<blockquote>
<p>I believe that if Ruby wants to hold strongly to the<br>
CSI model that encoding agnostic string manipulations should be<br>
implemented.</p>
</blockquote>
<p>Ruby is a practical language, although Ruby uses the CSI model :-)<br>
In the current situation, such a concept is hard to realize in Ruby<br>
because of performance, difficulty of implementation, and lack of need.</p>
<blockquote>
<p>Sorting not only depends upon the encoding but also the language.<br>
Sorting could be done with routines specific to an encoding plus<br>
language but I believe that is impractical to implement.</p>
</blockquote>
<p>Yes, String needs language information.<br>
This is an open problem.<br>
We may have to implement ropes for languages.</p>
<blockquote>
<p>It is more portable than any iconv implementation (because iconv has<br>
been stuff into the libc implemntation and pulling it back apart<br>
looked really hard to me).</p>
</blockquote>
<p>For String, the core of Ruby, iconv is out of the question.<br>
A core library and its dependencies must be as portable as Ruby itself.</p>
<blockquote>
<p>The fact that it is hugh is just a reflection of the size of the problem.</p>
</blockquote>
<p>I think the problem is too heavy for current Ruby to handle.<br>
And Ruby 1.9 uses the CSI model; that is beyond ICU, which uses the UCS model.<br>
=end</p>
Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Support. https://bugs.ruby-lang.org/issues/2034?journal_id=6321 (2009-10-20T14:27:50Z) duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
Hello Perry,</p>
<p>On 2009/10/19 1:03, Perry Smith wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-2 status-6 priority-4 priority-default closed" title="Feature: Consider the ICU Library for Improving and Expanding Unicode Support (Rejected)" href="https://bugs.ruby-lang.org/issues/2034">#2034</a> has been updated by Perry Smith.</p>
<p>I discovered ICU and ICU4R back in 2007 and I just now moved it to<br>
Ruby 1.9. I'm a pretty big advocate of using ICU. There is nothing<br>
that has as many encodings as ICU to my knowledge. It is the only one<br>
that addresses many of the EBCDIC encodings (of which there are some<br>
147).</p>
</blockquote>
<p>It's no surprise that ICU is strong on EBCDIC. ICU started at IBM, and<br>
IBM still contributes a lot :-). [If IBM contributed to Ruby, Ruby might<br>
also be stronger on EBCDIC.]</p>
<blockquote>
<p>The reason I came to use ICU is the application I'm working on needs<br>
to translate EBCDIC encoded Japanese characters to something a browser<br>
can use such as utf-8. ICU is the only portable library that I found<br>
and it is also the only library that had the encodings that I needed.</p>
</blockquote>
<p>Can you tell me what encodings exactly you need? And which of them are<br>
table based? (see also Yui's message) We can definitely have a look at them.</p>
<p>One big problem with ICU is that it is UTF-16-based, whereas Ruby<br>
(mainly) uses UTF-8 for Unicode. But fortunately, there are exceptions.<br>
I learned just last week at the Internationalization and Unicode<br>
conference that there is now a purely UTF-8 based sorting routine in<br>
ICU. I think it may make sense for Ruby to try and extract it.</p>
<blockquote>
<p>I'm assuming a few things here. One is that this:</p>
<p><a href="http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html" class="external">http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html</a></p>
<p>is accurate for the most part. In particular, this paper seems to say<br>
that there is choice between a UCS model and an CSI model and Ruby 1.9<br>
has choosen CSI. From my perspective, a CSI model should be an<br>
envelope around a UCS model.</p>
</blockquote>
<p>Can you explain what you mean by 'envelope around UCS model'?</p>
<p>The way I understand your "envelope around a UCS model" is that it's easy<br>
to use a UCS model inside Ruby's CSI; the main thing you have to do is<br>
use the -U option. But maybe you meant something different?</p>
<blockquote>
<p>I believe the implementors of a UCS model fall back and say that if<br>
the application is going to compare strings they must be in a common<br>
encoding -- Ruby agrees with this point. And, they also would argue<br>
that if you want to translate "aaa" into B, it is simply more<br>
practical to go to a common encoding C first. Then you have only 2N<br>
encoders instead of N^2 encoders. To me, that argument is very<br>
sound. If plausible, I would allow specific A to B translators to be<br>
plugged in.</p>
</blockquote>
<p>Ruby allows this. It's actually used, e.g., for Shift_JIS <-> EUC-JP<br>
translation. The reason to use it is that it allows transcoding<br>
"gaiji", at least to a certain extent.</p>
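<p>Ruby's transcoding machinery exposes the conversion path it chooses. Encoding::Converter.search_convpath is a real method; the exact paths it returns depend on the Ruby build, so the sketch below only inspects them rather than asserting a particular route:</p>

```ruby
# Inspect the path Ruby's transcoder picks between two encodings.
# Many pairs pivot through UTF-8, while some pairs, such as
# Shift_JIS and EUC-JP, are mentioned above as having direct converters.
pivot  = Encoding::Converter.search_convpath("ISO-8859-1", "EUC-JP")
direct = Encoding::Converter.search_convpath("Shift_JIS", "EUC-JP")

# Each path is a list of conversion hops, e.g. [source, destination] pairs.
pivot.each { |hop| p hop }
```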
<blockquote>
<p>The key place where I believe Ruby's choice of a CSI model wins is the<br>
fact that there are a lot of places that data can be used and<br>
manipulated without translation. Keeping and using the CSI model in<br>
all those places is a clear win. In all those places, the data is<br>
opaque; it is not interpreted or understood by the application.</p>
<p>Opaque data can be compared for equality as Ruby appears to be doing<br>
now -- the two strings must have the same encoding and byte for byte<br>
compare as equal.</p>
<p>Technically, opaque data can be concatenated and spliced as well.<br>
This is one place that Ruby's 1.9 implementation surprised me a bit.</p>
</blockquote>
<p>Yes, you can take the CSI model further and further. But you will always<br>
bump into problems where encodings do not match sooner or later.<br>
(btw, in Ruby, you can concatenate as long as the data is in an<br>
ASCII-compatible encoding and is ASCII-only.)</p>
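<p>Martin's parenthetical about concatenation can be demonstrated directly; a sketch of where cross-encoding concatenation works and where it raises:</p>

```ruby
# Concatenation succeeds when the strings are compatible, e.g. when one
# side is ASCII-only in an ASCII-compatible encoding; otherwise it raises.
utf8  = "日本語"                    # UTF-8
ascii = "abc".encode("Shift_JIS")   # Shift_JIS, but ASCII-only bytes

(utf8 + ascii).encoding  # => #<Encoding:UTF-8>: the ASCII-only side adapts

sjis = "日本語".encode("Shift_JIS") # non-ASCII Shift_JIS content
begin
  utf8 + sjis                       # incompatible: UTF-8 and Shift_JIS
rescue Encoding::CompatibilityError
  # raised, as expected
end
```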
<blockquote>
<p>It could be that "aaa" + "bbb" yields String that is a list of<br>
SubStrings. I'll write as x = [ "aaa", "bbb" ].</p>
</blockquote>
<p>On the file level, this would be similar to having a file with internal<br>
change of character encoding. At the very, very early stages of Web<br>
internationalization, some people proposed such a model, but the Web<br>
went a different way. And so went most if not all text editors, you<br>
can't have a file with many different encodings at the same time. Sure<br>
file encodings and internal encodings work a bit differently, but it's<br>
not a disadvantage if those two models match.</p>
<blockquote>
<p>The places where the actual characters are "understood" by an<br>
application is for sorting (collation) and if, for some external<br>
reason, they need to be translated to a particular encoding.</p>
</blockquote>
<p>There's lots more cases. In particular regular expressions. Even with<br>
Ruby's current model, it took a long time to smooth the edges.</p>
<blockquote>
<p>Sorting not only depends upon the encoding but also the language.</p>
</blockquote>
<p>Yes, but please note that sorting depends on encoding in completely<br>
different ways than on language. For language, what counts is not the<br>
language of the text being sorted, but the language of the user.</p>
<p>Let's say you have two words, a Swedish one (översätter, to translate),<br>
and a German one (öffnen, to open). Swedish sorts 'ö' after 'z', German<br>
sorts 'ö' with 'o', taking the difference between the two just as a<br>
secondary difference (i.e. to order words with 'o' and 'ö', first look<br>
at the rest of the word, and only if the rest of the word is identical,<br>
then order the word with 'ö' after the word with 'o').</p>
<p>So some people argue that in an alphabetical list, the two words above<br>
should be ordered (with some others thrown in) as follows:</p>
<p>abstract<br>
nominal<br>
öffnen (German, so goes into the 'o' section)<br>
often<br>
substring<br>
xylophone<br>
zebra<br>
översätter (Swedish, so goes after 'z')</p>
<p>But this is wrong. There should be (at least) two sort orders for the<br>
above data, one for Swedish and the other for German:</p>
<p>Swedish sort order:</p>
<p>abstract<br>
nominal<br>
often<br>
substring<br>
xylophone<br>
zebra<br>
öffnen (all 'ö's go after 'z')<br>
översätter</p>
<p>German sort order:</p>
<p>abstract<br>
nominal<br>
öffnen (all 'ö's go with 'o')<br>
often<br>
översätter<br>
substring<br>
xylophone<br>
zebra</p>
<p>So there is no need for sorting to know the language of the data.</p>
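<p>The two orderings above can be sketched with plain sort_by and toy collation weights (this is not ICU; the weight tables are simplified to handle only 'ö', and German's secondary o/ö difference is collapsed entirely):</p>

```ruby
# Toy collation: the same words, sorted under two users' rules.
words = %w[abstract nominal öffnen often översätter substring
           xylophone zebra]

# Swedish: 'ö' sorts after 'z' ("z~" > "z" in byte order).
swedish = words.sort_by { |w| w.chars.map { |c| c == "ö" ? "z~" : c } }

# German: 'ö' sorts together with 'o' (secondary difference ignored here).
german  = words.sort_by { |w| w.chars.map { |c| c == "ö" ? "o" : c } }

swedish.last(2)  # => ["öffnen", "översätter"]
german[2, 3]     # => ["öffnen", "often", "översätter"]
```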
<blockquote>
<p>Sorting could be done with routines specific to an encoding plus<br>
language but I believe that is impractical to implement. Utopia would<br>
be the ability to plug (and grow) sort routines that would be specific<br>
to the encoding and language with a fall back going to a sort routine<br>
tailored for the language and a common encoding such as UTF-16 and if<br>
the language was not known (or implemented), fall back to sorting<br>
based upon just the encoding, and if that was not available, fall back<br>
to a sort based upon a common encoding.</p>
</blockquote>
<p>If you think this is necessary, please start implementing. In my<br>
opinion, it will take you a lot of time, with very little advantage over<br>
a single-encoding sorting implementation.</p>
<blockquote>
<p>As has been pointed out already, the String#to_i routine needs to be<br>
encoding savvy. There are probably a few more methods that need to be<br>
encoding savvy.</p>
</blockquote>
<p>Lots of places can be made more encoding-savy. But overall, I think<br>
concentrating on getting more functionality for UTF-8 strings, and<br>
transcoding to UTF-8 for heavy functionality, is the way to go.</p>
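<p>In Ruby 1.9 terms, the "transcode to UTF-8 for heavy functionality" approach might look like this minimal sketch using the Encoding API:</p>

```ruby
# A Latin-1 byte string arrives from some legacy source...
latin1 = "M\xFCnchen".b.force_encoding("ISO-8859-1")  # "München" in Latin-1

# ...and is transcoded to UTF-8 before any heavy string work.
utf8 = latin1.encode("UTF-8")

utf8.length    # => 7 (characters)
utf8.bytesize  # => 8 ('ü' occupies two bytes in UTF-8)
```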
<p>Regards, Martin.</p>
<p>--<br>
#-# Martin J. Dürst, Professor, Aoyama Gakuin University<br>
#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=63402009-10-21T10:09:53Zpedz (Perry Smith)pedz@easesoftware.com
<ul></ul><p>=begin<br>
I will try and answer both of the posts above.</p>
<p>Mostly, you both asked about which encodings. As absurd as this may<br>
sound, I don't know.</p>
<p>When I fetch text from the legacy system, it has a two byte CCSID in<br>
front of it. I have a table that translates the CCSID to the name of<br>
the encoding. It is much like:</p>
<p><a href="http://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.jsp" class="external">http://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.jsp</a></p>
<p>I then translate the text to ICU's internal format which is UTF-16.<br>
Later, I translate it to UTF-8 because that seems to work with<br>
browsers.</p>
<p>I have no idea if ICU does this using tables or what. My belief is<br>
that it does not go to EUC-JP. I also do not know, but I assume that<br>
many of these are not single-byte encodings.</p>
<p>I do know that for Japanese text, usually the page text is encoded<br>
using IBM-939. Most English is in IBM-037. But the system I'm<br>
interfacing to is used world wide and I would assume that other code<br>
pages are used.</p>
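<p>The CCSID-tagged decoding described above could be sketched like this in Ruby. The mapping here is a tiny illustrative subset of IBM's registry, and <code>decode_tagged</code> is an invented helper name, not part of any real API:</p>

```ruby
# Minimal subset of a CCSID-to-encoding-name table (illustrative only).
CCSID_TO_ENCODING = {
  819  => "ISO-8859-1",
  954  => "EUC-JP",
  1200 => "UTF-16BE",
  1208 => "UTF-8",
}.freeze

# Text arrives with a two-byte big-endian CCSID prefix; look the CCSID
# up, tag the remaining bytes with that encoding, and transcode to UTF-8.
def decode_tagged(bytes)
  ccsid = bytes[0, 2].unpack1("n")        # two-byte big-endian prefix
  name  = CCSID_TO_ENCODING.fetch(ccsid)  # unknown CCSIDs raise KeyError
  bytes[2..-1].force_encoding(name).encode("UTF-8")
end

decode_tagged([819].pack("n") + "caf\xE9".b)  # => "café"
```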
<blockquote>
<p>Can you explain what you mean by 'envelope around UCS model'?</p>
</blockquote>
<p>I think, based upon the reply, that you understand but to restate it:<br>
use ICU everywhere you can but when you are forced to translate, do it<br>
to and from some common (probably third) encoding. It just seems like<br>
it would be much easier to do that.</p>
<blockquote>
<p>On the file level, this would be similar to having a file with internal<br>
change of character encoding. At the very, very early stages of Web<br>
internationalization, some people proposed such a model, but the Web<br>
went a different way. And so went most if not all text editors, you<br>
can't have a file with many different encodings at the same time. Sure<br>
file encodings and internal encodings work a bit differently, but it's<br>
not a disadvantage if those two models match.</p>
</blockquote>
<p>I was imagining doing this only for the internal encodings. I later<br>
mentioned that translations must be done for external reasons. I<br>
meant that the translation would be done when going to a file or to<br>
any external data stream.</p>
<blockquote>
<p>rope</p>
</blockquote>
<p>I see this as an incorrect name, which may be why it has attributes<br>
that we do not want. To me, a rope is made up of many strings. But I<br>
wanted a String made up of many, e.g., SubStrings.</p>
<p>Strings would still be mutable in my scheme. It seems plausible that<br>
a data structure could be devised that would yield the Nth character<br>
in the same time as the current implementation. It may require more<br>
space.</p>
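<p>One way such a structure might look, trading memory for constant-time character access (a hypothetical sketch, not how MRI actually represents String):</p>

```ruby
# Precompute each character's byte offset once; char_at(n) is then O(1)
# even for variable-width encodings like UTF-8.
class IndexedString
  def initialize(str)
    @str = str
    @offsets = []
    pos = 0
    str.each_char do |c|
      @offsets << pos
      pos += c.bytesize
    end
    @offsets << pos  # sentinel: total byte length
  end

  def char_at(n)
    start = @offsets.fetch(n)
    @str.byteslice(start, @offsets.fetch(n + 1) - start)
  end
end

s = IndexedString.new("öffnen")
s.char_at(0)  # => "ö"
s.char_at(1)  # => "f"
```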
<blockquote>
<p>Regexp</p>
</blockquote>
<p>Yes. I totally forgot about Regexps.</p>
<p>There is one thing that confused me at the end of Martin's post. To<br>
me, data never has a language. Perhaps I'm mistaken. The data only<br>
have a language when viewed by a user. As he points out, a sort can<br>
only be properly done when the language of the user is taken into<br>
account. At least, that is how I would rephrase what he said.</p>
<p>Am I missing a subtlety there?</p>
<p>=end</p> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=64242009-10-25T02:58:00Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Target version</strong> set to <i>3.0</i></li></ul><p>=begin</p>
<blockquote>
<p>If you think this is necessary, please start implementing. In my<br>
opinion, it will take you a lot of time, with very little advantage over<br>
a single-encoding sorting implementation.</p>
</blockquote>
<p>Unicode strings need language information to select the correct glyph.<br>
This is an implicit problem of Unicode's Han Unification.<br>
For example, U+9AA8 is rendered with different glyphs in Chinese and Japanese.<br>
<a href="http://www.atmarkit.co.jp/fxml/rensai/xmlwomanabou11/learning-xml11.html" class="external">http://www.atmarkit.co.jp/fxml/rensai/xmlwomanabou11/learning-xml11.html</a> is in Japanese, but the images show the difference.</p>
<blockquote>
<p>When I fetch text from the legacy system, it has a two byte CCSID in<br>
front of it. I have a table that translates the CCSID to the name of<br>
the encoding. It is much like:</p>
<p><a href="http://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.jsp" class="external">http://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.jsp</a></p>
</blockquote>
<p>CCSID is part of IBM's encoding framework, CDRA.<br>
IBM's "Code Page" is a CCS (Coded Character Set),<br>
and a CCSID ties Code Pages to encoding schemes.<br>
So IBM's CCSID corresponds to what is elsewhere called an encoding, a charset, or a Microsoft Code Page.</p>
<p>So a CCSID can be treated as an encoding if needed.</p>
<blockquote>
<p>There is one thing that confused me at the end of Martin's post. To<br>
me, data never has a language. Perhaps I'm mistaken. The data only<br>
have a language when viewed by a user. As he points out, a sort can<br>
only be properly done when the language of the user is taken into<br>
account. At least, that is how I would rephrase what he said.</p>
</blockquote>
<p>Unicode unifies characters across languages, for example U+9AA8 above.<br>
Another critical example is the capital letter of 'i': it is not 'I' in Turkish.<br>
<a href="http://unicode.org/Public/UNIDATA/SpecialCasing.txt" class="external">http://unicode.org/Public/UNIDATA/SpecialCasing.txt</a><br>
(this is one reason why String#upcase is not Unicode-sensitive)</p>
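<p>Ruby later grew an explicit opt-in for exactly this case: since Ruby 2.4, String#upcase and String#downcase accept a <code>:turkic</code> option. A quick illustration:</p>

```ruby
"i".upcase            # => "I"  (default, language-insensitive mapping)
"i".upcase(:turkic)   # => "İ"  (U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE)
"I".downcase(:turkic) # => "ı"  (U+0131, LATIN SMALL LETTER DOTLESS I)
```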
<p>More examples follow:</p>
<ul>
<li>
<a href="http://unicode.org/reports/tr10/" class="external">http://unicode.org/reports/tr10/</a> UNICODE COLLATION ALGORITHM; affects sort</li>
<li>
<a href="http://unicode.org/reports/tr11/" class="external">http://unicode.org/reports/tr11/</a> East Asian Width; affects String#center</li>
<li>
<a href="http://unicode.org/reports/tr18/" class="external">http://unicode.org/reports/tr18/</a> UNICODE REGULAR EXPRESSIONS; Tailored Support: Level 3 affects String#upcase and /i/i<br>
=end</li>
</ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=169912011-05-16T12:55:49Zmfriedma (Michael Friedman)mfriedma@alum.mit.edu
<ul></ul><p>Hi. I'm a newcomer to Ruby - studying it right now - but I've been writing multi-lingual systems for 15 years. I think I can shed some light on internationalization issues.</p>
<p>First, I have to say that I was pretty amazed when I discovered that Ruby is not either a multi-character set system or native Unicode. I just assumed that since it comes from Japan and is a relatively new language multi-byte and Unicode support would have been automatic for its developers. Well, that's life and you can't go back in time and change things.</p>
<p>Today, for the vast majority of serious applications, Unicode support is just required. It really doesn't matter what kind of system you are doing. Think about almost anything. Would it be reasonable if a Web mail system didn't allow Chinese e-mails? If a bulletin board system didn't allow Japanese posts? If a task management system didn't allow users to at least create their own tasks in Thai? If a blogging platform didn't support Arabic?</p>
<p>I understand the concerns about backward compatibility if you convert String to a native Unicode type but I think the pain would be worth it. Legacy applications could stay on old versions of Ruby if necessary. New applications would run in native Unicode.</p>
<p>If you do go to native Unicode you have three realistic choices - UTF-8, UTF-16, and UTF-32.</p>
<p>o UTF-8 - Variable width encoding - ASCII is 1 byte, many characters are 2 byte, and some characters are 3 byte - significant performance impact but nice that it preserves ASCII semantics. Biggest disadvantage is that it encourages lazy / ignorant programmers working in English to assume that CHAR = BYTE.</p>
<p>o UTF-16 - 2 bytes for most characters, but requires 4 bytes for characters in the Supplementary Planes. (See <a href="http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane" class="external">http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane</a>). You can choose to ignore the Supplementary Planes (which was the initial design choice for Java) but that has significant impacts on your product's suitability for East Asian languages, especially name handling. Java has been modified now into what I consider to be a bastard hybrid that supports Supplementary Planes but with great pain. See <a href="http://java.sun.com/developer/technicalArticles/Intl/Supplementary/" class="external">http://java.sun.com/developer/technicalArticles/Intl/Supplementary/</a>. I strongly recommend against this approach. Sun had legacy issues since they started with Unicode support that don't apply to Ruby. Unless developers understand Unicode well enough to test with Supplementary Plane characters (and most don't) they're going to have all sorts of fun bugs when working with this approach.</p>
<p>o UTF-32 - 4 byte fixed width representation. Definitely the fastest and simplest implementation but a very painful memory penalty - "Hello" will now be 20 bytes instead of 5.</p>
<p>If I were creating Ruby from scratch I would use two types - string_utf8 and string_utf32. utf8 is optimized for storage and utf32 is optimized for speed. "string" would be aliased to string_utf32 since most applications care more about speed of string operations than memory. Documentation would strongly encourage storing files in UTF-8.</p>
<p>Next issue, of course, is sorting. In the discussion above, no one has mentioned Locale and Collations. A straight byte order sort in Unicode is usually useless. In fact, multi-lingual sorts are different in every language. For example, in English the correct sort is leap, llama, lover. In Spanish, however, it would be leap, lover, llama - the "ll" is sorted after all other "l" words. String sorts need to have an extra argument - locale or collation. See how database systems like Oracle and SQL Server handle this to get a better understanding. Note that multi-lingual sorts usually make no sense - how do you sort "中文", "العربية", "English"? [Those are the strings "Chinese", "Arabic", "English" in native scripts - hopefully they will work on this forum... if not, well, here's an example of why Unicode support is necessary everywhere.]</p>
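<p>The English/Spanish example above can be sketched with a collation-key trick. This is a toy stand-in for a real collation argument, not how Oracle or ICU actually tailor Spanish:</p>

```ruby
words = %w[leap llama lover]

english = words.sort
# Traditional Spanish treats the digraph "ll" as a letter sorted after "l";
# U+FFFF is just a stand-in key that compares greater than every ASCII letter.
spanish = words.sort_by { |w| w.gsub("ll", "l\u{FFFF}") }

english # => ["leap", "llama", "lover"]
spanish # => ["leap", "lover", "llama"]
```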
<p>String comparisons also require complex semantics. Many Unicode characters can be encoded as a single code point or as multiple code points (for a base character and accents and similar items). So you need to normalize before comparing - a pure bytewise comparison will give false negatives. The suggestion that you can do bytewise comparisons on opaque strings above is just incorrect.</p>
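<p>Ruby 2.2+ exposes normalization via String#unicode_normalize, which makes the false-negative problem easy to demonstrate:</p>

```ruby
precomposed = "caf\u00E9"   # "café" with é as one code point (U+00E9)
decomposed  = "cafe\u0301"  # "café" as e + combining acute accent (U+0301)

precomposed == decomposed   # => false: bytewise comparison gives a false negative

# After normalizing both sides to NFC, the comparison succeeds.
precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # => true
```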
<p>This leads into the next issue - handling other character sets.</p>
<p>If the world was full of smart foresightful multi-lingual people who did everything right from the beginning then we would have had Unicode from day one of computing and ASCII and other legacy character sets would not ever have existed. Well, tough... they do and amazingly enough people are still creating web pages, files, etc. in ISO-8859-1, Big5-HKSCS, GB18030, etc. YEEARGH.</p>
<p>For example, if I am writing a web spider, I need to pull in web pages in any character set, parse them for links, and then follow those links. I must support any and all character sets.</p>
<p>So, you either need character set typed strings or the ability to convert any character set to your native types for processing. My mind quails at the complexities involved in working with character set typed strings. For example, what happens when you concatenate two strings in different character sets? I just wouldn't go there. If programmers need to do character wise processing they should convert to native character sets. That is what Java does.</p>
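<p>For what it's worth, Ruby 1.9 did go there: strings carry their character set, and concatenating incompatible encodings raises rather than guessing:</p>

```ruby
utf8  = "caf\u00E9"                           # UTF-8, contains non-ASCII
eucjp = "\xA4\xA2".b.force_encoding("EUC-JP") # "あ" in EUC-JP

begin
  utf8 + eucjp                # mixing incompatible encodings is refused
rescue Encoding::CompatibilityError => e
  e.class  # => Encoding::CompatibilityError
end
```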
<p>See here for a list of character sets supported for conversion in Java: <a href="http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html" class="external">http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html</a>. Also, check out <a href="http://download.oracle.com/javase/6/docs/api/java/lang/String.html" class="external">http://download.oracle.com/javase/6/docs/api/java/lang/String.html</a> and the String constructors that take Charset arguments to properly convert byte arrays to native character set. See also <a href="http://download.oracle.com/javase/1.5.0/docs/api/java/io/InputStreamReader.html" class="external">http://download.oracle.com/javase/1.5.0/docs/api/java/io/InputStreamReader.html</a> - "An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted."</p>
<p>I hope this is helpful. Proper support for i18n is a big job but it is necessary if Ruby is really going to be a serious platform for building global systems.</p> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=170252011-05-17T16:03:22Znaruse (Yui NARUSE)naruse@airemix.jp
<ul></ul><p>Michael Friedman wrote:</p>
<blockquote>
<p>Hi. I'm a newcomer to Ruby - studying it right now - but I've been writing multi-lingual systems for 15 years. I think I can shed some light on internationalization issues.</p>
<p>First, I have to say that I was pretty amazed when I discovered that Ruby is not either a multi-character set system or native Unicode. I just assumed that since it comes from Japan and is a relatively new language multi-byte and Unicode support would have been automatic for its developers. Well, that's life and you can't go back in time and change things.</p>
</blockquote>
<p>Thank you for your interest in Ruby.<br>
But you seem to be using Ruby 1.8, which has only limited support for multi-byte characters.</p>
<p>If you want m17n support, use 1.9 and read the following documents.</p>
<ul>
<li><a href="http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html" class="external">http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html</a></li>
<li><a href="https://github.com/candlerb/string19" class="external">https://github.com/candlerb/string19</a></li>
<li><a href="http://blog.grayproductions.net/categories/character_encodings" class="external">http://blog.grayproductions.net/categories/character_encodings</a></li>
</ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=213682011-10-18T09:16:38Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Project</strong> changed from <i>Ruby master</i> to <i>14</i></li><li><strong>Category</strong> deleted (<del><i>M17N</i></del>)</li><li><strong>Target version</strong> deleted (<del><i>3.0</i></del>)</li></ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=214772011-10-23T17:21:09Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Project</strong> changed from <i>14</i> to <i>Ruby master</i></li></ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=331922012-11-20T20:50:45Zmame (Yusuke Endoh)mame@ruby-lang.org
<ul><li><strong>Target version</strong> set to <i>2.6</i></li></ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=479772014-07-23T10:10:41Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/10084">Feature #10084</a>: Add Unicode String Normalization to String class</i> added</li></ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=479872014-07-23T11:06:47Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/10085">Feature #10085</a>: Add non-ASCII case conversion to String#upcase/downcase/swapcase/capitalize</i> added</li></ul> Ruby master - Feature #2034: Consider the ICU Library for Improving and Expanding Unicode Supporthttps://bugs.ruby-lang.org/issues/2034?journal_id=674182017-10-21T10:36:05Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Rejected</i></li></ul>