Ruby Issue Tracking System: Issues
https://bugs.ruby-lang.org/
2023-10-02T06:55:45Z
Ruby master - Feature #19908 (Assigned): Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908
2023-10-02T06:55:45Z
nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<p>Unicode 15.1 has been released.</p>
<p>The current enc-unicode.rb seems to fail because <code>Indic_Conjunct_Break</code> is a property with values rather than a binary one.</p>
<p>I'm not sure how these properties should be handled.<br>
<code>/\p{InCB_Linker}/</code> or <code>/\p{InCB=Linker}/</code>, as in the comments in that file?<br>
<a href="https://github.com/nobu/ruby/tree/unicode-15.1" class="external">https://github.com/nobu/ruby/tree/unicode-15.1</a> uses the former.</p>
Ruby master - Feature #19317 (Assigned): Unicode ICU Full case mapping
https://bugs.ruby-lang.org/issues/19317
2023-01-06T15:05:39Z
noraj (Alexandre ZANNI)
<p>As announced in <a href="https://docs.ruby-lang.org/en/master/case_mapping_rdoc.html#label-Default+Case+Mapping" class="external">Case Mapping</a>, Ruby support for Unicode case mapping is not complete yet.</p>
<p>Unicode support in Ruby is pretty awesome: it works by default nearly everywhere, and things are implemented the right way and work as the UTRs expect.</p>
<p>But some features are still missing.</p>
<p>To reach <a href="https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping" class="external">ICU Full Case Mapping support</a>, a few points need to be enhanced.</p>
<a name="context-sensitive-case-mapping"></a>
<h3 >context-sensitive case mapping<a href="#context-sensitive-case-mapping" class="wiki-anchor">¶</a></h3>
<ul class="task-list">
<li class="task-list-item">
<input type="checkbox" class="task-list-item-checkbox" disabled> cf. <a href="https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf" class="external">Table 3-17 (Context Specification for Casing) of the Unicode standard</a> and <a href="https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt" class="external">ucd/SpecialCasing.txt</a>.</li>
</ul>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="s2">"ΣΣ"</span><span class="p">.</span><span class="nf">downcase</span> <span class="c1"># returns σσ instead of σς</span>
</code></pre>
<p>Output examples in ECMAScript:</p>
<pre><code>Σ ➡️ σ
Σa ➡️ σa
aΣ ➡️ aς
aΣa ➡️ aσa
ΣA ➡️ σa
aΣ a ➡️ aς a
Σ1 ➡️ σ1
aΣ1 ➡️ aς1
ΣΣ ➡️ σς
</code></pre>
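<p>The Final_Sigma rule behind these outputs can be approximated in plain Ruby. This is only a sketch under simplifying assumptions, not how Ruby or ICU implement it: the helper name <code>downcase_with_final_sigma</code> is hypothetical, and it skips case-ignorable characters one at a time rather than matching the exact context sequence from Table 3-17.</p>

```ruby
SIGMA       = "\u03A3" # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3" # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2" # GREEK SMALL LETTER FINAL SIGMA

# Hypothetical helper: downcases a string and picks the final form of sigma
# when the capital sigma is preceded by a cased letter and not followed by one
# (case-ignorable characters in between are skipped).
def downcase_with_final_sigma(str)
  chars = str.chars
  chars.each_with_index.map do |ch, i|
    next ch.downcase unless ch == SIGMA

    # Nearest neighbours on each side that are not case-ignorable.
    before = chars[0...i].reverse.find { |c| !c.match?(/\p{Case_Ignorable}/) }
    after  = chars[(i + 1)..].find { |c| !c.match?(/\p{Case_Ignorable}/) }

    preceded = before&.match?(/\p{Cased}/)
    followed = after&.match?(/\p{Cased}/)
    preceded && !followed ? FINAL_SIGMA : SMALL_SIGMA
  end.join
end

puts downcase_with_final_sigma("ΣΣ")
puts downcase_with_final_sigma("aΣa")
```

<p>With this helper the inputs listed above give the same results as the ECMAScript outputs, for example <code>ΣΣ</code> becomes <code>σς</code> and <code>aΣa</code> becomes <code>aσa</code>.</p>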
<a name="language-sensitive-case-mapping"></a>
<h3 >language-sensitive case mapping<a href="#language-sensitive-case-mapping" class="wiki-anchor">¶</a></h3>
<ul class="task-list">
<li class="task-list-item">
<input type="checkbox" class="task-list-item-checkbox" disabled> Lithuanian rules</li>
<li class="task-list-item">
<input type="checkbox" class="task-list-item-checkbox" checked disabled> Turkish and Azeri</li>
</ul>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="s2">"I"</span><span class="p">.</span><span class="nf">downcase</span> <span class="c1"># => "i"</span>
<span class="s2">"I"</span><span class="p">.</span><span class="nf">downcase</span><span class="p">(</span><span class="ss">:turkic</span><span class="p">)</span> <span class="c1"># => "ı"</span>
<span class="s2">"I</span><span class="se">\u</span><span class="s2">0307"</span><span class="p">.</span><span class="nf">upcase</span> <span class="c1"># => "İ"</span>
<span class="s2">"I</span><span class="se">\u</span><span class="s2">0307"</span><span class="p">.</span><span class="nf">upcase</span><span class="p">(</span><span class="ss">:lithuanian</span><span class="p">)</span> <span class="c1"># => "İ" instead of "I"</span>
</code></pre>
<ul class="task-list">
<li class="task-list-item">
<input type="checkbox" class="task-list-item-checkbox" disabled> using some standard locale / language codes</li>
</ul>
<p>Also, it's true that for now there are only a few language-sensitive rules (for Lithuanian, Turkish and Azeri), but why:</p>
<ul>
<li>add a <code>:turkic</code> symbol and not an <code>:azeri</code> one?</li>
<li>use an arbitrary full English language name (why <code>turkic</code> and not <code>turkish</code>?) rather than <a href="https://unicode-org.github.io/icu/userguide/locale/" class="external">ICU locale IDs</a>?
<ul>
<li>Language code ISO-639 standard</li>
<li>Script code Unicode ISO 15924 Registry</li>
<li>Country code ISO-3166 standard</li>
</ul>
</li>
</ul>
<p>So I would rather see something like this:</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="s2">"placeholder"</span><span class="p">.</span><span class="nf">upcase</span><span class="p">(</span><span class="ss">locale: :tr_TR</span><span class="p">)</span>
<span class="s2">"placeholder"</span><span class="p">.</span><span class="nf">upcase</span><span class="p">(</span><span class="ss">lang: :tr</span><span class="p">)</span>
</code></pre>
Ruby master - Feature #19171 (Open): Update Unicode data to Unicode Version 15.1
https://bugs.ruby-lang.org/issues/19171
2022-12-02T07:59:33Z
duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<p>According to <a href="http://blog.unicode.org/2022/11/the-unicode-standard-2023-release.html" class="external">http://blog.unicode.org/2022/11/the-unicode-standard-2023-release.html</a>, Unicode plans to release Version 15.1 in September 2023. According to <a href="https://www.unicode.org/versions/beta.html" class="external">https://www.unicode.org/versions/beta.html</a>, public alpha review starts Feb. 7, 2023, and ends April 4, 2023. Because alpha review may not include all the files we use, it may be difficult for us to participate.</p>
<p>Public beta review is planned to start May 23, 2023, ending July 4, 2023. At this point, we should be able to test things.</p>
Ruby master - Bug #18601 (Open): Invalid byte sequences in Big5 encodings
https://bugs.ruby-lang.org/issues/18601
2022-02-22T22:15:06Z
janosch-x (Janosch Müller)
<p>I encoded all Unicode code points in all encodings:</p>
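<pre><code># Every Unicode scalar value (surrogates excluded) as one UTF-8 string.
full_string = ((0..0xD7FF).to_a + (0xE000..0x10FFFF).to_a).pack('U*'); 1
# Canonical encoding names only: drop aliases and the special names.
uniq_encodings =
  Encoding.name_list -
  Encoding.aliases.keys -
  %w[locale external filesystem internal]
# Encode into each encoding, replacing invalid or undefined characters with ''.
encoded_strings =
  uniq_encodings.map do |enc|
    full_string.encode(enc, invalid: :replace, undef: :replace, replace: '')
  rescue => e
    puts e
  end; 1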
</code></pre>
<p>This prints about 10 "converter not found" errors, such as <code>code converter not found (UTF-8 to UTF-7)</code>, but I guess this is expected.</p>
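<p>Which encodings lack a converter can also be checked directly, without encoding the whole string. A minimal sketch, assuming only the standard <code>Encoding::Converter</code> class; the helper name <code>convertible_from_utf8?</code> is made up for this illustration:</p>

```ruby
# Hypothetical helper: true when Ruby can build a converter from UTF-8
# to the named encoding, false when no code converter is found.
def convertible_from_utf8?(enc_name)
  Encoding::Converter.new("UTF-8", enc_name)
  true
rescue Encoding::ConverterNotFoundError
  false
end

puts convertible_from_utf8?("ISO-8859-1")
puts convertible_from_utf8?("UTF-7")
```

<p>Applying the helper to each name in <code>uniq_encodings</code> should list the same encodings for which the snippet above printed "converter not found".</p>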
<p>Some of the converters seem to output invalid strings, though:</p>
<pre><code>encoded_strings.each do |str|
  # Raises ArgumentError on an invalid byte sequence; nil entries are skipped.
  str&.codepoints
rescue => e
  puts e
end; 1
</code></pre>
<p>This will print <code>invalid byte sequence in {Big5HKSCS,Big5-UAO,CP950,CP951}</code>.</p>
<p>Looking for example at the generated CP950 string, 8031 of its 25342 characters are invalid, spread across 2017 distinct ranges in the string. The invalid characters' codepoints are all in the range of 0x81..0xFE.</p>
<p>Is this a bug?</p>
<p>I would expect <code>String#encode</code> with <code>invalid: :replace, undef: :replace</code> not to create invalid byte sequences, but maybe I am misunderstanding these encodings and this is an unavoidable issue?</p>
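<p>To see where the invalid sequences sit, each character can be checked on its own. This is a small sketch, with a hypothetical helper name <code>invalid_characters</code>, demonstrated on a hand-made broken UTF-8 string rather than on the actual CP950 output:</p>

```ruby
# Hypothetical helper: returns [index, hex bytes] for every character
# that is not valid in the string's own encoding.
def invalid_characters(str)
  str.each_char.with_index
     .reject { |char, _index| char.valid_encoding? }
     .map { |char, index| [index, char.unpack1("H*")] }
end

# A deliberately broken string: 0xFF can never appear in UTF-8.
broken = "a\xFFb".dup.force_encoding("UTF-8")

p broken.valid_encoding?      # false
p invalid_characters(broken)  # [[1, "ff"]]
p broken.scrub("?")           # "a?b"
```

<p>Running <code>invalid_characters</code> on the generated Big5 strings would give the positions and raw bytes of the characters that <code>codepoints</code> rejects, and <code>String#scrub</code> can serve as a workaround until the converters are fixed.</p>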
<p>CC <a class="user active user-mention" href="https://bugs.ruby-lang.org/users/50">@duerst (Martin Dürst)</a></p>
Ruby master - Bug #18337 (Assigned): Ruby allows zero-width characters in identifiers
https://bugs.ruby-lang.org/issues/18337
2021-11-15T00:14:21Z
duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<p>Ruby allows zero-width characters in identifiers, which can be shown with the following small test:</p>
<p>irb(main):001:0> script = "ab = 20; a\u200Bb = 30; puts ab;"<br>
=> "ab = 20; ab = 30; puts ab;"<br>
irb(main):002:0> eval(script)<br>
20<br>
=> nil</p>
<p>The first line creates the script. It contains a zero-width space (ZWSP), but that's not visible in most contexts (see next line). Looking at the script, one expects 30 as an output, but the output is 20 because there are two variables involved, one with a ZWSP and one without. I propose we fix this by disallowing such characters in identifiers. I'll give more details in a followup.</p>
Ruby master - Bug #17400 (Open): Incorrect character downcase for Greek Sigma
https://bugs.ruby-lang.org/issues/17400
2020-12-16T23:47:34Z
xfalcox (Rafael Silva) xfalcox@gmail.com
<p>An issue caused by this bug was first reported in the Discourse support community at <a href="https://meta.discourse.org/t/unicode-username-results-in-error-loading-profile-page/173182?u=falco" class="external">https://meta.discourse.org/t/unicode-username-results-in-error-loading-profile-page/173182?u=falco</a>.</p>
<p>The issue is that in Greek there are two ways to downcase the letter ‘Σ’:</p>
<ul>
<li>‘ς’ when it is used at the end of a word</li>
<li>‘σ’ anywhere else</li>
</ul>
<p>NodeJS follows this rule:</p>
<pre><code>➜ node
Welcome to Node.js v12.11.1.
Type ".help" for more information.
> "ΣΠΥΡΟΣ".toLowerCase()
'σπυρος'
</code></pre>
<p>Python too:</p>
<pre><code>➜ python
Python 3.8.2 (default, Nov 23 2020, 16:33:30)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "ΣΠΥΡΟΣ".lower()
'σπυρος'
</code></pre>
<p>Ruby (both 2.7 and 3) doesn't:</p>
<pre><code>➜ ruby --version
ruby 3.0.0dev (2020-12-16T18:46:44Z master 93ba3ac036) [x86_64-linux]
➜ irb
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"
</code></pre>
<pre><code>➜ ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-linux]
➜ irb
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"
</code></pre>
Ruby master - Bug #13671 (Assigned): Regexp with lookbehind and case-insensitivity raises RegexpE...
https://bugs.ruby-lang.org/issues/13671
2017-06-22T23:28:58Z
dschweisguth (Dave Schweisguth) dave@schweisguth.org
<p>Here is a test program:</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="k">def</span> <span class="nf">test</span><span class="p">(</span><span class="n">description</span><span class="p">)</span>
  <span class="k">begin</span>
    <span class="k">yield</span>
    <span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">description</span><span class="si">}</span><span class="s2"> is OK"</span>
  <span class="k">rescue</span> <span class="no">RegexpError</span>
    <span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">description</span><span class="si">}</span><span class="s2"> raises RegexpError"</span>
  <span class="k">end</span>
<span class="k">end</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"ass, case-insensitive, special"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!ass)/i</span> <span class="o">=~</span> <span class="s1">'✨'</span> <span class="p">}</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"bss, case-insensitive, special"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!bss)/i</span> <span class="o">=~</span> <span class="s1">'✨'</span> <span class="p">}</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"as, case-insensitive, special"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!as)/i</span> <span class="o">=~</span> <span class="s1">'✨'</span> <span class="p">}</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"ss, case-insensitive, special"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!ss)/i</span> <span class="o">=~</span> <span class="s1">'✨'</span> <span class="p">}</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"ass, case-sensitive, special"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!ass)/</span> <span class="o">=~</span> <span class="s1">'✨'</span> <span class="p">}</span>
<span class="nb">test</span><span class="p">(</span><span class="s2">"ass, case-insensitive, regular"</span><span class="p">)</span> <span class="p">{</span> <span class="sr">/(?<!ass)/i</span> <span class="o">=~</span> <span class="s1">'x'</span> <span class="p">}</span>
</code></pre>
<p>Running the test program with Ruby 2.4.1 (macOS) gives</p>
<pre><code>ass, case-insensitive, special raises RegexpError
bss, case-insensitive, special raises RegexpError
as, case-insensitive, special is OK
ss, case-insensitive, special is OK
ass, case-sensitive, special is OK
ass, case-insensitive, regular is OK
</code></pre>
<p>The RegexpError is "invalid pattern in look-behind: /(?<!ass)/i (RegexpError)"</p>
<p>Side note: in the real code in which I found this error I was able to work around the error by using (?i) after the lookbehind instead of //i.</p>
<p>Running the test program with Ruby 2.3.4 does not report any RegexpErrors.</p>
<p>I think this is a regression, although I might be wrong and it might be saving me from an incorrect result with certain strings.</p>
Ruby master - Bug #7742 (Open): System encoding (Windows-1258) is not recognized by Ruby to conv...
https://bugs.ruby-lang.org/issues/7742
2013-01-26T15:33:40Z
Mars (Hong Ha Dang ) dhhmars9999@gmail.com
<p>I installed RailsInstaller on Windows 8. After the installation completed, the screen switched to</p>
<blockquote>
<p>the RailsInstaller configuration in cmd (step 2). I gave the user name DHH Mars and<br>
email: <a href="mailto:dhhma...@gmail.com" class="email">dhhma...@gmail.com</a>. It ran and showed the following message:</p>
<p>C:/RailsInstaller/scripts/config_check.rb:64:in 'exist?': code converter not<br>
found Encoding::ConverterNotFoundError from<br>
C:/RailsInstaller/scripts/config_check.rb:64:in 'main'</p>
<p>C:\Sites></p>
</blockquote>
Ruby master - Bug #6351 (Assigned): transcode table generator does not support multi characters o...
https://bugs.ruby-lang.org/issues/6351
2012-04-24T20:41:39Z
usa (Usaku NAKAMURA) usa@garbagecollect.jp
<p>I am opening a separate ticket for this. From <a href="/issues/6349">[ruby-dev:45576]</a>.</p>
<p>On 2012/04/24 17:11, "Martin J. Dürst" wrote:</p>
<blockquote>
<p>On 2012/04/24 17:02, U.Nakamura wrote:</p>
<blockquote>
<p>As usual, it looks like the data from NetBSD can be used.<br>
That said, did transcode support anything outside<br>
Unicode plane 0 (the BMP)?</p>
</blockquote>
<p>Of course :-)</p>
</blockquote>
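<p>I looked into this a bit more. Characters outside the BMP have been no problem at all since transcode was first written,<br>
but the entries that currently get in the way are the following<br>
(excerpted from <a href="http://x0213.org/codetable/euc-jis-2004-std.txt" class="external">http://x0213.org/codetable/euc-jis-2004-std.txt</a>):</p>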
<p>0xA4F7 U+304B+309A # [2000]<br>
0xA4F8 U+304D+309A # [2000]<br>
0xA4F9 U+304F+309A # [2000]<br>
0xA4FA U+3051+309A # [2000]<br>
0xA4FB U+3053+309A # [2000]</p>
<p>0xA5F7 U+30AB+309A # [2000]<br>
0xA5F8 U+30AD+309A # [2000]<br>
0xA5F9 U+30AF+309A # [2000]<br>
0xA5FA U+30B1+309A # [2000]<br>
0xA5FB U+30B3+309A # [2000]<br>
0xA5FC U+30BB+309A # [2000]<br>
0xA5FD U+30C4+309A # [2000]<br>
0xA5FE U+30C8+309A # [2000]</p>
<p>0xA6F8 U+31F7+309A # [2000]</p>
<p>0xABC4 U+00E6+0300 # [2000]</p>
<p>0xABC8 U+0254+0300 # [2000]<br>
0xABC9 U+0254+0301 # [2000]<br>
0xABCA U+028C+0300 # [2000]<br>
0xABCB U+028C+0301 # [2000]<br>
0xABCC U+0259+0300 # [2000]<br>
0xABCD U+0259+0301 # [2000]<br>
0xABCE U+025A+0300 # [2000]<br>
0xABCF U+025A+0301 # [2000]</p>
<p>0xABE5 U+02E9+02E5 # [2000]<br>
0xABE6 U+02E5+02E9 # [2000]</p>
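<p>In short, these are characters that are a single character in JIS X 0213 but two characters in Unicode.<br>
Converting from EUC-JISX0213 to UTF-8 is no problem, but the reverse direction currently fails.<br>
windows-1258 has the same problem (in the opposite direction), so I had thought this limitation would have to be removed<br>
eventually, and this seems like a good occasion to do it.</p>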
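<p>The shape of these table entries can be illustrated with a small parser. This is a sketch for reading lines in the format excerpted above, not the actual transcode table generator; the names <code>TABLE_LINE</code> and <code>parse_mapping</code> are hypothetical:</p>

```ruby
# Matches "0xA4F7 U+304B+309A # [2000]": a byte code followed by one or
# more Unicode code points joined with "+".
TABLE_LINE = /\A0x(\h+)\s+U\+(\h+(?:\+\h+)*)/

# Hypothetical helper: returns [byte_code, [code points...]] or nil
# when the line does not look like a mapping entry.
def parse_mapping(line)
  match = TABLE_LINE.match(line)
  return nil unless match

  [match[1].hex, match[2].split("+").map(&:hex)]
end

code, codepoints = parse_mapping("0xA4F7 U+304B+309A # [2000]")
text = codepoints.pack("U*")
puts format("0x%X maps to %d code points", code, text.length)
```

<p>For the entry above this reports two code points for a single EUC-JIS-2004 code, which is the case a generator limited to one character per entry cannot express.</p>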
<p>Best regards, Martin.