Bug #18590

String#downcase and CAPITAL LETTER I WITH DOT ABOVE

Added by andrykonchin (Andrew Konchin) 3 months ago. Updated 3 months ago.

Status:
Open
Priority:
Normal
Target version:
-
[ruby-core:107624]

Description

Downcasing the "İ" character works in an unexpected way:

'İ'.downcase
=> "i̇"

Expected result: downcasing should return "i". Instead, it returns a small "i" followed by an additional combining "dot" character:

'İ'.downcase.chars
=> ["i", "̇"]

According to the standard Unicode case mapping, the character 'İ' (0130) maps to lowercase 'i' (0069).

0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
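The fields of that record can be pulled apart with a few lines of Ruby. This is only an illustrative sketch: the constant name is made up for the example, and the field positions (12 = simple uppercase, 13 = simple lowercase, 14 = simple titlecase, counting from 0) are taken from the UnicodeData.txt layout.

```ruby
# The UnicodeData.txt record quoted above, split across two literals for width.
UNICODE_DATA_LINE =
  "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;" \
  "LATIN CAPITAL LETTER I DOT;;;0069;"

# The -1 limit keeps trailing empty fields, so the record always has 15 fields.
fields = UNICODE_DATA_LINE.split(";", -1)

code_point   = fields[0].hex # 0x0130
simple_lower = fields[13]    # simple (one-to-one) lowercase mapping, as hex text

puts format("U+%04X simple lowercase: U+%s", code_point, simple_lower)
```

Note that this field only carries the simple mapping; it says nothing about full (possibly multi-code-point) mappings.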

Updated by mame (Yusuke Endoh) 3 months ago

  • Assignee set to duerst (Martin Dürst)
  • Status changed from Open to Assigned

The document of Unicode case folding (http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt) says:

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

"F" is for "full case folding", and "T" is for "Turkic languages".
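To make the record layout concrete, here is a small sketch that parses the two CaseFolding.txt lines quoted above. The constant names and the hash layout are assumptions made for illustration only; the status letters (C, F, S, T) are the ones defined in the header of CaseFolding.txt.

```ruby
# The two records quoted above, verbatim.
CASE_FOLDING_LINES = <<~DATA
  0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
DATA

# Status letters used by CaseFolding.txt.
STATUS_NAMES = {
  "C" => "common", "F" => "full", "S" => "simple", "T" => "Turkic"
}.freeze

foldings = CASE_FOLDING_LINES.each_line.map do |line|
  data, _comment = line.split("#", 2)          # drop the trailing comment
  code, status, mapping = data.split(";").map(&:strip)
  { code: code.hex, status: status, mapping: mapping.split.map(&:hex) }
end

foldings.each do |f|
  targets = f[:mapping].map { |cp| format("%04X", cp) }.join(" ")
  puts format("U+%04X (%s) -> %s", f[:code], STATUS_NAMES[f[:status]], targets)
end
```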

String#downcase uses full Unicode case mapping by default (see https://docs.ruby-lang.org/en/3.0/String.html#method-i-downcase). You can get the result you expected with the :turkic option.

'İ'.downcase(:turkic).chars
=> ["i"]
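Comparing code points rather than rendered glyphs makes the difference between the two calls easier to see. A minimal sketch; the helper name hex_codepoints is hypothetical and not part of Ruby:

```ruby
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, written as an escape so the
# example does not depend on the encoding of the source file.
capital_i_dot = "\u0130"

# Hypothetical helper: render each code point of a string as 4-digit hex.
def hex_codepoints(str)
  str.codepoints.map { |cp| format("%04X", cp) }
end

default_result = hex_codepoints(capital_i_dot.downcase)
turkic_result  = hex_codepoints(capital_i_dot.downcase(:turkic))

puts "default: #{default_result.join(' ')}" # full Unicode case mapping
puts "turkic:  #{turkic_result.join(' ')}"  # Turkic-specific mapping
```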

Updated by mame (Yusuke Endoh) 3 months ago

@duerst (Martin Dürst) Looks like this document https://www.unicode.org/charts/case/ (which is referenced by https://docs.ruby-lang.org/en/master/doc/case_mapping_rdoc.html) says that the lowercase of U+0130 is U+0069. Which is correct?

Updated by andrykonchin (Andrew Konchin) 3 months ago

Thank you for the suggestion.

Does String#downcase (when called without arguments) follow only the Unicode case mapping rules (as stated in the documentation), or also the folding ones?

I would expect a call to String#downcase without arguments to use the one-to-one case mapping rules that are specified in the UnicodeData.txt file.

Updated by duerst (Martin Dürst) 3 months ago

  • Status changed from Assigned to Closed

andrykonchin (Andrew Konchin) wrote in #note-3:

Thank you for the suggestion.

Does String#downcase (when called without arguments) follow only the Unicode case mapping rules (as stated in the documentation), or also the folding ones?

I would expect a call to String#downcase without arguments to use the one-to-one case mapping rules that are specified in the UnicodeData.txt file.

It should use the mappings in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt.

And that is 0069 0307 (i.e. 'i' followed by dot above) for 'İ'.downcase.

The data in UnicodeData is restricted to simple case mappings (i.e. mappings that don't change the length of the string in terms of number of codepoints). In Ruby, there is no need for such a restriction. See also https://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, slide 23.
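The length-changing behaviour of full case mapping can be observed directly. The following is just a sketch using two well-known cases, 'İ' (lowercases to two code points) and 'ß' (uppercases to "SS"); the variable names are illustrative.

```ruby
examples = [
  ["\u0130", :downcase], # İ -> i followed by U+0307 COMBINING DOT ABOVE
  ["\u00DF", :upcase],   # ß -> SS
]

results = examples.map do |char, operation|
  mapped = char.public_send(operation)
  { input: char, operation: operation, output: mapped,
    before: char.length, after: mapped.length }
end

results.each do |r|
  puts "#{r[:input].inspect}.#{r[:operation]} => #{r[:output].inspect} " \
       "(#{r[:before]} -> #{r[:after]} code points)"
end
```

A simple (one-to-one) mapping table cannot express either result, because both change the number of code points in the string.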

I'm closing this, because it works as intended/described, as far as I can see.

Updated by andrykonchin (Andrew Konchin) 3 months ago

Thank you for your clarification.

Updated by mame (Yusuke Endoh) 3 months ago

  • Status changed from Closed to Open

@duerst (Martin Dürst) Let me confirm. The rdoc of 3.1 and master refers to https://www.unicode.org/charts/case/.

Default Case Mapping
By default, all of these methods use full Unicode case mapping, which is suitable for most languages. See Unicode Latin Case Chart.

It is not clear to me that the document says "0069 0307 for 'İ'.downcase". Is it okay? Should it be replaced with https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt ?

Updated by mame (Yusuke Endoh) 3 months ago

BTW, the rdoc of String#downcase in 3.1 and master is much less informative, and has a broken link (which is maybe the same issue as #18468). It was changed at f7e266e6d2ccad63e4245a106a80c82ef2b38cbf between 3.0 and 3.1. Personally I strongly prefer the 3.0 style.

Updated by duerst (Martin Dürst) 3 months ago

mame (Yusuke Endoh) wrote in #note-7:

BTW, the rdoc of String#downcase in 3.1 and master is much less informative, and has a broken link (which is maybe the same issue as #18468). It was changed at f7e266e6d2ccad63e4245a106a80c82ef2b38cbf between 3.0 and 3.1. Personally I strongly prefer the 3.0 style.

I also prefer the 3.0 version, but that's probably because I wrote that documentation of these methods (when I implemented them). Anyway, I think the 3.1 way of documenting things could also work, but the options link on each casing method should include a fragment and point to https://ruby-doc.org/core-3.1.0/doc/case_mapping_rdoc.html#label-Default+Case+Mapping, not just to https://ruby-doc.org/core-3.1.0/doc/case_mapping_rdoc.html. @BurdetteLamar

mame (Yusuke Endoh) wrote in #note-6:

@duerst (Martin Dürst) Let me confirm. The rdoc of 3.1 and master refers to https://www.unicode.org/charts/case/.

Default Case Mapping
By default, all of these methods use full Unicode case mapping, which is suitable for most languages. See Unicode Latin Case Chart.

It is not clear to me that the document says "0069 0307 for 'İ'.downcase".

That document does NOT say "0069 0307 for 'İ'.downcase".

Is it okay?

I reported to Unicode that they should check it and clarify how this chart was made.

Should it be replaced with https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt ?

In the Ruby documentation, probably yes. SpecialCasing.txt is an official Unicode data file. The case charts are just a Web page. But the case charts may be easier to understand for non-experts.

Updated by mame (Yusuke Endoh) 3 months ago

duerst (Martin Dürst) wrote in #note-8:

Is it okay?

I reported to Unicode that they should check it and clarify how this chart was made.

I see, thanks!

Should it be replaced with https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt ?

In the Ruby documentation, probably yes. SpecialCasing.txt is an official Unicode data file. The case charts are just a Web page. But the case charts may be easier to understand for non-experts.

It's certainly easy to understand, but if it's wrong, I don't think it's even worth considering.

I wanted to create a PR to fix the document, but I am unsure which document is the best reference for full case mapping. @duerst (Martin Dürst) Could you please fix it? Or should we wait until the chart is fixed?

Updated by duerst (Martin Dürst) 3 months ago

mame (Yusuke Endoh) wrote in #note-9:

I wanted to create a PR to fix the document, but I am unsure which document is the best reference for full case mapping. @duerst (Martin Dürst) Could you please fix it? Or should we wait until the chart is fixed?

The best reference is section 3.13 (Default Case Algorithms) of https://www.unicode.org/versions/latest/ch03.pdf. This is a lot of text, not as easy to understand as a table. But maybe this is better. People don't need a table, it's easy to create one with Ruby :-).
[Please note that this URI currently redirects to https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf, but I still have to upgrade Ruby to Unicode 14.0.0; hope to be able to do this in the next couple of weeks.]
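
As a rough illustration of building such a table with Ruby, the sketch below lists every character whose case mapping produces more than one code point. The method names are made up for the example, and limiting the scan to the Basic Multilingual Plane is an assumption made to keep it quick. The output depends on the Unicode version bundled with the Ruby being run.

```ruby
# Collect [code point, mapped string] pairs in U+0000..U+FFFF for which the
# given String case method (:downcase or :upcase) yields more than one
# code point.
def multi_codepoint_mappings(method_name)
  (0x0000..0xFFFF).each_with_object([]) do |cp, rows|
    next if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid characters

    mapped = [cp].pack("U").public_send(method_name)
    rows << [cp, mapped] if mapped.length > 1
  end
end

# Print one row per character: source code point, then target code points.
def print_mapping_table(title, rows)
  puts title
  rows.each do |cp, mapped|
    targets = mapped.codepoints.map { |c| format("%04X", c) }.join(" ")
    puts format("  U+%04X -> %s", cp, targets)
  end
end

lower_rows = multi_codepoint_mappings(:downcase)
upper_rows = multi_codepoint_mappings(:upcase)

print_mapping_table("Multi-code-point lowercase mappings:", lower_rows)
print_mapping_table("Multi-code-point uppercase mappings:", upper_rows)
```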
