Bug #17400
Incorrect character downcase for Greek Sigma
Description
An issue caused by this bug was first reported on the Discourse support community at https://meta.discourse.org/t/unicode-username-results-in-error-loading-profile-page/173182?u=falco.
The issue is that in Greek, there are two ways to downcase the letter ‘Σ’:
- ‘ς’ when it is used at the end of a word
- ‘σ’ anywhere else
NodeJS follows this rule:
➜ node
Welcome to Node.js v12.11.1.
Type ".help" for more information.
> "ΣΠΥΡΟΣ".toLowerCase()
'σπυρος'
Python too:
➜ python
Python 3.8.2 (default, Nov 23 2020, 16:33:30)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "ΣΠΥΡΟΣ".lower()
'σπυρος'
Ruby (both 2.7 and 3) doesn't:
➜ ruby --version
ruby 3.0.0dev (2020-12-16T18:46:44Z master 93ba3ac036) [x86_64-linux]
➜ irb
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"
➜ ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-linux]
➜ irb
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"
Updated by shyouhei (Shyouhei Urabe) about 4 years ago
- Assignee set to duerst (Martin Dürst)
I guess the ultimate reason why this is not implemented in ruby is that the corresponding case mapping is commented out in https://unicode.org/Public/UNIDATA/SpecialCasing.txt
So strictly speaking we just follow what Unicode says, but I agree this is not optimal. However, to implement it, the "end of a word" has to be determined, which is not as intuitive as it sounds. @duerst (Martin Dürst) Any idea?
Updated by mame (Yusuke Endoh) about 4 years ago
Just FYI: I found the special handling code for Greek Sigma in v8. It looks like it checks whether the next character (if any) is a letter or not.
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/unicode.cc#L177-L185
It uses non-final lower sigma even if the next character is Japanese.
$ node
> "Σあ".toLowerCase()
'σあ'
Updated by sam.saffron (Sam Saffron) about 4 years ago
Prior art here is:
https://github.com/elixir-lang/elixir/issues/6437
https://github.com/elixir-lang/elixir/pull/6990/files
Rust
https://github.com/rust-lang/rust/issues/26035
Golang
https://github.com/golang/text/blob/master/cases/cases.go#L147-L152
using System;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("ΣΠΥΡΟΣ".ToLower());
    }
}
.NET handles this.
https://repl.it/languages/java10
class Main {
    public static void main(String args[]) {
        System.out.println("ΣΠΥΡΟΣ".toLowerCase());
    }
}
Java handles this correctly
....
Seems annoying to carry this edge case, but the consensus out there is that we should carry it.
The Rust ticket talks about complex cases like apostrophes.
Updated by mame (Yusuke Endoh) about 4 years ago
Oops, my understanding seems to have been wrong. Please forget my previous comment. When the next character is an apostrophe, the character after it seems to determine the result, but I don't understand this behavior from the code of v8. I leave it to an expert.
$ node
> "ΑΣ' ΤΟ".toLowerCase()
'ας\' το'
> "ΑΣ'ΤΟ".toLowerCase()
'ασ\'το'
Updated by shyouhei (Shyouhei Urabe) about 4 years ago
Yes, everybody wants ruby to handle it "correctly". The problem right now is the lack of a concrete definition of "correct" here; in particular, we need a definition of a word boundary.
Updated by mame (Yusuke Endoh) about 4 years ago
If the word has a single letter (i.e., "Σ"), toLowerCase returns "σ" instead of "ς" even though the letter is at the end of the word. The condition seems more complex.
$ node
> "Σ".toLowerCase()
'σ'
Updated by sam.saffron (Sam Saffron) about 4 years ago
Java has complicated opinions as well:
class Main {
    public static void main(String args[]) {
        System.out.println("Σ".toLowerCase());
        System.out.println("ΣΣs".toLowerCase());
        System.out.println("ΣΣ".toLowerCase());
        System.out.println("ΣΣ, sss".toLowerCase());
        System.out.println("ΣΣ,sss".toLowerCase());
        System.out.println("ΣΣ;sss".toLowerCase());
        System.out.println("ΣΣ:sss".toLowerCase());
        System.out.println("ΣΣ: sss".toLowerCase());
        System.out.println("ΣΣ^sss".toLowerCase());
        System.out.println("ΣΣ.sss".toLowerCase());
        System.out.println("ΣΣ. sss".toLowerCase());
        System.out.println("ΣΣ'sss".toLowerCase());
        System.out.println("ΣΣ$sss".toLowerCase());
    }
}
σ
σσs
σς
σς, sss
σς,sss
σς;sss
σς:sss
σς: sss
σς^sss
σσ.sss
σς. sss
σσ'sss
σς$sss
Full stop (.) and single quote (') are handled differently than comma (,) and other special chars. This is one very tricky edge case.
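One possible reading of the full stop and single quote results is that both characters carry the Unicode Case_Ignorable property, while comma, semicolon and dollar sign do not. The sketch below checks this from Ruby. The helper name `case_ignorable?` is made up for illustration and is not part of Ruby's API; it relies only on the `\p{Case_Ignorable}` regexp property. Whether Java consults exactly this property is an assumption.

```ruby
# Hypothetical helper (not part of Ruby's API): reports whether a single
# character has the Unicode Case_Ignorable property.
def case_ignorable?(char)
  char.match?(/\A\p{Case_Ignorable}\z/)
end

# Print the property for a few of the separators used in the Java test.
%w[. ' , ; $].each do |char|
  puts "#{char.inspect} => #{case_ignorable?(char)}"
end
```

If the property is what drives the behavior, a sigma followed by "." or "'" and then a letter is still treated as word-internal, which would match the "σσ.sss" and "σσ'sss" lines.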
Updated by duerst (Martin Dürst) about 4 years ago
I have to acknowledge that I 'cut some corners'. It's essentially table 3.17 on p. 151/2 of the Unicode Standard (see https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf).
The problem from the implementation side is that it requires context, of possibly unlimited length. The context before the character is somewhat easier to handle ('just' need a little state machine) than the context after the character (which needs lookahead). Another potential problem is that programs using downcase (and capitalize and swapcase) may not give all the necessary context, because they may do this operation in pieces. But that's their problem.
The problem from the user side is that it isn't (and can't be made) perfect, as e.g. the example in https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf shows. I seem to remember that John Cowan also gave another example, where a final sigma (ς) appeared in the middle of a Greek word, at the boundary between two components. I haven't found that example in my archives, but I may get back to John and ask him again.
But using final sigma in whatever Unicode defines as the appropriate context is definitely much closer to what the user may want. I'll try to think about how to improve our implementation, but can't promise to get to it before February, sorry.
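For illustration only, here is a minimal sketch of how the Final_Sigma condition from that table could be approximated in user code today. It assumes the condition reads: the sigma is preceded by a cased letter plus any number of case-ignorable characters, and is not followed by any number of case-ignorable characters plus a cased letter. The constant `FINAL_SIGMA` and the method `downcase_with_final_sigma` are hypothetical names, and this is not how Ruby implements `String#downcase`.

```ruby
# Hypothetical sketch, not Ruby's implementation.
# Matches a capital sigma that is
#   - preceded by a cased letter and any case-ignorable characters
#     (\K drops that prefix from the match, so only the sigma is replaced), and
#   - not followed by case-ignorable characters and then a cased letter.
FINAL_SIGMA = /\p{Cased}\p{Case_Ignorable}*\KΣ(?!\p{Case_Ignorable}*\p{Cased})/

# Replace sigmas in final position first, then let String#downcase
# handle everything else (it leaves "ς" unchanged).
def downcase_with_final_sigma(str)
  str.gsub(FINAL_SIGMA, "ς").downcase
end

puts downcase_with_final_sigma("ΣΠΥΡΟΣ")
puts downcase_with_final_sigma("ΑΣ' ΤΟ")
puts downcase_with_final_sigma("ΑΣ'ΤΟ")
```

The sketch only sees the string it is given, so a caller that downcases text in pieces would still lose the surrounding context, as noted above.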
Updated by katiecaballero2023 (Katie Caballero) about 1 year ago
RubyConf Hack Day: Bug still exists in 3.2.2
Updated by hsbt (Hiroshi SHIBATA) 9 months ago
- Status changed from Open to Assigned