Bug #17400

Incorrect character downcase for Greek Sigma

Added by xfalcox (Rafael Silva) 3 months ago. Updated 3 months ago.

Status: Open
Priority: Normal
Target version: -
ruby -v: ruby 3.0.0dev (2020-12-16T18:46:44Z master 93ba3ac036) [x86_64-linux]
[ruby-core:101480]

Description

An issue caused by this bug was first reported on the Discourse support community at https://meta.discourse.org/t/unicode-username-results-in-error-loading-profile-page/173182?u=falco.

The issue is that Greek has two lowercase forms of the letter ‘Σ’:

  • ‘ς’ when it is used at the end of a word
  • ‘σ’ anywhere else

NodeJS follows this rule:

➜  node
Welcome to Node.js v12.11.1.
Type ".help" for more information.
> "ΣΠΥΡΟΣ".toLowerCase()
'σπυρος'

Python too:

➜ python
Python 3.8.2 (default, Nov 23 2020, 16:33:30) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "ΣΠΥΡΟΣ".lower()
'σπυρος'

Ruby (both 2.7 and 3) doesn't:

➜  ruby --version           
ruby 3.0.0dev (2020-12-16T18:46:44Z master 93ba3ac036) [x86_64-linux]
➜  irb           
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"
➜  ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-linux]
➜  irb
irb(main):001:0> "ΣΠΥΡΟΣ".downcase
=> "σπυροσ"

Updated by xfalcox (Rafael Silva) 3 months ago

  • Description updated (diff)

Updated by xfalcox (Rafael Silva) 3 months ago

  • Description updated (diff)

Updated by shyouhei (Shyouhei Urabe) 3 months ago

  • Assignee set to duerst (Martin Dürst)

I guess the ultimate reason this is not implemented in Ruby is that the corresponding case mapping is commented out in https://unicode.org/Public/UNIDATA/SpecialCasing.txt

So strictly speaking we just follow what Unicode says, but I agree this is not optimal. However, to implement it, "end of a word" has to be determined, which is not as intuitive as it sounds. duerst (Martin Dürst), any idea?

Updated by mame (Yusuke Endoh) 3 months ago

Just FYI: I found the special handling code for Greek Sigma in V8. It looks like it checks whether the next character (if any) is a letter.

https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/unicode.cc#L177-L185

It uses non-final lower sigma even if the next character is Japanese.

$ node
> "Σあ".toLowerCase()
'σあ'

Updated by sam.saffron (Sam Saffron) 3 months ago

Prior art here is:

Elixir

https://github.com/elixir-lang/elixir/issues/6437

https://github.com/elixir-lang/elixir/pull/6990/files

Rust

https://github.com/rust-lang/rust/issues/26035

Golang

https://github.com/golang/text/blob/master/cases/cases.go#L147-L152


Per https://dotnetfiddle.net/

using System;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("ΣΠΥΡΟΣ".ToLower());
    }
}

.NET handles this.

https://repl.it/languages/java10

class Main {  
  public static void main(String args[]) { 
    System.out.println("ΣΠΥΡΟΣ".toLowerCase()); 
  } 
}

Java handles this correctly.

....

Seems annoying to carry this edge case, but the consensus out there is that we should carry it.

The Rust ticket talks about complex cases like apostrophes.

Updated by mame (Yusuke Endoh) 3 months ago

Oops, my understanding seems to have been wrong. Please forget my previous comment. When the next character is an apostrophe, the character after it seems to determine the result, but I can't see how this behavior follows from the V8 code. I'll leave it to an expert.

$ node
> "ΑΣ' ΤΟ".toLowerCase()
'ας\' το'
> "ΑΣ'ΤΟ".toLowerCase()
'ασ\'το'

Updated by shyouhei (Shyouhei Urabe) 3 months ago

Yes, everybody wants Ruby to handle it "correctly". The problem right now is the lack of a concrete definition of "correct" here; in particular, we need a definition of a word boundary.

Updated by mame (Yusuke Endoh) 3 months ago

If the word consists of a single letter (i.e., "Σ"), toLowerCase returns "σ" instead of "ς" even though the letter is at the end of the word. The condition seems more complex.

$ node
> "Σ".toLowerCase()
'σ'

Updated by sam.saffron (Sam Saffron) 3 months ago

Java has complicated opinions as well:

class Main {  
  public static void main(String args[]) { 

    System.out.println("Σ".toLowerCase());
    System.out.println("ΣΣs".toLowerCase()); 
    System.out.println("ΣΣ".toLowerCase()); 
    System.out.println("ΣΣ, sss".toLowerCase());
    System.out.println("ΣΣ,sss".toLowerCase());
    System.out.println("ΣΣ;sss".toLowerCase());
    System.out.println("ΣΣ:sss".toLowerCase());
    System.out.println("ΣΣ: sss".toLowerCase());
    System.out.println("ΣΣ^sss".toLowerCase());
    System.out.println("ΣΣ.sss".toLowerCase());
    System.out.println("ΣΣ. sss".toLowerCase());
    System.out.println("ΣΣ'sss".toLowerCase());
    System.out.println("ΣΣ$sss".toLowerCase());
  } 
}

Output:

σ
σσs
σς
σς, sss
σς,sss
σς;sss
σς:sss
σς: sss
σς^sss
σσ.sss
σς. sss
σσ'sss
σς$sss

Full stop (.) and single quote (') are handled differently from comma (,) and other special characters. This is one very tricky edge case.
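
One possible explanation for the full stop and single quote rows, offered as an assumption rather than a confirmed reading of Java's implementation: Unicode's final-sigma condition skips characters that have the Case_Ignorable property when it looks for a neighbouring cased letter. The Ruby sketch below only queries that property through Regexp. It calls no casing API, `case_ignorable?` is a hypothetical helper name, and the property may not account for every row above.

```ruby
# Report whether a single character has the Unicode Case_Ignorable property.
# case_ignorable? is a hypothetical helper name, not a core Ruby method.
def case_ignorable?(char)
  char.match?(/\p{Case_Ignorable}/)
end

# Print the property for a few of the separators used in the Java test above.
[".", "'", ",", ";", "$"].each do |char|
  puts "#{char.inspect}: #{case_ignorable?(char)}"
end
```

If `.` and `'` report true while `,`, `;` and `$` report false, that would be consistent with the sigma before `.sss` and `'sss` staying non-final, since the `s` after the skipped character is a cased letter.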

Updated by duerst (Martin Dürst) 3 months ago

I have to acknowledge that I 'cut some corners'. What is missing is essentially the context handling in Table 3.17 on pp. 151-152 of the Unicode Standard (see https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf).

The problem from the implementation side is that it requires context, of possibly unlimited length. The context before the character is somewhat easier to handle ('just' need a little state machine) than the context after the character (which needs lookahead). Another potential problem is that programs using downcase (and capitalize and swapcase) may not give all the necessary context, because they may do this operation in pieces. But that's their problem.
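
A minimal Ruby sketch of that context check, assuming the condition as I read it from the Unicode Standard: Σ becomes ς only if a cased letter precedes it and no cased letter follows it, skipping case-ignorable characters in both directions. `final_sigma?` and `downcase_with_final_sigma` are hypothetical names, not part of String. The sketch indexes an array of characters instead of using the state machine and lookahead a real implementation would need, so treat it as an illustration only.

```ruby
CASED = /\p{Cased}/
CASE_IGNORABLE = /\p{Case_Ignorable}/

# Hypothetical helper: does the sigma at chars[index] sit in final position?
def final_sigma?(chars, index)
  # Backward scan: a cased letter must appear before any character that is
  # neither cased nor case-ignorable.
  preceded = false
  (index - 1).downto(0) do |j|
    if chars[j].match?(CASED)
      preceded = true
      break
    end
    break unless chars[j].match?(CASE_IGNORABLE)
  end
  return false unless preceded

  # Forward scan: a cased letter reached through case-ignorable characters
  # only means the sigma is not final.
  (index + 1).upto(chars.length - 1) do |k|
    return false if chars[k].match?(CASED)
    break unless chars[k].match?(CASE_IGNORABLE)
  end
  true
end

# Hypothetical wrapper: downcase character by character, special-casing Σ.
def downcase_with_final_sigma(str)
  chars = str.chars
  chars.each_with_index.map { |ch, i|
    if ch == "Σ"
      final_sigma?(chars, i) ? "ς" : "σ"
    else
      ch.downcase
    end
  }.join
end

puts downcase_with_final_sigma("ΣΠΥΡΟΣ")

# Downcasing in pieces hides the following context from the first piece.
whole  = downcase_with_final_sigma("ΑΣ'ΤΟ")
pieces = ["ΑΣ", "'ΤΟ"].map { |piece| downcase_with_final_sigma(piece) }.join
puts whole
puts pieces
```

The last two lines illustrate the 'in pieces' problem: the first piece ends at the sigma, so the sketch treats it as final even though the whole string continues with a cased letter after the apostrophe.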

The problem from the user side is that it isn't (and can't be made) perfect, as e.g. the example in https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf shows. I seem to remember that John Cowan also gave another example, where a final sigma (ς) appeared in the middle of a Greek word, at the boundary between two components. I haven't found that example in my archives, but I may get back to John and ask him again.

But using final sigma in whatever Unicode defines as the appropriate context is definitely much closer to what the user may want. I'll try to think about how to improve our implementation, but can't promise to get to it before February, sorry.
