Bug #20025: Parsing identifiers/constants is case-folding dependent - Ruby - Ruby Issue Tracking System


Added by kddnewton (Kevin Newton) over 2 years ago. Updated about 1 year ago.

Status: Closed
Backport: 3.0: REQUIRED, 3.1: REQUIRED, 3.2: DONE
[ruby-core:115512]

Description

When CRuby parses identifiers, the result is encoding-dependent. Once an identifier is found, the parser checks whether it starts with an uppercase or a lowercase codepoint, which determines whether the identifier is a constant.

The function in charge of this is rb_sym_constant_char_p. For non-Unicode encodings where the leading byte has the top bit set, it relies on Onigmo's mbc_case_fold to determine whether the identifier is a constant (as opposed to is_code_ctype).

This works for almost every codepoint in every encoding, but there is one very odd edge case. In the Windows-1253 encoding, the 0xB5 byte is the micro sign. When case folded, the micro sign becomes the uppercase mu character and then the lowercase mu character, 0xEC. As a result, even though 0xB5 reports itself as a lowercase codepoint, it gets parsed as a constant. This example might make it clearer:

class Context < BasicObject
  def method_missing(name, *) = :identifier
  def self.const_missing(name) = :constant
end

encoding = Encoding::Windows_1253
character = 0xB5.chr(encoding)

source = "# encoding: #{encoding.name}\n#{character}\n"
result = Context.new.instance_eval(source)

puts "#{encoding.name} encoding of 0x#{character.ord.to_s(16).upcase}"
puts "  [[:alpha:]] => #{character.match?(/[[:alpha:]]/)}"
puts "  [[:alnum:]] => #{character.match?(/[[:alnum:]]/)}"
puts "  [[:upper:]] => #{character.match?(/[[:upper:]]/)}"
puts "  [[:lower:]] => #{character.match?(/[[:lower:]]/)}"
puts "  parsed as #{result}"

This results in the following output:

Windows-1253 encoding of 0xB5
  [[:alpha:]] => true
  [[:alnum:]] => true
  [[:upper:]] => false
  [[:lower:]] => true
  parsed as constant

To be clear, I don't think the case-folding is incorrect here (and @duerst (Martin Dürst) confirms that it is correct). I believe instead that it is incorrect to use case-folding here to determine if a codepoint is uppercase or not.

Note that this only impacts this one codepoint in this one encoding, so I don't believe this is actually a large-scale problem. But I found it surprising, and think we should change it.

Updated by kddnewton (Kevin Newton) over 2 years ago
#1 [ruby-core:115520]

I should additionally mention that this is the only codepoint in any encoding that this impacts. I ran a brute-force script to find any violations; please check my work here: https://gist.github.com/kddnewton/089d23d49adb5551792293fdb5bf64a0.
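For readers who want to reproduce the search without opening the gist, here is a minimal sketch of what such a brute-force check could look like for a single single-byte encoding. It is not the linked script; the `Probe` class and the `lowercase_bytes_parsed_as_constants` helper are hypothetical names, and the "violation" criterion (reports as lowercase, yet parses as a constant) is an assumption based on the description above.

```ruby
# Hypothetical probe object: any bare identifier becomes a method call,
# any bare constant becomes a const_missing lookup.
class Probe < BasicObject
  def method_missing(_name, *)
    :identifier
  end

  def self.const_missing(_name)
    :constant
  end
end

# Returns the high bytes of `encoding` that match [[:lower:]] but are
# nevertheless parsed as constants. Assumes a single-byte encoding.
def lowercase_bytes_parsed_as_constants(encoding)
  (0x80..0xFF).select do |byte|
    character = byte.chr(encoding)
    next false unless character.valid_encoding? && character.match?(/[[:lower:]]/)

    source = "# encoding: #{encoding.name}\n#{character}\n"
    Probe.new.instance_eval(source) == :constant
  rescue ::SyntaxError, ::StandardError
    # Bytes that cannot be built or parsed are simply not violations.
    false
  end
end

violations = lowercase_bytes_parsed_as_constants(Encoding::Windows_1253)
puts violations.map { |byte| format("0x%02X", byte) }.inspect
```

On a Ruby that still has the bug, this would be expected to list only 0xB5; on a Ruby containing the fix, the list should presumably be empty.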

Updated by nobu (Nobuyoshi Nakada) over 2 years ago
#2 [ruby-core:115528]

https://github.com/ruby/ruby/pull/9059

Updated by nobu (Nobuyoshi Nakada) over 2 years ago
#3 [ruby-core:115529]

https://en.wikipedia.org/wiki/Windows-1253#cite_note-5

This is in addition to the existing μ at 0xEC, which remains in place. Unicode calls the one at 0xB5 "micro sign" (U+00B5) and the one at 0xEC "Greek small letter Mu" (U+03BC), although the former is mapped to the latter by NFKC (although not NFC) Unicode normalization.

The reason is that micro sign is folded to small Mu in Windows-1253.
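The relationship between the two bytes can be checked from Ruby by transcoding them to UTF-8. This is only an illustrative sketch; the variable names are made up for the example.

```ruby
encoding = Encoding::Windows_1253

# The two "mu" bytes discussed above, transcoded to UTF-8.
micro_sign = 0xB5.chr(encoding).encode(Encoding::UTF_8)
small_mu   = 0xEC.chr(encoding).encode(Encoding::UTF_8)

puts format("0xB5 => U+%04X", micro_sign.ord) # micro sign
puts format("0xEC => U+%04X", small_mu.ord)   # Greek small letter mu

# NFKC maps the micro sign to small mu; NFC leaves it alone.
nfkc_maps_to_mu = micro_sign.unicode_normalize(:nfkc) == small_mu
nfc_unchanged   = micro_sign.unicode_normalize(:nfc) == micro_sign
puts "NFKC maps micro sign to mu: #{nfkc_maps_to_mu}"
puts "NFC leaves micro sign as is: #{nfc_unchanged}"
```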

Updated by nobu (Nobuyoshi Nakada) over 2 years ago
#4

Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN to 3.0: REQUIRED, 3.1: REQUIRED, 3.2: REQUIRED

Updated by nobu (Nobuyoshi Nakada) over 2 years ago
#5

Status changed from Open to Closed

Applied in changeset git|79eb75a8dd64848f23e9efc465f06326b5d4b680.

[Bug #20025] Check if upper/lower before fallback to case-folding
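The order of checks named in the commit title can be modelled in a few lines of Ruby. This is a simplified sketch, not the C code in the changeset: `fold`, `constant_by_folding?` and `constant_after_fix?` are hypothetical helpers, the fold table contains only the single 0xB5 to 0xEC mapping discussed in this issue, and "the folded form differs from the original" is assumed as the meaning of the case-folding fallback.

```ruby
ENCODING = Encoding::Windows_1253

# Hypothetical stand-in for the encoding's case-fold table; only the one
# mapping discussed in this issue is modelled.
def fold(char)
  char.ord == 0xB5 ? 0xEC.chr(char.encoding) : char
end

# Assumed old behaviour: a character whose folded form differs from
# itself is treated as uppercase, hence as the start of a constant.
def constant_by_folding?(char)
  fold(char) != char
end

# Assumed new behaviour: ask the character class first, and only fall
# back to case folding when the character is neither upper nor lower.
def constant_after_fix?(char)
  return true if char.match?(/[[:upper:]]/)
  return false if char.match?(/[[:lower:]]/)

  constant_by_folding?(char)
end

micro = 0xB5.chr(ENCODING)
puts "folding only:     #{constant_by_folding?(micro)}"
puts "ctype, then fold: #{constant_after_fix?(micro)}"
```

Under these assumptions the micro sign is classified as a constant by the folding-only check and as a plain identifier once the character class is consulted first.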

Updated by duerst (Martin Dürst) over 2 years ago
#6 [ruby-core:115584]

@nobu (Nobuyoshi Nakada) wrote in #note-3:

The reason is that micro sign is folded to small Mu in Windows-1253.

The micro sign is indeed folded to small mu in Windows-1253. The reason is (most probably) that it is also folded this way in Unicode; see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt. The actual data for this is the '\354' at https://github.com/ruby/ruby/blob/85bc80a51be0ceedcc57e7b6b779e6f8f885859e/enc/windows_1253.c#L67.

P.S.: I really feel like proposing to change all these octal constants to hexadecimal, in order to bring them into the current century and align them with all the other data surrounding character encoding. But I guess that should be a separate issue.
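Both points can be checked quickly from Ruby. The sketch below uses String#downcase with the :fold option to apply Unicode case folding to the micro sign, and confirms that the octal constant '\354' is the byte 0xEC; the variable names are chosen for the example only.

```ruby
micro_sign = "\u00B5" # MICRO SIGN
small_mu   = "\u03BC" # GREEK SMALL LETTER MU

# Unicode case folding of the micro sign.
folded = micro_sign.downcase(:fold)
puts format("fold(U+%04X) = U+%04X", micro_sign.ord, folded.ord)
puts "folds to small mu: #{folded == small_mu}"

# The octal escape in the encoding table and the hex byte are the same value.
puts "0o354 == 0xEC: #{0o354 == 0xEC}"
```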

Updated by hsbt (Hiroshi SHIBATA) about 1 year ago
#7 [ruby-core:121316]

Backport changed from 3.0: REQUIRED, 3.1: REQUIRED, 3.2: REQUIRED to 3.0: REQUIRED, 3.1: REQUIRED, 3.2: DONE

ruby_3_2 commit:6c24731837f88d67517cfc590cb496daed7a0ef5 merged revision(s) 79eb75a8dd64848f23e9efc465f06326b5d4b680.
