Bug #20025
closed
Parsing identifiers/constants is case-folding dependent
Description
CRuby's identifier parsing is encoding-dependent. Once an identifier is found, the parser checks whether it starts with an uppercase or a lowercase codepoint. This determines whether the identifier is a constant or not.
The function in charge of this is rb_sym_constant_char_p. For non-Unicode encodings where the leading byte has the top bit set, this relies on Onigmo's mbc_case_fold to determine if it is a constant or not (as opposed to is_code_ctype).
This works for almost every single codepoint in every encoding, but it has one very weird edge case. In the Windows-1253 encoding, the byte 0xB5 is the micro sign. When case folded, the micro sign becomes the uppercase mu character and then the lowercase mu character, 0xEC. This means that even though 0xB5 reports itself as a lowercase codepoint, it gets parsed as a constant. This example might make this clearer:
# Any method call evaluates to :identifier, any constant lookup to :constant.
class Context < BasicObject
  def method_missing(name, *) = :identifier
  def self.const_missing(name) = :constant
end

encoding = Encoding::Windows_1253
character = 0xB5.chr(encoding)

# The source is a magic encoding comment followed by the single character.
source = "# encoding: #{encoding.name}\n#{character}\n"
result = Context.new.instance_eval(source)

puts "#{encoding.name} encoding of 0x#{character.ord.to_s(16).upcase}"
puts " [[:alpha:]] => #{character.match?(/[[:alpha:]]/)}"
puts " [[:alnum:]] => #{character.match?(/[[:alnum:]]/)}"
puts " [[:upper:]] => #{character.match?(/[[:upper:]]/)}"
puts " [[:lower:]] => #{character.match?(/[[:lower:]]/)}"
puts " parsed as #{result}"
This results in the following output:
Windows-1253 encoding of 0xB5
[[:alpha:]] => true
[[:alnum:]] => true
[[:upper:]] => false
[[:lower:]] => true
parsed as constant
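The folding chain for 0xB5 can be approximated from Ruby. This is only a sketch under two assumptions. First, it transcodes to UTF-8 and uses Unicode case folding (String#downcase with :fold) as a stand-in for Onigmo's Windows-1253 mbc_case_fold, which is not called directly here. Second, the rule "a character that folding changes is treated as uppercase" is my reading of how the fold result is interpreted, not something quoted from the CRuby source.

```ruby
encoding = Encoding::Windows_1253
micro = 0xB5.chr(encoding)

# Stand-in for the encoding's own case folding: round-trip through UTF-8
# and apply Unicode case folding there.
folded = micro.encode(Encoding::UTF_8).downcase(:fold).encode(encoding)

# Assumed heuristic: folding changed the character, so it is taken to be uppercase.
changed_by_fold = folded != micro

puts format("0x%02X folds to 0x%02X", micro.ord, folded.ord)
puts "changed by folding:  #{changed_by_fold}"
puts "matches [[:upper:]]: #{micro.match?(/[[:upper:]]/)}"
```

The last two lines are the point of the sketch: the fold-based test and the character-type test can disagree for the same byte.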
To be clear, I don't think the case-folding is incorrect here (and @duerst (Martin Dürst) confirms that it is correct). I believe instead that it is incorrect to use case-folding here to determine if a codepoint is uppercase or not.
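A minimal sketch of the alternative, written in Ruby rather than C: classify the leading character by its character type instead of by whether folding changes it. Here constant_name? is a hypothetical helper, not a CRuby function, and the [[:upper:]] bracket expression stands in for an uppercase check via is_code_ctype.

```ruby
# Hypothetical helper: decide constant-ness from the character type of the
# first character only, without consulting case folding.
def constant_name?(name)
  name.match?(/\A[[:upper:]]/)
end

encoding = Encoding::Windows_1253
puts constant_name?(0xB5.chr(encoding)) # micro sign
puts constant_name?(0xCC.chr(encoding)) # Greek capital letter mu
puts constant_name?("Foo")
puts constant_name?("foo")
```

With a check of this shape, the micro sign is classified the same way [[:upper:]] reports it in the output above, so it would no longer be parsed as a constant.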
Note that this only impacts this one codepoint in this one encoding, so I don't believe this is actually a large-scale problem. But I found it surprising, and think we should change it.