Feature #13241
Method(s) to access Unicode properties for characters/strings
Status: Open
Description
[This is currently an exploratory proposal.]
Onigmo allows Unicode properties in regular expressions. With this, it's e.g. possible to check whether a string contains some Hiragana:
"ABC あ DEF" =~ /\p{hiragana}/
However, it is currently impossible to ask for e.g. the script of a character. I propose to add a method (or some methods) to String to be able to get such properties. Various (to some extent conflicting) examples:
"Aあア".script => :latin # returns script of first character only
"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values
"Aあア".property(:script) => :latin # returns specified property of first character only
"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values
"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
# returns arrays of property values, one array per character
The interface is still in flux, comments welcome!
Implementation depends on #13240.
In Python, such functionality (however, quite limited in property coverage, and not directly on String) is available in the standard library (see https://docs.python.org/3/library/unicodedata.html).
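For comparison, the closest thing available today is to probe \p{...} classes one at a time. A minimal sketch, assuming a small hand-picked candidate list; SCRIPTS and script_of are hypothetical names used only for illustration, not part of the proposal:

```ruby
# Hypothetical helper: find the script of a single character by probing
# a hand-picked list of script properties with \p{...}.
SCRIPTS = %i[Latin Hiragana Katakana Han].freeze

def script_of(char)
  # Returns the first matching script name, or nil if none of the
  # candidates match (the list above is deliberately incomplete).
  SCRIPTS.find { |name| char.match?(/\p{#{name}}/) }
end

p script_of("A")   # => :Latin
p script_of("あ")  # => :Hiragana
p script_of("ア")  # => :Katakana
```

This needs one regular expression match per candidate property, which is part of why a direct lookup would be attractive.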
Updated by duerst (Martin Dürst) almost 8 years ago
- Related to Feature #13240: Change Unicode property implementation in Onigmo from inversion lists to direct lookup added
Updated by matz (Yukihiro Matsumoto) almost 8 years ago
I am neutral about the proposal, but the method names are too generic. They should be prefixed with unicode_, for example.
Matz.
Updated by rbjl (Jan Lelis) almost 8 years ago
Great idea, I'd love to have such capabilities built into the language!
I've recently built this for scripts, blocks, and general categories at the Ruby level (see https://github.com/janlelis/unicode-scripts), so let me share some thoughts on the API:
- I think there should always be plural methods that return a list of the properties used in the string, since Ruby does not distinguish between single characters and strings. The first example would then rather be:
"Aあア".scripts => [:hiragana, :katakana, :latin]
(like the fourth example). I find it better to always return an array than to be confused by the fact that only the first character is considered.
- With the same reasoning, I would go for having only a properties method, and no singular property method.
- Although I kind of like the .properties([:script, :general_category]) API, it can be a little confusing when combined with the proposed plural-methods approach: it implicitly switches its mode of operation to character by character, solely based on the passed argument being an array. I'd suggest making this explicit, maybe by using another method such as .each_properties, by just going with each_char.properties (which probably cannot be optimized properly), or by using a keyword argument like by_char: true.
- Should there be only a .properties method (which could be used with scripts, blocks, general categories, etc.) or should there also be individual methods (like .scripts, .blocks, …)? I think both ways would be acceptable, but I like the idea of having individual methods for the most important properties.
- A little more bikeshedding: Maybe the properties should be returned as strings instead of symbols. They represent some kind of data, so to me it feels like strings are the more appropriate choice. Another example: if we have such functionality for blocks as well, "Miscellaneous Mathematical Symbols-B" would have to be returned as a symbol, which just does not look so good. This is only about the values returned; all method arguments would still be symbols/keyword arguments.
What do you all think?
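To make the two modes discussed above concrete, here is a rough sketch of a plural lookup with an explicit by_char: switch. It is written as a plain method rather than a String method, the candidate list is deliberately tiny, and unicode_scripts and CANDIDATE_SCRIPTS are hypothetical names, so treat it as an illustration of the API shape only:

```ruby
# Hypothetical sketch of the "plural" API with an explicit by_char: mode.
CANDIDATE_SCRIPTS = %w[Latin Hiragana Katakana Han].freeze

def unicode_scripts(str, by_char: false)
  # One entry per character; nil when no candidate script matches.
  per_char = str.each_char.map do |c|
    name = CANDIDATE_SCRIPTS.find { |s| c.match?(/\p{#{s}}/) }
    name && name.downcase.to_sym
  end
  # Default mode: sorted list of the distinct scripts used in the string.
  by_char ? per_char : per_char.compact.uniq.sort
end

p unicode_scripts("Aあア")                 # => [:hiragana, :katakana, :latin]
p unicode_scripts("Aあア", by_char: true)  # => [:latin, :hiragana, :katakana]
```

With the keyword argument, the caller states whether the result is a summary of the whole string or a per-character list, instead of this being inferred from the argument type.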
Updated by rbjl (Jan Lelis) almost 8 years ago
I think prefixing such methods with unicode_ would be no problem. While it's a little verbose, it still reads well:
"bla".unicode_scripts
"blubb".unicode_properties(:general_categories)
and so on. It is also consistent with the unicode_normalize API.
Updated by shevegen (Robert A. Heiler) almost 8 years ago
Jan Lelis wrote:
I think, it should be always plural methods which return a list of properties used in the
string, since Ruby does not distinguish between single characters and strings. The first
example would then rather be: "Aあア".scripts => [:hiragana, :katakana, :latin] (like the
fourth example).
I agree in the sense that your example given makes more sense than the first example,
where:
"Aあア".script => :latin # returns script of first character only
Only returned one result. I understand it was just an example, but it confused me because
I wondered what happened to the other characters?
I like the name "property" or "properties" more than "script" - script sounds a bit
non-descript (pun intended!).
Since matz said that it should be indicative of unicode, e.g. with a unicode_ prefix,
the example by Jan Lelis would seem good:
"string here".unicode_properties(optional_args)
Other name suggestions:
.unicode_category
.unicode_categories
.unicode_tokenset
.unicode_token_set
.unicode_tokens
And similar perhaps.
PS: By the way, what should it return for an empty string like ""? Or numbers
or similar semi-common tokens?
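On the empty-string and digits question, the existing \p{...} matching already gives a hint of what a regexp-based implementation would report. A small sketch; matching_properties is a hypothetical helper, and the three property names are only a sample:

```ruby
# Hypothetical helper: which of the given property names match
# at least one character of the string?
def matching_properties(str, names)
  names.select { |n| str.match?(/\p{#{n}}/) }
end

names = %w[Latin Common Nd]

# An empty string has no characters, so no property can match.
p matching_properties("", names)     # => []

# ASCII digits have script Common and general category Nd.
p matching_properties("123", names)  # => ["Common", "Nd"]
```

So an array-returning method would naturally give an empty array for "", while a method that returns the property of the first character only would need a separate answer (nil, for instance).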
Updated by duerst (Martin Dürst) almost 7 years ago
- Related to Feature #14618: Add display width method to String for CLI added
Updated by Dan0042 (Daniel DeLorme) over 5 years ago
I had a go at this, and a naive implementation is quite simple. The only issue really is where to store the list of unicode properties.
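class String
  # Returns the names of all Unicode properties (optionally limited to the
  # given categories) that match at least one character of the string.
  def unicode_properties(*categs)
    # Build the category => { property => regexp } table once and cache it.
    @@props ||= Hash.new.tap do |hash|
      categ = nil
      #downloaded from https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/UnicodeProps.txt
      txt = File.read(File.expand_path('../UnicodeProps.txt',__FILE__))
      # "* Name" lines start a category; space-indented lines are properties.
      txt.scan(/^\* (\S+)|^ +(\S.*)/) do |c,prop|
        hash[categ=c.to_sym] = {} if c
        # Skip property names that the regexp engine rejects.
        hash[categ][prop.to_sym] = /\p{#{prop}}/ rescue next if prop
      end
    end
    # By default, check every category except the (verbose) age properties.
    categs = @@props.keys - [:DerivedAges] if categs.empty?
    result = []
    categs.each do |categ|
      @@props[categ]&.each do |prop,rx|
        result << prop if self =~ rx
      end
    end
    result
  end
end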
"ſ".unicode_properties #=> [:Alpha, :Graph, :Lower, :Print, :Word, :Alnum, :Any, :Assigned, :L, :LC, :Ll, :Latin, :Alphabetic, :Cased, :Changes_When_Casefolded, :Changes_When_Casemapped, :Changes_When_Titlecased, :Changes_When_Uppercased, :Grapheme_Base, :ID_Continue, :ID_Start, :Lowercase, :XID_Continue, :XID_Start, :CWCF, :CWCM, :CWT, :CWU, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Latn, :In_Latin_Extended_A]
"ſ".unicode_properties(:DerivedAges) #=> [:"Age=1.1", :"Age=10.0", :"Age=2.0", :"Age=2.1", :"Age=3.0", :"Age=3.1", :"Age=3.2", :"Age=4.0", :"Age=4.1", :"Age=5.0", :"Age=5.1", :"Age=5.2", :"Age=6.0", :"Age=6.1", :"Age=6.2", :"Age=6.3", :"Age=7.0", :"Age=8.0", :"Age=9.0"]
"あ".unicode_properties #=> [:Alpha, :Graph, :Print, :Word, :Alnum, :Any, :Assigned, :L, :Lo, :Hiragana, :Alphabetic, :Grapheme_Base, :ID_Continue, :ID_Start, :XID_Continue, :XID_Start, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Hira, :In_Hiragana]