Feature #13241

Method(s) to access Unicode properties for characters/strings

Added by duerst (Martin Dürst) over 7 years ago. Updated over 5 years ago.

Status: Open
Assignee: -
Target version: -
[ruby-core:<unknown>]

Description

[This is currently an exploratory proposal.]

Onigmo allows Unicode properties in regular expressions. With this, it's e.g. possible to check whether a string contains some Hiragana:

"ABC あ DEF" =~ /\p{hiragana}/

However, it is currently impossible to ask for e.g. the script of a character. I propose to add a method (or some methods) to String to be able to get such properties. Various (to some extent conflicting) examples:

"Aあア".script => :latin # returns script of first character only

"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values

"Aあア".property(:script) => :latin # returns specified property of first character only

"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values

"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
                        # returns arrays of property values, one array per character
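
As a rough illustration of the last form, the per-character result can be approximated today by testing each character against a fixed list of property regexps. This is only a sketch: `CANDIDATES` and `properties_by_char` are hypothetical names, and the tables cover just a handful of values rather than the full Unicode data.

```ruby
# Hypothetical lookup tables (not part of Ruby); a real implementation
# would have to cover every script and general category.
CANDIDATES = {
  script: {
    latin:    /\p{Latin}/,
    hiragana: /\p{Hiragana}/,
    katakana: /\p{Katakana}/
  },
  general_category: {
    Lu: /\p{Lu}/,
    Ll: /\p{Ll}/,
    Lo: /\p{Lo}/,
    Nd: /\p{Nd}/
  }
}.freeze

# Returns one array per character, holding the first matching value
# for each requested property (nil when no candidate matches).
def properties_by_char(str, names)
  str.each_char.map do |ch|
    names.map do |name|
      match = CANDIDATES.fetch(name).find { |_value, rx| rx.match?(ch) }
      match&.first
    end
  end
end

properties_by_char("Aあア", [:script, :general_category])
# => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
```

Trying one regexp per candidate value is slow and incomplete, which is part of the motivation for a direct lookup.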

The interface is still in flux, comments welcome!

Implementation depends on #13240.

Python offers similar functionality in its standard library, although with quite limited property coverage and not directly on strings (see https://docs.python.org/3/library/unicodedata.html).


Related issues 2 (2 open, 0 closed)

Related to Ruby master - Feature #13240: Change Unicode property implementation in Onigmo from inversion lists to direct lookup (Open)
Related to Ruby master - Feature #14618: Add display width method to String for CLI (Open)

Updated by duerst (Martin Dürst) over 7 years ago

  • Related to Feature #13240: Change Unicode property implementation in Onigmo from inversion lists to direct lookup added

Updated by matz (Yukihiro Matsumoto) over 7 years ago

I am neutral about the proposal, but the method names are too generic. It should be prefixed by unicode_ for example.

Matz.

Updated by rbjl (Jan Lelis) over 7 years ago

Great idea, I'd love to have such capabilities built into the language!

I've recently built this for scripts, blocks, and general categories at the Ruby level (see https://github.com/janlelis/unicode-scripts), so let me share some thoughts on the API:

  • I think, it should be always plural methods which return a list of properties used in the string, since Ruby does not distinguish between single characters and strings. The first example would then rather be: "Aあア".scripts => [:hiragana, :katakana, :latin] (like the fourth example). I find it better that it would always return an array than being confused by the fact that it would only consider the first character.
  • With the same reasoning, I would go for having only a properties method, and no singular property method
  • Although I kind of like the .properties([:script, :general_category]) API, it can be a little confusing when combined with the proposed plural methods approach: it implicitly switches its mode of operation to character by character, solely because the passed argument is an array. I'd suggest making this explicit, maybe by using another method such as .each_properties, by just going with each_char.properties (which probably cannot be optimized properly), or by using a keyword argument like by_char: true
  • Should there be only a .properties method (which could be used with scripts, blocks, general categories, etc.) or should there also be individual methods (like .scripts, .blocks, …)? I think both ways would be acceptable, but I like the idea of having individual methods for the most important properties.
  • A little more bikeshedding: Maybe the properties should be returned as strings instead of symbols. They represent some kind of data, so to me strings feel like the more appropriate choice. Another example: if we have such functionality for blocks as well, "Miscellaneous Mathematical Symbols-B" would have to be returned as a symbol, which just does not look so good. This is only about the values returned; all method arguments would still be symbols/keyword arguments.

What do you all think?
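
The two modes discussed in the list above (a list of scripts used in the whole string versus an explicit character-by-character result) might look roughly like this. It is a sketch only: the module `UnicodeSketch`, its method names, and the `by_char:` keyword are hypothetical, the script list is deliberately tiny, and values are returned as strings as suggested in the last bullet.

```ruby
module UnicodeSketch
  # Assumption: a short hand-picked list stands in for the full set of scripts.
  SCRIPT_NAMES = %w[Latin Hiragana Katakana Han].freeze
  SCRIPT_PATTERNS = SCRIPT_NAMES.to_h { |name| [name, /\p{#{name}}/] }.freeze

  module_function

  # Script name of a single character, or nil if none of the candidates match.
  def script_of(char)
    SCRIPT_PATTERNS.each { |name, rx| return name if rx.match?(char) }
    nil
  end

  # Default: sorted list of the scripts used anywhere in the string.
  # With by_char: true, one entry per character, in string order.
  def scripts(str, by_char: false)
    per_char = str.each_char.map { |ch| script_of(ch) }
    by_char ? per_char : per_char.compact.uniq.sort
  end
end

UnicodeSketch.scripts("Aあア")                # => ["Hiragana", "Katakana", "Latin"]
UnicodeSketch.scripts("Aあア", by_char: true) # => ["Latin", "Hiragana", "Katakana"]
```

Making the per-character mode an explicit keyword keeps the return shape predictable from the call site, rather than depending on whether an array was passed.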

Updated by rbjl (Jan Lelis) over 7 years ago

I think prefixing such methods with unicode_ would be no problem. While it's a little verbose, it still reads well:

  • "bla".unicode_scripts
  • "blubb".unicode_properties(:general_categories)

and so on. Also it is consistent with the unicode_normalize API.
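
For comparison, this is how the existing unicode_normalize family on String reads; the proposed unicode_-prefixed names would sit alongside these methods.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "e\u0301"

decomposed.unicode_normalized?          # => false (NFC is the default form)
composed = decomposed.unicode_normalize # compose into a single code point
composed.length                         # => 1
composed.unicode_normalized?            # => true
```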

Updated by shevegen (Robert A. Heiler) over 7 years ago

Jan Lelis wrote:

I think, it should be always plural methods which return a list of properties used in the
string, since Ruby does not distinguish between single characters and strings. The first
example would then rather be: "Aあア".scripts => [:hiragana, :katakana, :latin] (like the
fourth example).

I agree in the sense that your example makes more sense than the first example,
where only one result is returned:

"Aあア".script => :latin # returns script of first character only

I understand it was just an example, but it confused me because I wondered
what happened to the other characters.

I like the name "property" or "properties" more than "script" - script sounds a bit
non-descript (pun intended!).

Since matz said that it should be indicative of Unicode, e.g. with a unicode_ prefix,
the example by Jan Lelis would seem good:

"string here".unicode_properties(optional_args)

Other name suggestions:

.unicode_category
.unicode_categories
.unicode_tokenset
.unicode_token_set
.unicode_tokens

And similar perhaps.

PS: By the way, what should it return for an empty string like ""? Or numbers
or similar semi-common tokens?
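
The existing regexp properties give one data point here, without settling what the proposed methods should return: an empty string matches no property at all, while ASCII digits belong to the Common script and the Nd (decimal number) general category.

```ruby
empty  = ""
digits = "123"

empty.match?(/\p{Latin}/)   # => false (nothing to match)
digits.match?(/\p{Latin}/)  # => false
digits.match?(/\p{Common}/) # => true  (script shared across writing systems)
digits.match?(/\p{Nd}/)     # => true  (general category: decimal number)
```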

Updated by duerst (Martin Dürst) over 6 years ago

  • Related to Feature #14618: Add display width method to String for CLI added

Updated by Dan0042 (Daniel DeLorme) over 5 years ago

I had a go at this, and a naive implementation is quite simple. The only issue really is where to store the list of unicode properties.

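class String
  # Returns the names of all properties matched by at least one character of the string.
  # With no arguments, every category except :DerivedAges is checked.
  def unicode_properties(*categs)
    # Built once and cached: { category => { property name => regexp } }
    @@props ||= Hash.new.tap do |hash|
      categ = nil
      #downloaded from https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/UnicodeProps.txt
      txt = File.read(File.expand_path('../UnicodeProps.txt',__FILE__))
      # "* Name" lines start a category; lines indented by four spaces name a property in it
      txt.scan(/^\* (\S+)|^    (\S.*)/) do |c,prop|
        hash[categ=c.to_sym] = {} if c
        # property names that the regexp engine rejects are skipped
        hash[categ][prop.to_sym] = /\p{#{prop}}/ rescue next if prop
      end
    end
    categs = @@props.keys - [:DerivedAges] if categs.empty?
    result = []
    categs.each do |categ|
      # unknown categories are ignored
      @@props[categ]&.each do |prop,rx|
        result << prop if self =~ rx
      end
    end
    result
  end
end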

"ſ".unicode_properties #=> [:Alpha, :Graph, :Lower, :Print, :Word, :Alnum, :Any, :Assigned, :L, :LC, :Ll, :Latin, :Alphabetic, :Cased, :Changes_When_Casefolded, :Changes_When_Casemapped, :Changes_When_Titlecased, :Changes_When_Uppercased, :Grapheme_Base, :ID_Continue, :ID_Start, :Lowercase, :XID_Continue, :XID_Start, :CWCF, :CWCM, :CWT, :CWU, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Latn, :In_Latin_Extended_A]

"ſ".unicode_properties(:DerivedAges) #=> [:"Age=1.1", :"Age=10.0", :"Age=2.0", :"Age=2.1", :"Age=3.0", :"Age=3.1", :"Age=3.2", :"Age=4.0", :"Age=4.1", :"Age=5.0", :"Age=5.1", :"Age=5.2", :"Age=6.0", :"Age=6.1", :"Age=6.2", :"Age=6.3", :"Age=7.0", :"Age=8.0", :"Age=9.0"]

"あ".unicode_properties #=> [:Alpha, :Graph, :Print, :Word, :Alnum, :Any, :Assigned, :L, :Lo, :Hiragana, :Alphabetic, :Grapheme_Base, :ID_Continue, :ID_Start, :XID_Continue, :XID_Start, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Hira, :In_Hiragana]
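
The same technique can be tried without the downloaded file. The sketch below is an assumption-laden reduction, not the code above: it uses a short inline list of names (`KNOWN_NAMES`, with one deliberately invalid entry) and a hypothetical helper `matching_properties`. It also shows that for a multi-character string the result is the union over all characters, because each regexp only needs to match somewhere in the string.

```ruby
# Assumption: a short inline list replaces UnicodeProps.txt.
# "NoSuchProperty" is deliberately invalid to exercise the rescue.
KNOWN_NAMES = %w[Latin Hiragana Katakana Lu Lo NoSuchProperty].freeze

PROPERTY_PATTERNS = KNOWN_NAMES.each_with_object({}) do |name, table|
  begin
    table[name.to_sym] = Regexp.new("\\p{#{name}}")
  rescue RegexpError
    # Names the regexp engine does not know are skipped.
  end
end

# Names of all listed properties matched by at least one character of str.
def matching_properties(str)
  PROPERTY_PATTERNS.select { |_name, rx| rx.match?(str) }.keys
end

matching_properties("あ")  # => [:Hiragana, :Lo]
matching_properties("Aあ") # => [:Latin, :Hiragana, :Lu, :Lo]
```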