Feature #13241: Method(s) to access Unicode properties for characters/strings - Ruby - Ruby Issue Tracking System

Added by duerst (Martin Dürst) almost 9 years ago. Updated over 6 years ago.

Status: Open
Assignee:
Target version:

[ruby-core:<unknown>]

Description

[This is currently an exploratory proposal.]

Onigmo allows Unicode properties in regular expressions. With this, it is possible, for example, to check whether a string contains any Hiragana:

"ABC あ DEF" =~ /\p{hiragana}/

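For reference, a small sketch of what property matching already returns in current Ruby. The variable name text is only illustrative:

```ruby
text = "ABC あ DEF"

# =~ returns the index of the first match (or nil when nothing matches).
p text =~ /\p{hiragana}/       # => 4

# match? only answers yes or no.
p text.match?(/\p{katakana}/)  # => false

# scan extracts the matching runs themselves.
p text.scan(/\p{hiragana}+/)   # => ["あ"]
```
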
However, there is currently no way to ask directly for a property value, e.g. the script of a character. I propose to add a method (or some methods) to String to be able to get such properties. Various (to some extent conflicting) examples:

"Aあア".script => :latin # returns script of first character only

"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values

"Aあア".property(:script) => :latin # returns specified property of first character only

"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values

"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
                        # returns arrays of property values, one array per character

The interface is still in flux, comments welcome!
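
To make the per-character variant above concrete, here is a rough approximation that uses only existing regexp support. The helper scripts_by_char and the SCRIPT_PATTERNS table are hypothetical names, and the table covers just a handful of scripts picked for this example; it is a sketch, not the proposed implementation.

```ruby
# Hypothetical lookup table: only a few scripts, chosen for illustration.
SCRIPT_PATTERNS = {
  latin:    /\p{Latin}/,
  hiragana: /\p{Hiragana}/,
  katakana: /\p{Katakana}/,
  han:      /\p{Han}/,
  common:   /\p{Common}/
}.freeze

# Returns one script symbol per character, or nil when no table entry matches.
def scripts_by_char(str)
  str.each_char.map do |ch|
    found = SCRIPT_PATTERNS.find { |_name, rx| rx.match?(ch) }
    found && found.first
  end
end

p scripts_by_char("Aあア")  # => [:latin, :hiragana, :katakana]
```

Probing one pattern after another is linear in the number of scripts, which is part of why a direct lookup would be preferable for a real implementation.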

Implementation depends on #13240.

In Python, such functionality is available in the standard library (see https://docs.python.org/3/library/unicodedata.html), although it is quite limited in property coverage and is not offered directly on strings.

Related issues 2 (2 open — 0 closed)

Updated by duerst (Martin Dürst) almost 9 years ago
#1

Related to Feature #13240: Change Unicode property implementation in Onigmo from inversion lists to direct lookup added

Updated by matz (Yukihiro Matsumoto) almost 9 years ago
#2

I am neutral about the proposal, but the method names are too generic. They should be prefixed, for example with unicode_.

Matz.

Updated by rbjl (Jan Lelis) almost 9 years ago
#3 [ruby-core:79671]

Great idea, I'd love to have such capabilities built into the language!

I've recently built this for scripts, blocks, and general categories at the Ruby level (see https://github.com/janlelis/unicode-scripts), so let me share some thoughts on the API:

I think it should always be plural methods that return a list of the properties used in the string, since Ruby does not distinguish between single characters and strings. The first example would then rather be: "Aあア".scripts => [:hiragana, :katakana, :latin] (like the fourth example). I find it better for it to always return an array than to be confused by the fact that it only considers the first character.
With the same reasoning, I would go for having only a properties method, and no singular property method.
Although I kind of like the .properties([:script, :general_category]) API, it can be a little confusing when combined with the proposed plural-methods approach: it implicitly switches its mode of operation to character by character, solely because the passed argument is an array. I'd suggest making this explicit, maybe by using another method such as .each_properties, just going with each_char.properties (which probably cannot be optimized properly), or using a keyword argument like by_char: true.
Should there be only a .properties method (which could be used with scripts, blocks, general categories, etc.), or should there also be individual methods (like .scripts, .blocks, …)? I think both ways would be acceptable, but I like the idea of having individual methods for the most important properties.
A little more bikeshedding: maybe the properties should be returned as strings instead of symbols. They represent some kind of data, so to me strings feel like the more appropriate choice. Another example: if we have such functionality for blocks as well, "Miscellaneous Mathematical Symbols-B" would have to be returned as a symbol, which just does not look so good. This is only about the values returned; all method arguments would still be symbols/keyword arguments.

What do you all think?
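
A minimal sketch of how the two modes could be kept apart with an explicit keyword, returning strings rather than symbols. Everything here is an assumption made for illustration: the top-level helper unicode_scripts, the by_char: keyword, and the short KNOWN_SCRIPTS list do not exist in Ruby.

```ruby
# Hypothetical, deliberately tiny list of script names to probe.
KNOWN_SCRIPTS = %w[Latin Hiragana Katakana Han Common].freeze

# by_char: false -> sorted list of the distinct scripts used in the string
# by_char: true  -> one entry per character, in string order (nil if unknown)
def unicode_scripts(str, by_char: false)
  per_char = str.each_char.map do |ch|
    KNOWN_SCRIPTS.find { |name| ch.match?(/\p{#{name}}/) }
  end
  by_char ? per_char : per_char.compact.uniq.sort
end

p unicode_scripts("Aあア")                 # => ["Hiragana", "Katakana", "Latin"]
p unicode_scripts("Aあア", by_char: true)  # => ["Latin", "Hiragana", "Katakana"]
```

With the keyword, the shape of the result depends on an explicit choice by the caller and not on whether the argument happens to be an array.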

Updated by rbjl (Jan Lelis) almost 9 years ago
#4 [ruby-core:79673]

I think prefixing such methods with unicode_ would be no problem. While it's a little verbose, it still reads well:

"bla".unicode_scripts
"blubb".unicode_properties(:general_categories)

and so on. Also it is consistent with the unicode_normalize API.
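
For comparison, this is the existing prefixed API the new names would sit next to. The local variable prefixed is just an illustrative name, and the exact list of methods may differ between Ruby versions:

```ruby
# String methods that already carry the unicode_ prefix.
prefixed = String.instance_methods.grep(/\Aunicode_/).sort
p prefixed  # includes :unicode_normalize and :unicode_normalized?

# The existing API in action: NFKC maps half-width katakana A (U+FF71)
# to the ordinary full-width katakana A (U+30A2).
p "\uFF71".unicode_normalize(:nfkc) == "\u30A2"  # => true
```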

Updated by shevegen (Robert A. Heiler) almost 9 years ago
#5 [ruby-core:79693]

Jan Lelis wrote:

I think, it should be always plural methods which return a list of properties used in the
string, since Ruby does not distinguish between single characters and strings. The first
example would then rather be: "Aあア".scripts => [:hiragana, :katakana, :latin] (like the
fourth example).

I agree in the sense that your example makes more sense than the first example:

"Aあア".script => :latin # returns script of first character only

which returns only one result. I understand it was just an example, but it confused me because
I wondered what happened to the other characters.

I like the name "property" or "properties" more than "script" - script sounds a bit
non-descript (pun intended!).

Since matz said that it should be indicative of Unicode, e.g. with a unicode_ prefix,
the example by Jan Lelis would seem good:

"string here".unicode_properties(optional_args)

Other name suggestions:

.unicode_category
.unicode_categories
.unicode_tokenset
.unicode_token_set
.unicode_tokens

And similar perhaps.

PS: By the way, what should it return for an empty string like ""? Or numbers
or similar semi-common tokens?
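
As a data point for these edge cases, this is what the existing regexp properties report today. It only shows what the underlying data says; what a new method should return for "" remains an open design question.

```ruby
# An empty string has no characters, so no property can match.
p "".match?(/\p{Any}/)       # => false

# ASCII digits have general category Nd and belong to the Common script,
# not to Latin.
p "42".match?(/\p{Nd}/)      # => true
p "42".match?(/\p{Common}/)  # => true
p "42".match?(/\p{Latin}/)   # => false
```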

Updated by duerst (Martin Dürst) over 7 years ago
#6

Related to Feature #14618: Add display width method to String for CLI added

Updated by Dan0042 (Daniel DeLorme) over 6 years ago
#7 [ruby-core:94206]

I had a go at this, and a naive implementation is quite simple. The only real issue is where to store the list of Unicode properties.

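class String
  # Returns the names (as symbols) of every Unicode property that matches
  # somewhere in the string. With arguments, only the given categories of
  # UnicodeProps.txt are checked.
  def unicode_properties(*categs)
    @@props ||= Hash.new.tap do |hash|
      categ = nil
      # downloaded from https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/UnicodeProps.txt
      txt = File.read(File.expand_path('../UnicodeProps.txt', __FILE__))
      # "* Name" lines start a category; lines indented by four spaces name a property
      txt.scan(/^\* (\S+)|^    (\S.*)/) do |c, prop|
        hash[categ = c.to_sym] = {} if c
        next unless prop
        begin
          hash[categ][prop.to_sym] = /\p{#{prop}}/
        rescue RegexpError
          next # skip names that this Ruby's regexp engine rejects
        end
      end
    end
    # DerivedAges is excluded by default; request it explicitly if needed
    categs = @@props.keys - [:DerivedAges] if categs.empty?
    result = []
    categs.each do |categ|
      @@props[categ]&.each do |prop, rx|
        result << prop if self =~ rx
      end
    end
    result
  end
end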

"ſ".unicode_properties #=> [:Alpha, :Graph, :Lower, :Print, :Word, :Alnum, :Any, :Assigned, :L, :LC, :Ll, :Latin, :Alphabetic, :Cased, :Changes_When_Casefolded, :Changes_When_Casemapped, :Changes_When_Titlecased, :Changes_When_Uppercased, :Grapheme_Base, :ID_Continue, :ID_Start, :Lowercase, :XID_Continue, :XID_Start, :CWCF, :CWCM, :CWT, :CWU, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Latn, :In_Latin_Extended_A]

"ſ".unicode_properties(:DerivedAges) #=> [:"Age=1.1", :"Age=10.0", :"Age=2.0", :"Age=2.1", :"Age=3.0", :"Age=3.1", :"Age=3.2", :"Age=4.0", :"Age=4.1", :"Age=5.0", :"Age=5.1", :"Age=5.2", :"Age=6.0", :"Age=6.1", :"Age=6.2", :"Age=6.3", :"Age=7.0", :"Age=8.0", :"Age=9.0"]

"あ".unicode_properties #=> [:Alpha, :Graph, :Print, :Word, :Alnum, :Any, :Assigned, :L, :Lo, :Hiragana, :Alphabetic, :Grapheme_Base, :ID_Continue, :ID_Start, :XID_Continue, :XID_Start, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Hira, :In_Hiragana]

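The same probing idea can be tried without the external file. This is a self-contained sketch with a short inline candidate list; the names CANDIDATES, PROPS, and probe_properties are hypothetical, and NoSuchProperty is included on purpose to show that unknown names are rejected when the pattern is compiled.

```ruby
# Hypothetical candidate list; the last entry is deliberately invalid.
CANDIDATES = %w[Alpha Lower Upper Latin Hiragana Ll Lo NoSuchProperty].freeze

# Build one regexp per property name, skipping names the engine rejects.
PROPS = CANDIDATES.each_with_object({}) do |name, table|
  begin
    table[name.to_sym] = Regexp.new("\\p{#{name}}")
  rescue RegexpError
    # unknown property name: leave it out of the table
  end
end.freeze

# Returns the names of all known properties that match somewhere in str.
def probe_properties(str)
  PROPS.select { |_name, rx| rx.match?(str) }.keys
end

p PROPS.key?(:NoSuchProperty)  # => false
p probe_properties("ſ")        # => [:Alpha, :Lower, :Latin, :Ll]
p probe_properties("あ")       # => [:Alpha, :Hiragana, :Lo]
```
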