Feature #13241

Method(s) to access Unicode properties for characters/strings

Added by duerst (Martin Dürst) over 3 years ago. Updated about 1 year ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:<unknown>]

Description

[This is currently an exploratory proposal.]

Onigmo allows Unicode properties in regular expressions. This makes it possible, for example, to check whether a string contains any Hiragana:

"ABC あ DEF" =~ /\p{hiragana}/
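For illustration, here is a slightly fuller sketch of what property matching already allows today. It only uses existing String and Regexp methods; the variable names are chosen for this example.

```ruby
# Unicode property classes in regexps, as already supported via Onigmo.
str = "ABC あ DEF"

# Index of the first Hiragana character, or nil if there is none.
hiragana_index = str =~ /\p{Hiragana}/

# Boolean test that does not set $~ (String#match? needs Ruby 2.4+).
has_katakana = str.match?(/\p{Katakana}/)

# All Hiragana characters in the string.
hiragana_chars = str.scan(/\p{Hiragana}/)
```

Note that each of these asks "does the string match property X?"; none of them answers "which property value does this character have?", which is the gap this proposal is about.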

However, there is currently no way to ask for a property value directly, e.g. the script of a character. I propose adding a method (or several methods) to String for retrieving such properties. Various (to some extent conflicting) examples:

"Aあア".script => :latin # returns script of first character only

"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values

"Aあア".property(:script) => :latin # returns specified property of first character only

"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values

"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
                        # returns arrays of property values, one array per character

The interface is still in flux, comments welcome!
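Until such a method exists, the behaviour in the examples above can be roughly approximated by probing each character against a fixed list of property regexps. The following is only a sketch under stated assumptions: `PROPERTY_TABLES`, `property_of` and `properties_of` are hypothetical names invented here, the tables cover just a handful of values, and this is not how a real implementation (which depends on #13240) would work.

```ruby
# Hypothetical helper tables, not part of Ruby: each property maps candidate
# values to the regexp property class that Onigmo already understands.
PROPERTY_TABLES = {
  script: {
    latin:    /\p{Latin}/,
    hiragana: /\p{Hiragana}/,
    katakana: /\p{Katakana}/,
    han:      /\p{Han}/,
    common:   /\p{Common}/
  },
  general_category: {
    Lu: /\p{Lu}/,
    Ll: /\p{Ll}/,
    Lo: /\p{Lo}/,
    Nd: /\p{Nd}/
  }
}.freeze

# Returns the first listed value whose regexp matches the character,
# or nil if the character falls outside this (deliberately small) table.
def property_of(char, property)
  table = PROPERTY_TABLES.fetch(property)
  match = table.find { |_value, pattern| pattern.match?(char) }
  match ? match.first : nil
end

# One array of property values per character, mirroring the
# "properties" variant in the examples above.
def properties_of(string, properties)
  string.each_char.map do |char|
    properties.map { |property| property_of(char, property) }
  end
end

scripts = properties_of("Aあア", [:script]).flatten
pairs   = properties_of("Aあア", [:script, :general_category])
```

Probing one regexp per candidate value is linear in the number of values per character, which is one reason a direct lookup method would be preferable to this kind of workaround.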

Implementation depends on #13240.

In Python, such functionality is available in the standard library, although with quite limited property coverage and not directly on strings (see https://docs.python.org/3/library/unicodedata.html).


Related issues

Related to Ruby master - Feature #13240: Change Unicode property implementation in Onigmo from inversion lists to direct lookup (Open)
Related to Ruby master - Feature #14618: Add display width method to String for CLI (Open)
