Bug #20617
Closed
/\p{Arabic}/ character property doesn't match certain Arabic characters
Description
I am not sure this is a bug.
On some occasions I have Arabic text, but the Arabic character property does not recognize parts of it as Arabic.
Example:
str = "شغل مرحلة أولى ، جداً؟"
/^\p{Arabic}+$/.match(str).inspect
# => nil
str.chars.reject {|char| /\p{Arabic}/.match(char)}.uniq
# the space, Arabic comma, Arabic question mark, and Arabic fathatan
This isn't a problem, since I defined my own regex to include the missing characters, but I wanted to raise it in case it is, in fact, a bug.
Updated by alanwu (Alan Wu) 6 months ago
- Status changed from Open to Closed
The "Arabic" property is a "script" property, which doesn't include punctuation: https://www.unicode.org/standard/supported.html
Ruby documentation for Unicode properties is here: https://docs.ruby-lang.org/en/3.3/regexp/unicode_properties_rdoc.html
The Regexp class level documentation has more general information about matching with Unicode properties.
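To see why those characters are rejected, it can help to check which script property each one carries. The following is a minimal sketch, not part of the original report: the helper name `script_of` is hypothetical, and it assumes the regexp engine accepts the script names `Common` and `Inherited` in the same way it accepts `Arabic`.

```ruby
# Hypothetical helper: return the first of three script properties
# that the given character matches, or nil if none match.
def script_of(char)
  %w[Arabic Common Inherited].find do |name|
    Regexp.new("\\p{#{name}}").match?(char)
  end
end

# Characters written as escapes so the code points are unambiguous.
samples = {
  "\u0634" => "ARABIC LETTER SHEEN",
  " "      => "SPACE",
  "\u060C" => "ARABIC COMMA",
  "\u061F" => "ARABIC QUESTION MARK",
  "\u064B" => "ARABIC FATHATAN",
}

samples.each do |char, name|
  puts format("U+%04X %-22s %s", char.ord, name, script_of(char) || "other")
end
```

The punctuation marks and the space should report `Common`, and the fathatan `Inherited`, which is why `\p{Arabic}` alone skips them.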
A way to additionally match the punctuation in your test string is by matching their Unicode block:
"شغلمرحلةأولى،جداً؟".chars.all? { /\p{In_Arabic}/.match?(_1) } # => true
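The check above drops the spaces from the test string, since a plain space is not part of the Arabic block. As a sketch of one way to validate a whole string in a single match, the block property can be combined with a whitespace class. The constant names `ARABIC_TEXT` and `SCRIPT_ONLY` are made up for illustration, and the sample string is a shortened version of the reporter's, written with escapes.

```ruby
# Arabic block (U+0600..U+06FF) plus whitespace, anchored to the whole string.
ARABIC_TEXT = /\A[\p{In_Arabic}[:space:]]+\z/

# Same shape, but using the script property instead of the block.
SCRIPT_ONLY = /\A[\p{Arabic}[:space:]]+\z/

# Letters, spaces, Arabic comma (U+060C), fathatan (U+064B), question mark (U+061F).
sample = "\u0634\u063A\u0644 \u060C \u062C\u062F\u0627\u064B\u061F"

puts ARABIC_TEXT.match?(sample) # block-based pattern accepts the punctuation
puts SCRIPT_ONLY.match?(sample) # script-based pattern rejects it
```

Note that a block match is purely a code point range check, so it also accepts digits, symbols, and any other code point in that range.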
Updated by duerst (Martin Dürst) 6 months ago
\p{In_Arabic} may not be enough. There are 8 blocks with a name containing 'Arabic'. For details, see e.g. https://www.unicode.org/Public/15.1.0/ucd/Blocks.txt.
They would be selectable with:
(\p{In_Arabic}|\p{In_Arabic_Extended_A}|\p{In_Arabic_Extended_B}|\p{In_Arabic_Extended_C}|\p{In_Arabic_Mathematical_Alphabetic_Symbols}|\p{In_Arabic_Presentation_Forms_A}|\p{In_Arabic_Presentation_Forms_B}|\p{In_Arabic_Supplement})
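Which of these block names a given Ruby accepts depends on the Unicode version it bundles, and an unknown property name raises a RegexpError. A hedged sketch of building the union defensively follows. The helper `property_supported?` and the constants are hypothetical names, and the block list is simply the one given above.

```ruby
# Block names from the list above. Newer blocks may be unknown to older Rubies.
ARABIC_BLOCKS = %w[
  In_Arabic
  In_Arabic_Supplement
  In_Arabic_Extended_A
  In_Arabic_Extended_B
  In_Arabic_Extended_C
  In_Arabic_Mathematical_Alphabetic_Symbols
  In_Arabic_Presentation_Forms_A
  In_Arabic_Presentation_Forms_B
].freeze

# Hypothetical helper: true if this Ruby can compile \p{name}.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

# Keep only the blocks this Ruby knows, then join them into one character class.
SUPPORTED_BLOCKS = ARABIC_BLOCKS.select { |name| property_supported?(name) }
ANY_ARABIC_BLOCK = Regexp.new("[" + SUPPORTED_BLOCKS.map { |n| "\\p{#{n}}" }.join + "]")

# U+FE8D lies in Arabic Presentation Forms-B, outside the basic Arabic block.
puts ANY_ARABIC_BLOCK.match?("\uFE8D")
puts(/\p{In_Arabic}/.match?("\uFE8D"))
```

A character class is used here instead of alternation. For single-character properties the two should be equivalent, and the class composes more easily with other ranges.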