Bug #7154

For whatever reason \s doesn't match \u00a0.

Added by Todor Dragnev over 1 year ago. Updated over 1 year ago.

[ruby-core:47963]
Status:Closed
Priority:Normal
Assignee:-
Category:core
Target version:-
ruby -v:1.9.3p286
Backport:

Description

The problem is already explained here:

http://stackoverflow.com/questions/2588942/convert-non-breaking-spaces-to-spaces-in-ruby

I just hit it today.

History

#1 Updated by Martin Dürst over 1 year ago

My understanding is that in Ruby, all the pre-Unicode escapes, and in
particular "\s", still refer only to characters in the ASCII range.

My understanding is that this was done on purpose, for backwards
compatibility. The reasoning goes as follows: somebody may have written
a script that needed to match ASCII 'space' characters, and used \s for
that. If Ruby changed \s to suddenly match far more than before, the
meaning of that program would change. Maybe it would change in just the
right way. But maybe it would change in an unintended way.

So the decision was to not second-guess the programmer. As a result,
Ruby does not behave the way Unicode TR #18 suggests. But please note
that UTR #18 doesn't require \s to be treated as Unicode whitespace; it
only recommends doing so (see
http://www.unicode.org/reports/tr18/#Compatibility_Properties).

If you want to match against Unicode whitespace, what you should do is
the following:

"\u00a0" =~ /\p{Whitespace}/u

Regards, Martin.
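
The difference described in the comment above can be checked with a short script. This is only a sketch, assuming Ruby 1.9 or later and UTF-8 strings. It uses [[:space:]] and \p{Space}, both listed in Ruby's Regexp documentation as Unicode-aware, instead of the \p{Whitespace} spelling shown above. The nbsp_to_space helper is hypothetical and not part of Ruby; it is one possible way to do the conversion asked about in the linked Stack Overflow question.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 1.9 or later with UTF-8 strings.

NBSP = "\u00a0" # NO-BREAK SPACE

# \s covers only ASCII whitespace, so it does not match U+00A0.
p(NBSP =~ /\s/)           # => nil

# The Unicode-aware alternatives do match it.
p(NBSP =~ /[[:space:]]/)  # => 0
p(NBSP =~ /\p{Space}/)    # => 0

# Hypothetical helper (not part of Ruby): replace each
# non-breaking space with a plain ASCII space.
def nbsp_to_space(text)
  text.gsub("\u00a0", " ")
end

p nbsp_to_space("a\u00a0b") # => "a b"
```

Replacing only U+00A0, rather than every [[:space:]] match, leaves tabs and newlines in the text untouched.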

On 2012/10/14 8:37, t0d0r (Todor Dragnev) wrote:

Issue #7154 has been reported by t0d0r (Todor Dragnev).


Bug #7154: For whatever reason \s doesn't match \u00a0.
https://bugs.ruby-lang.org/issues/7154

Author: t0d0r (Todor Dragnev)
Status: Open
Priority: Normal
Assignee:
Category: core
Target version:
ruby -v: 1.9.3p286

The problem is already explained here:

http://stackoverflow.com/questions/2588942/convert-non-breaking-spaces-to-spaces-in-ruby

I just hit it today.

#2 Updated by Martin Dürst over 1 year ago

  • Status changed from Open to Closed

My understanding is that this is a feature. See the previous post for an explanation. I hope somebody can pass this feedback on to http://stackoverflow.com/questions/2588942/convert-non-breaking-spaces-to-spaces-in-ruby.

#3 Updated by Martin Dürst over 1 year ago

I forgot to mention that the pickaxe book, for "\s", says "For
Unicode, add Line_Separator codepoints.".

This is wrong, because even LINE SEPARATOR itself, \u2028, doesn't match
\s. Such a rule would also be odd in itself: \s would match ASCII
whitespace and Unicode line separators, while all other Unicode
whitespace would be ignored.

Regards, Martin.
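
The claim about LINE SEPARATOR can be verified the same way. This is a sketch, assuming Ruby 1.9 or later and UTF-8 strings. \p{Zl} is the Unicode general category Line_Separator, and [[:space:]] is the Unicode-aware POSIX bracket.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 1.9 or later with UTF-8 strings.

LINE_SEPARATOR = "\u2028" # LINE SEPARATOR, general category Zl

# \s is ASCII-only, so the line separator is not matched.
p(LINE_SEPARATOR =~ /\s/)           # => nil

# Matching by general category or by the Unicode-aware POSIX bracket works.
p(LINE_SEPARATOR =~ /\p{Zl}/)       # => 0
p(LINE_SEPARATOR =~ /[[:space:]]/)  # => 0
```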


#4 Updated by Todor Dragnev over 1 year ago

duerst (Martin Dürst) wrote:

My understanding is that this is a feature. See the previous post for an explanation. I hope somebody can pass this feedback on to http://stackoverflow.com/questions/2588942/convert-non-breaking-spaces-to-spaces-in-ruby.

My understanding is that:

  • We are surrounded by Unicode text; most Internet pages and documents are UTF-8. If a language doesn't adapt to its surrounding environment, it will be replaced by a new one that provides better tools for the real situation. Not everyone in the world uses the English alphabet for their primary language...

  • We are all human. To me, "white space" means white space in the text. In this case, with \u00a0, I had to open a hex editor to see what was wrong. I like the simplicity of Ruby and writing less code. All good and popular programming languages are designed to help humans; complexity kills popularity. Do you know anyone near you who writes assembler these days?

  • "String".downcase produces "string", so "Стринг".downcase should produce "стринг", but it doesn't. OK, that's understandable for 1.8.x, which has no multibyte support. But why, in 1.9.x, do I need a separate library to get proper results? UnicodeUtils.downcase("Стринг") works fine (thanks, Stefan Lang). Maybe Ruby wants to become the next PHP, with 10 methods doing one thing? http://www.tnx.nl/php.html. For me (and maybe others), downcase, upcase, \s and similar in 1.9.x are useless... Why do we have multibyte support without multi-language awareness? This seems odd to me as a human...

  • Firefox has lots of features and is now going to die, because they didn't listen to users' warnings about memory management... :)
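
For reference, the downcase behaviour discussed in the list above can be checked with a short script. This is a sketch that assumes Ruby 2.4 or later, where String#downcase applies Unicode case mapping by default. On 1.9.3, the version this issue was filed against, the comment above reports that the Cyrillic string is returned unchanged. The :ascii option restricts the mapping to A-Z, which gives that older ASCII-only result.

```ruby
# encoding: utf-8
# Sketch only: assumes Ruby 2.4 or later (Unicode case mapping in String#downcase).

# ASCII letters are lowercased on every Ruby version.
puts "String".downcase          # prints "string"

# On Ruby 2.4+, non-ASCII letters are lowercased as well.
puts "Стринг".downcase          # prints "стринг"

# The :ascii option limits case mapping to A-Z, so the
# Cyrillic letters are left untouched.
puts "Стринг".downcase(:ascii)  # prints "Стринг"
```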
