Feature #2350

Unicode specific functionality on String in 1.9

Added by Manfred Stienstra about 2 years ago. Updated 9 months ago.

[ruby-core:26650]
Status: Rejected
Start date: 11/09/2009
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

I was wondering if there are any plans to include Unicode-aware methods for Unicode encodings on String? For example, upcase and downcase only handle ASCII characters at the moment.

cafe = "Café"
cafe.encoding # => #<Encoding:UTF-8>
cafe.upcase   # => "CAFé"

Related issues

related to ruby-trunk - Feature #2034: Consider the ICU Library for Improving and Expanding Unic... Assigned 09/03/2009
related to ruby-trunk - Bug #4549: Can't start class names with non us-ascii chars Rejected 04/02/2011

History

Updated by Yukihiro Matsumoto about 2 years ago

Hi,

In message "Re: [ruby-core:26650] [Feature #2350] Unicode specific functionality on String in 1.9"
    on Mon, 9 Nov 2009 23:29:42 +0900, Manfred Stienstra <redmine@ruby-lang.org> writes:

|I was wondering if there are any plans to include Unicode-aware methods for Unicode encodings on String? For example, upcase and downcase only handle ASCII characters at the moment.
|
|cafe = "Café"
|cafe.encoding # => #<Encoding:UTF-8>
|"Café".upcase # => CAFé

As far as I understand, Unicode case conversion requires
additional language information, e.g. for the Turkish i.  And some
conversions do not round-trip, e.g. German ß, which upcases to SS.
Use the unicode gem instead.
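
The two objections can be made concrete with a minimal sketch. The tables and the method names `sketch_upcase` and `sketch_downcase` are hypothetical and hand-written for illustration; they cover only the characters mentioned in this thread and are not how Ruby or the unicode gem implement case mapping.

```ruby
# Hypothetical, hand-written tables: a tiny stand-in for the Unicode
# character database, covering only the characters discussed here.
UPCASE_DEFAULT   = { "é" => "É", "ä" => "Ä", "ß" => "SS", "ı" => "I" }.freeze
UPCASE_TURKISH   = UPCASE_DEFAULT.merge("i" => "İ").freeze
DOWNCASE_DEFAULT = { "É" => "é", "Ä" => "ä" }.freeze

# Upcase +str+, consulting the table for the given locale first and
# falling back to plain ASCII mapping for everything else.
def sketch_upcase(str, locale = :default)
  table = locale == :tr ? UPCASE_TURKISH : UPCASE_DEFAULT
  str.each_char.map { |ch| table.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

# Downcase +str+ with the default (locale-independent) table.
def sketch_downcase(str)
  str.each_char.map { |ch| DOWNCASE_DEFAULT.fetch(ch) { ch.tr("A-Z", "a-z") } }.join
end

sketch_upcase("i")                          # => "I"
sketch_upcase("i", :tr)                     # => "İ" (needs language information)
sketch_downcase(sketch_upcase("straße"))    # => "strasse" (does not round-trip)
```

The same input character yields different results depending on the locale, and the ß case shows that upcasing followed by downcasing need not return the original string.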

							matz.

Updated by Manfred Stienstra about 2 years ago

Yes, case conversions require the Unicode database and specific locale implementations. Thank you for your answer!

Updated by Yusuke Endoh almost 2 years ago

  • Status changed from Open to Rejected
Hi,

Matz seemed to reject this ticket, and OP seemed to be satisfied
with matz's answer.  So I close the ticket.

-- 
Yusuke Endoh <mame@tsg.ne.jp>

Updated by Nikolai Weibull almost 2 years ago

On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh <redmine@ruby-lang.org> wrote:
> Issue #2350 has been updated by Yusuke Endoh.

> Matz seemed to reject this ticket, and OP seemed to be satisfied
> with matz's answer.  So I close the ticket.

How would I be able to hook in my character-encodings library into
Ruby 1.9 Strings?  I would like to override, for example, #upcase for
all Strings that have a Unicode encoding.  Is this possible?

Thanks!

Updated by Yui NARUSE almost 2 years ago

(2010/03/26 0:02), Nikolai Weibull wrote:
> On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh<redmine@ruby-lang.org>  wrote:
>> Issue #2350 has been updated by Yusuke Endoh.
>
>> Matz seemed to reject this ticket, and OP seemed to be satisfied
>> with matz's answer.  So I close the ticket.
>
> How would I be able to hook in my character-encodings library into
> Ruby 1.9 Strings?  I would like to override, for example, #upcase for
> all Strings that have a Unicode encoding.  Is this possible?

You can hook String methods; Ruby doesn't forbid it.

But I think people want both an ASCII version and a Unicode version of upcase.
So you should give your Unicode methods different names.
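
As a rough sketch of that suggestion, a library could add a separately named method and leave the built-in one alone. The method name `unicode_upcase` and the module `SketchCaseTables` are hypothetical, and the three-entry table merely stands in for a real Unicode case-mapping library.

```ruby
# Hypothetical stand-in for a real Unicode case-mapping library.
module SketchCaseTables
  UPPER = { "ä" => "Ä", "é" => "É", "ß" => "SS" }.freeze
end

class String
  # Hypothetical name: the built-in #upcase stays untouched, so callers
  # choose explicitly between the ASCII and the Unicode behavior.
  def unicode_upcase
    each_char.map { |ch| SketchCaseTables::UPPER.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
  end
end

"äbc".unicode_upcase   # => "ÄBC"
"äbc".tr("a-z", "A-Z") # => "äBC" (explicit ASCII-only mapping)
```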

-- 
NARUSE, Yui  <naruse@airemix.jp>

Updated by Nikolai Weibull almost 2 years ago

On Thu, Mar 25, 2010 at 18:24, NARUSE, Yui <naruse@airemix.jp> wrote:
> (2010/03/26 0:02), Nikolai Weibull wrote:
>>
>> On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh<redmine@ruby-lang.org>  wrote:
>>>
>>> Issue #2350 has been updated by Yusuke Endoh.
>>
>>> Matz seemed to reject this ticket, and OP seemed to be satisfied
>>> with matz's answer.  So I close the ticket.
>>
>> How would I be able to hook in my character-encodings library into
>> Ruby 1.9 Strings?  I would like to override, for example, #upcase for
>> all Strings that have a Unicode encoding.  Is this possible?
>
> You can hook String methods, Ruby doesn't forbid it.

Yes, I can do something like

class String
  def unicodify
    extend Encoding::Character::Unicode
  end
end

but I was wondering if there was a way to do it without having to do

String.new.unicodify.upcase

> But I think, people want both ASCII version and Unicode version of upcase.
> So you should name your Unicode methods another names.

Why would they want that?  An ASCII-only version of #upcase makes no
sense for a Unicode String, except that supporting #upcase properly
requires loading the Unicode character database information, which
takes up quite a lot of memory.

I want to transparently deal with this kind of thing.  I know that the
Ruby way is to be explicit about encodings and I actually like that,
but that’s only something I care about at creation, not when invoking
methods on the String.

Updated by Nikolai Weibull 11 months ago

On Thu, Mar 25, 2010 at 19:33, Nikolai Weibull <now@bitwi.se> wrote:
> On Thu, Mar 25, 2010 at 18:24, NARUSE, Yui <naruse@airemix.jp> wrote:
>> (2010/03/26 0:02), Nikolai Weibull wrote:

> I was wondering if there was a way to do it without having to do
>
> String.new.unicodify.upcase

>> But I think, people want both ASCII version and Unicode version of upcase.
>> So you should name your Unicode methods another names.

> Why would they want that?  Having an ASCII-only version of #upcase
> makes no sense for a Unicode String more than supporting #upcase
> requires that you load the Unicode character database information,
> which takes up quite a lot of memory.

So, what’s the reasoning here?  Having "äbc".upcase return "äBC" makes
absolutely no sense and means that quite a few methods on String are
completely useless in an m17n context.

Updated by Magnus Holm 11 months ago

The problem is that the definition of #upcase doesn't only depend on the
encoding used, but also on the language of the encoded text. For instance, if
you're writing in Turkish, you would expect "i".upcase to return a dotted
uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html

Doing this properly is *really* hard and needs a lot of flexibility,
especially when it comes to non-Western languages. It's far easier for
everyone if the built-in #upcase is simple and fast and you have to be
explicit about any other I18n stuff, IMO.

// Magnus Holm

Updated by Nikolai Weibull 11 months ago

On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr@gmail.com> wrote:
> The problem is that the definition of #upcase doesn't only depend on the
> encoding used, but also the language of the encoded text. For instance, if
> you're writing in Turkish, you would expect "i".upcase to return a dotted
> uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html

I know.  The same goes for ‘i’ in Lithuanian.

> Doing this properly is *really* hard and needs to have a lot of flexibility,
> especially when it comes to non-Western languages.

This is simply not true.  Unicode defines how to deal with case
conversions.  I’m not saying that the Unicode standard is infallible,
but we can at least adhere to it.  I’m not saying that Unicode is the
only encoding that we should care about, but if we support the Unicode
transfer formats, why not support other interesting parts of the
standard?

> It's far easier for everyone that the built-in #upcase is
> simple and fast and you'll have to be explicit about any
> other I18n stuff IMO.

Easy, perhaps, but hardly useful.

My point is that the current #upcase (and similar methods) is
basically useless for anything other than ASCII.  I was looking for an
actual solution to this problem.  I have a library
(character-encodings) that does support these conversions, based on
locale and the Unicode character database (UCD).  How do we make it
easy for the user to deal with m17n?  I mean, if I say

# -*- coding: utf-8 -*-

puts "äbc".upcase

I expect this to do the right thing for Unicode under the current locale.

As Unicode defines how to deal with case conversions, if I tell Ruby
that “this String is encoded as UTF-8” (or, in this case, “strings in
this file are encoded as UTF-8”), I expect Ruby to respond “OK, I’ll
use the Unicode rules that govern methods like #upcase for that
String”.

The UCD requires a lot of memory, so I suggested that a library, such
as character-encodings, should be able to seamlessly add this kind of
behavior without requiring the user to write "äbc".unicodify.upcase,
if the UCD can’t be included in standard Ruby runtime.

But, come to think of it, doesn’t Oniguruma need most of the UCD
information, so isn’t most of it already included in the Ruby runtime?
Adding casing information perhaps wouldn’t require much additional
space.

If this isn’t of interest, then I’m still looking for a way to
override #upcase for Strings that use the UTF-8 encoding without
resorting to alias_method or extend (as shown earlier in this
discussion).  This seems impossible to do at the moment, as Encoding
is a completely opaque object.
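
For readers on later Ruby versions, one way to get this kind of transparent override is Module#prepend, which was added in Ruby 2.0 and so was not available when this was written. The following is only a sketch under that assumption: the module name `SketchUnicodeUpcase` and its tiny table are hypothetical placeholders for a real library such as character-encodings.

```ruby
# Hypothetical, deliberately tiny mapping table; a real library would
# consult the Unicode character database instead.
module SketchUnicodeUpcase
  TABLE = { "ä" => "Ä", "é" => "É", "ß" => "SS" }.freeze

  def upcase(*options)
    # Fall back to the built-in method for non-UTF-8 strings, or when
    # the caller passes explicit options.
    return super unless encoding == Encoding::UTF_8 && options.empty?

    each_char.map { |ch| TABLE.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
  end
end

class String
  # Requires Ruby 2.0+; the prepended module is consulted before
  # String's own #upcase, and `super` reaches the original method.
  prepend SketchUnicodeUpcase
end

"äbc".upcase # => "ÄBC"
```

No alias_method or per-object extend is needed, because the dispatch on encoding happens inside the prepended method.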

Updated by Cezary Baginski 11 months ago

On Fri, Mar 18, 2011 at 09:52:27PM +0900, Nikolai Weibull wrote:
> On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr@gmail.com> wrote:
> > It's far easier for everyone that the built-in #upcase is
> > simple and fast and you'll have to be explicit about any
> > other I18n stuff IMO.
>
> Easy, perhaps, but hardly useful.

I agree - for human interaction it is completely useless. I tend to
think of #upcase as just a convenience method for dealing with ASCII
only system level functionality, e.g. paths on filesystems,
environment variables, html tags, (un)capitalizing to get class names,
database table names, etc.

Anything else is "no-op" or "undefined" for me.

> My point is that the current #upcase (and similar methods) is
> basically useless for anything other than ASCII.

I would probably go one step further and disallow upcase and friends
for any non-US-ASCII string for this reason. At least issue a warning.
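
A minimal sketch of that stricter behavior, written as a separate helper so it does not change #upcase itself. The method name `strict_upcase` is hypothetical; String#ascii_only? is available from Ruby 1.9 on.

```ruby
class String
  # Hypothetical helper: refuses to upcase anything outside US-ASCII
  # instead of silently leaving non-ASCII characters unchanged.
  def strict_upcase
    unless ascii_only?
      raise ArgumentError, "strict_upcase only handles US-ASCII text: #{inspect}"
    end

    tr("a-z", "A-Z")
  end
end

"path".strict_upcase # => "PATH"
# "Café".strict_upcase raises ArgumentError
```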

> If this isn’t of interest, then I’m still looking for a way to
> override #upcase for Strings that use the UTF-8 encoding without
> resorting to alias_method or extend (as shown earlier in this
> discussion).  This seems impossible to do at the moment, as Encoding
> is a completely opaque object.

Correct me if I am wrong, but even "upper case" as a concept is not
common among all languages - an implementation detail for specific
cases at best.

For example, in German, you may want a more meaningful 'to_noun'
instead of 'capitalize'. For Japanese some may want upcase as a no-op
and some as a hack to convert to katakana. For case insensitivity,
probably a "normalize" method would be more descriptive.

Out of curiosity: in what specific case is utf upcase necessary?

-- 
Cezary Baginski

Updated by Nikolai Weibull 11 months ago

On Tue, Mar 22, 2011 at 18:30, Cezary <cezary.baginski@gmail.com> wrote:
> On Fri, Mar 18, 2011 at 09:52:27PM +0900, Nikolai Weibull wrote:

>> My point is that the current #upcase (and similar methods) is
>> basically useless for anything other than ASCII.

> I would probably go one step further and disallow upcase and friends
> for any non-US-ASCII string for this reason. At least issue a warning.

For Unicode there actually are well-defined casing rules.

> For example, in German, you may want a more meaningful 'to_noun'
> instead of 'capitalize'. For Japanese some may want upcase as a no-op
> and some as a hack to convert to katakana. For case insensitivity,
> probably a "normalize" method would be more descriptive.

This is perhaps true, but beside the point.

> Out of curiosity: in what specific case is utf upcase necessary?

That’s a good question.  It’s perhaps not a common operation, but text
editors and regular expression engines most likely need it.  Even if
their utility is limited, returning incorrect results is worse.

Updated by Carl Hörberg 10 months ago

We need it to allow class names in foreign languages. Today "Åtgärd" isn't recognized as a constant, and therefore can't be used as a class name.

Updated by Yui NARUSE 10 months ago

About String#upcase, our current answer is simple: use ICU. https://github.com/jarib/ffi-icu
