Feature #2350
Unicode specific functionality on String in 1.9
| Status: | Rejected | Start date: | 11/09/2009 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | - | | |
| Target version: | - | | |
Description
I was wondering if there are any plans to include Unicode-aware methods for Unicode encodings on String? For example, upcase and downcase only handle ASCII characters at the moment.
  cafe = "Café"
  cafe.encoding # => #<Encoding:UTF-8>
  "Café".upcase # => CAFé
Related issues
| related to ruby-trunk - Feature #2034: Consider the ICU Library for Improving and Expanding Unic... | Assigned | 09/03/2009 | ||
| related to ruby-trunk - Bug #4549: Can't start class names with non us-ascii chars | Rejected | 04/02/2011 |
History
Updated by Yukihiro Matsumoto about 2 years ago
Hi,
In message "Re: [ruby-core:26650] [Feature #2350] Unicode specific functionality on String in 1.9" on Mon, 9 Nov 2009 23:29:42 +0900, Manfred Stienstra <redmine@ruby-lang.org> writes:
|I was wondering if there are any plans to include Unicode aware methods for Unicode encodings on String? For example, upcase and downcase only handle ASCII characters at the moment.
|
|cafe = "Café"
|cafe.encoding # => #<Encoding:UTF-8>
|"Café".upcase # => CAFé
As far as I understand, Unicode case conversion requires additional language information, e.g. for the Turkish i, and some conversions do not round-trip, e.g. the German SS. Use the unicode gem instead.
matz.
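The two objections above (language dependence and lossy round trips) can be made concrete with a minimal sketch. It uses a deliberately tiny, hand-written mapping table rather than the real Unicode data, and the names `SPECIAL_UPCASE`, `TURKIC_UPCASE` and `sketch_upcase` are hypothetical; they are not part of Ruby or of the unicode gem.

```ruby
# -*- coding: utf-8 -*-
# Illustrative only: hand-written tables, not the Unicode character database.
SPECIAL_UPCASE = { "ß" => "SS", "é" => "É" }.freeze
# Turkish and Azerbaijani pair dotted and dotless i differently from other languages.
TURKIC_UPCASE = { "i" => "İ", "ı" => "I" }.freeze

# Hypothetical helper: upcase +str+ according to +lang+ (:tr selects the Turkic table).
def sketch_upcase(str, lang = nil)
  chars = str.each_char.map do |ch|
    if lang == :tr && TURKIC_UPCASE.key?(ch)
      TURKIC_UPCASE[ch]
    else
      SPECIAL_UPCASE.fetch(ch) { ch.upcase }
    end
  end
  chars.join
end

upper = sketch_upcase("straße")   # "STRASSE"
upper.downcase                    # "strasse", not "straße": no round trip
sketch_upcase("i")                # "I"
sketch_upcase("i", :tr)           # "İ"
```

The same input character gives two different results depending on an argument that the String itself does not carry, which is the extra language information mentioned above.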
Updated by Manfred Stienstra about 2 years ago
Yes, case conversions require the Unicode database and specific locale implementations. Thank you for your answer!
Updated by Yusuke Endoh almost 2 years ago
- Status changed from Open to Rejected
Hi,
Matz seemed to reject this ticket, and OP seemed to be satisfied with matz's answer. So I close the ticket.
--
Yusuke Endoh <mame@tsg.ne.jp>
Updated by Nikolai Weibull almost 2 years ago
On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh <redmine@ruby-lang.org> wrote:
> Issue #2350 has been updated by Yusuke Endoh.
> Matz seemed to reject this ticket, and OP seemed to be satisfied
> with matz's answer. So I close the ticket.
How would I be able to hook in my character-encodings library into Ruby 1.9 Strings? I would like to override, for example, #upcase for all Strings that have a Unicode encoding. Is this possible?
Thanks!
Updated by Yui NARUSE almost 2 years ago
(2010/03/26 0:02), Nikolai Weibull wrote:
> On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh<redmine@ruby-lang.org> wrote:
>> Issue #2350 has been updated by Yusuke Endoh.
>
>> Matz seemed to reject this ticket, and OP seemed to be satisfied
>> with matz's answer. So I close the ticket.
>
> How would I be able to hook in my character-encodings library into
> Ruby 1.9 Strings? I would like to override, for example, #upcase for
> all Strings that have a Unicode encoding. Is this possible?
You can hook String methods; Ruby doesn't forbid it. But I think people want both an ASCII version and a Unicode version of upcase, so you should give your Unicode methods different names.
--
NARUSE, Yui <naruse@airemix.jp>
Updated by Nikolai Weibull almost 2 years ago
On Thu, Mar 25, 2010 at 18:24, NARUSE, Yui <naruse@airemix.jp> wrote:
> (2010/03/26 0:02), Nikolai Weibull wrote:
>>
>> On Thu, Mar 25, 2010 at 14:45, Yusuke Endoh<redmine@ruby-lang.org> wrote:
>>>
>>> Issue #2350 has been updated by Yusuke Endoh.
>>
>>> Matz seemed to reject this ticket, and OP seemed to be satisfied
>>> with matz's answer. So I close the ticket.
>>
>> How would I be able to hook in my character-encodings library into
>> Ruby 1.9 Strings? I would like to override, for example, #upcase for
>> all Strings that have a Unicode encoding. Is this possible?
>
> You can hook String methods, Ruby doesn't forbid it.
Yes, I can do something like
  class String
    def unicodify
      extend Encoding::Character::Unicode
    end
  end
but I was wondering if there was a way to do it without having to do
  String.new.unicodify.upcase
> But I think, people want both ASCII version and Unicode version of upcase.
> So you should name your Unicode methods another names.
Why would they want that? Having an ASCII-only version of #upcase
makes no sense for a Unicode String, other than that supporting a
Unicode-aware #upcase requires that you load the Unicode character
database information, which takes up quite a lot of memory.
I want to transparently deal with this kind of thing. I know that the
Ruby way is to be explicit about encodings and I actually like that,
but that’s only something I care about at creation, not when invoking
methods on the String.
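For readers who want to try the per-object `extend` approach from the message above, here is a self-contained sketch. `SketchUnicodeCase` is a hypothetical stand-in for `Encoding::Character::Unicode` from the character-encodings library, whose real implementation is not reproduced here, and its two-entry table is only an illustration.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical stand-in for Encoding::Character::Unicode.
module SketchUnicodeCase
  # Tiny illustrative table instead of the Unicode character database.
  UPCASE_TABLE = { "ä" => "Ä", "é" => "É" }.freeze

  # Shadows String#upcase only on objects that were extended with this module.
  def upcase
    each_char.map { |ch| UPCASE_TABLE.fetch(ch) { ch.upcase } }.join
  end
end

class String
  # Object#extend returns the receiver, so calls can be chained.
  def unicodify
    extend SketchUnicodeCase
  end
end

plain   = "äbc".dup             # dup: extend needs an unfrozen String
unicode = "äbc".dup.unicodify

unicode.upcase                      # "ÄBC"
unicode.is_a?(SketchUnicodeCase)    # true
plain.is_a?(SketchUnicodeCase)      # false: the behaviour is opt-in per object
```

This shows why the approach feels heavy: every String has to be extended individually, and Strings derived from it (such as the result of `upcase`) are plain Strings again.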
Updated by Nikolai Weibull 11 months ago
On Thu, Mar 25, 2010 at 19:33, Nikolai Weibull <now@bitwi.se> wrote:
> On Thu, Mar 25, 2010 at 18:24, NARUSE, Yui <naruse@airemix.jp> wrote:
>> (2010/03/26 0:02), Nikolai Weibull wrote:
> I was wondering if there was a way to do it without having to do
>
> String.new.unicodify.upcase
>> But I think, people want both ASCII version and Unicode version of upcase.
>> So you should name your Unicode methods another names.
> Why would they want that? Having an ASCII-only version of #upcase
> makes no sense for a Unicode String more than supporting #upcase
> requires that you load the Unicode character database information,
> which takes up quite a lot of memory.
So, what’s the reasoning here? Having "äbc".upcase return "äBC" makes absolutely no sense and means that quite a few methods on String are completely useless in a m18n context.
Updated by Magnus Holm 11 months ago
The problem is that the definition of #upcase doesn't only depend on the encoding used, but also on the language of the encoded text. For instance, if you're writing in Turkish, you would expect "i".upcase to return a dotted uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html
Doing this properly is *really* hard and needs to have a lot of flexibility, especially when it comes to non-Western languages. It's far easier for everyone that the built-in #upcase is simple and fast and you'll have to be explicit about any other I18n stuff IMO.
// Magnus Holm
Updated by Nikolai Weibull 11 months ago
On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr@gmail.com> wrote:
> The problem is that the definition of #upcase doesn't only depend on the
> encoding used, but also the language of the encoded text. For instance, if
> you're writing in Turkish, you would expect "i".upcase to return a dotted
> uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html
I know. The same goes for ‘i’ in Lithuanian.
> Doing this properly is *really* hard and needs to have a lot of flexibility,
> especially when it comes to non-Western languages.
This is simply not true. Unicode defines how to deal with case conversions. I’m not saying that the Unicode standard is infallible, but we can at least adhere to it. I’m not saying that Unicode is the only encoding that we should care about, but if we support the Unicode transfer formats, why not support other interesting parts of the standard?
> It's far easier for everyone that the built-in #upcase is
> simple and fast and you'll have to be explicit about any
> other I18n stuff IMO.
Easy, perhaps, but hardly useful. My point is that the current #upcase (and similar methods) is basically useless for anything other than ASCII. I was looking for an actual solution to this problem. I have a library (character-encodings) that does support these conversions, based on locale and the Unicode character database (UCD). How do we make it easy for the user to deal with m18n? I mean, if I say
  # -*- coding: utf-8 -*-
  puts "äbc".upcase
I expect this to do the right thing for Unicode under the current locale. As Unicode defines how to deal with case conversions, if I tell Ruby that “this String is encoded as UTF-8” (or, in this case, “strings in this file are encoded as UTF-8”), I expect Ruby to respond “OK, I’ll use the Unicode rules that govern methods like #upcase for that String”.
The UCD requires a lot of memory, so I suggested that a library, such as character-encodings, should be able to seamlessly add this kind of behavior without requiring the user to write "äbc".unicodify.upcase, if the UCD can’t be included in the standard Ruby runtime. But, come to think of it, doesn’t Oniguruma need most of the UCD information, so isn’t most of it already included in the Ruby runtime? Adding casing information perhaps wouldn’t require much additional space.
If this isn’t of interest, then I’m still looking for a way to override #upcase for Strings that use the UTF-8 encoding without resorting to alias_method or extend (as shown earlier in this discussion). This seems impossible to do at the moment, as Encoding is a completely opaque object.
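Short of overriding String#upcase itself, one possible workaround is to dispatch on the String's encoding in a separate helper, since an Encoding object can at least be compared against constants such as `Encoding::UTF_8`. The sketch below assumes this is acceptable; `EncodingAwareCase` and its tiny table are hypothetical and are not taken from character-encodings or from Ruby.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helper module; not part of Ruby or of character-encodings.
module EncodingAwareCase
  UNICODE_ENCODINGS = [
    Encoding::UTF_8,
    Encoding::UTF_16LE, Encoding::UTF_16BE,
    Encoding::UTF_32LE, Encoding::UTF_32BE
  ].freeze

  # Tiny illustrative table standing in for the Unicode character database.
  UPCASE_TABLE = { "ä" => "Ä", "é" => "É" }.freeze

  # Picks the casing rules from the String's encoding:
  # table lookup for Unicode encodings, ASCII-only mapping for everything else.
  def self.upcase(str)
    return str.tr("a-z", "A-Z") unless UNICODE_ENCODINGS.include?(str.encoding)

    utf8   = str.encode(Encoding::UTF_8)
    mapped = utf8.each_char.map { |ch| UPCASE_TABLE.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
    mapped.encode(str.encoding)
  end
end

EncodingAwareCase.upcase("äbc")                               # "ÄBC"
EncodingAwareCase.upcase("äbc".encode(Encoding::ISO_8859_1))  # "äBC" in ISO-8859-1
```

This keeps the built-in method untouched, at the cost of callers having to go through the helper, which is exactly the explicitness the message above would like to avoid.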
Updated by Cezary Baginski 11 months ago
On Fri, Mar 18, 2011 at 09:52:27PM +0900, Nikolai Weibull wrote:
> On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr@gmail.com> wrote:
> > It's far easier for everyone that the built-in #upcase is
> > simple and fast and you'll have to be explicit about any
> > other I18n stuff IMO.
>
> Easy, perhaps, but hardly useful.
I agree - for human interaction it is completely useless. I tend to think of #upcase as just a convenience method for dealing with ASCII-only system-level functionality, e.g. paths on filesystems, environment variables, HTML tags, (un)capitalizing to get class names, database table names, etc. Anything else is "no-op" or "undefined" for me.
> My point is that the current #upcase (and similar methods) is
> basically useless for anything other than ASCII.
I would probably go one step further and disallow upcase and friends for any non-US-ASCII string for this reason. At least issue a warning.
> If this isn’t of interest, then I’m still looking for a way to
> override #upcase for Strings that use the UTF-8 encoding without
> resorting to alias_method or extend (as shown earlier in this
> discussion). This seems impossible to do at the moment, as Encoding
> is a completely opaque object.
Correct me if I am wrong, but even "upper case" as a concept is not common among all languages - an implementation detail for specific cases at best. For example, in German, you may want a more meaningful 'to_noun' instead of 'capitalize'. For Japanese some may want upcase as a no-op and some as a hack to convert to katakana. For case insensitivity, probably a "normalize" method would be more descriptive.
Out of curiosity: in what specific case is utf upcase necessary?
--
Cezary Baginski
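The "disallow or at least warn" idea above could look roughly like the following. This is only a sketch of the suggestion, not an existing Ruby API: `strict_ascii_upcase` and its `on_non_ascii` parameter are hypothetical names, while `String#ascii_only?` and `Kernel#warn` are standard.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helper illustrating the "refuse or warn" idea; not a Ruby API.
# on_non_ascii is :raise (default) or :warn.
def strict_ascii_upcase(str, on_non_ascii = :raise)
  unless str.ascii_only?
    message = "upcase is only defined for ASCII here: #{str.inspect}"
    raise ArgumentError, message if on_non_ascii == :raise
    warn message
  end
  str.tr("a-z", "A-Z")   # ASCII-only mapping, independent of the Ruby version
end

strict_ascii_upcase("content-type")   # "CONTENT-TYPE"
strict_ascii_upcase("äbc", :warn)     # prints a warning, returns "äBC"

begin
  strict_ascii_upcase("äbc")          # default mode refuses non-ASCII input
rescue ArgumentError => e
  e.message                           # explains that the input is not ASCII-only
end
```

A helper like this makes the ASCII-only contract explicit for the system-level uses listed above (paths, environment variables, HTML tags) instead of silently returning a partially converted String.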
Updated by Nikolai Weibull 11 months ago
On Tue, Mar 22, 2011 at 18:30, Cezary <cezary.baginski@gmail.com> wrote:
> On Fri, Mar 18, 2011 at 09:52:27PM +0900, Nikolai Weibull wrote:
>> My point is that the current #upcase (and similar methods) is
>> basically useless for anything other than ASCII.
> I would probably go one step further and disallow upcase and friends
> for any non-US-ASCII string for this reason. At least issue a warning.
For Unicode there actually are well-defined casing rules.
> For example, in German, you may want a more meaningful 'to_noun'
> instead of 'capitalize'. For Japanese some may want upcase as a no-op
> and some as a hack to convert to katakana. For case insensitivity,
> probably a "normalize" method would be more descriptive.
This is perhaps true, but beside the point.
> Out of curiosity: in what specific case is utf upcase necessary?
That’s a good question. It’s perhaps not a common operation, but text editors and regular expression engines most likely need it. Even if their utility is limited, returning incorrect results is worse.
Updated by Carl Hörberg 10 months ago
We need it to allow class names in foreign languages. Today "Åtgärd" isn't recognized as a constant, and therefore can't be used as a class name.
Updated by Yui NARUSE 10 months ago
About String#upcase, our current answer is simple: use ICU. https://github.com/jarib/ffi-icu