Bug #13321: String#codepoints for one-byte encodings - Ruby - Ruby Issue Tracking System

closed

Added by InfraRuby (InfraRuby Vision) about 9 years ago. Updated about 9 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v:
Backport: 2.2: UNKNOWN, 2.3: UNKNOWN, 2.4: UNKNOWN

[ruby-core:80194]

Description

On many versions of Ruby, including 2.4.0:

"\x80".force_encoding("WINDOWS-1252").codepoints.first # => 0x80

I expected 0x20AC: https://en.wikipedia.org/wiki/Windows-1252

See:
https://github.com/ruby/ruby/blob/v2_4_0/string.c#L7817-L7818
https://github.com/ruby/ruby/blob/v2_4_0/string.c#L422-L424

Updated by InfraRuby (InfraRuby Vision) about 9 years ago
#1

Description updated (diff)

Updated by nobu (Nobuyoshi Nakada) about 9 years ago
#2 [ruby-core:80198]

Status changed from Open to Rejected

0x20AC is the euro sign in Unicode; it is 0x80 in Windows-1252.
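
To illustrate the distinction, here is a minimal sketch (the variable names are only for illustration). It assumes that String#codepoints reports values in the string's own encoding, as stated above, and that transcoding to UTF-8 yields the Unicode value:

```ruby
# The byte 0x80, tagged as Windows-1252, is the euro sign in that encoding.
euro = "\x80".dup.force_encoding("WINDOWS-1252")

# String#codepoints reports the code point in the string's own encoding.
native = euro.codepoints.first

# Transcoding to UTF-8 first yields the Unicode code point instead.
unicode = euro.encode("UTF-8").codepoints.first

puts format("Windows-1252: 0x%X, Unicode: 0x%X", native, unicode)
# prints "Windows-1252: 0x80, Unicode: 0x20AC"
```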

Updated by InfraRuby (InfraRuby Vision) about 9 years ago
#3 [ruby-core:80200]

That's surprising to me but I can see that. Thanks!

Updated by duerst (Martin Dürst) about 9 years ago
#4 [ruby-core:80201]

I tried to improve the documentation with r58000. Please tell me if that helps, or if further explanations are needed.

Updated by InfraRuby (InfraRuby Vision) about 9 years ago
#5 [ruby-core:80210]

Please update the documentation for String#codepoints too.

String#codepoints does return (Unicode) codepoints for US-ASCII and ISO-8859-1 as those encodings are the basis of Unicode.

Maybe add Encoding#unicode_codepoints? which returns true for these encodings: US-ASCII, ISO-8859-1, UTF-8, UTF-16(BE|LE), UTF-32(BE|LE).

(Also, there's an unrelated change in that revision.)
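
The proposed predicate does not exist in Ruby; as a rough sketch of the idea, a free-standing helper could look like the following. The method name `unicode_codepoints?` and the list of encodings are taken from the proposal above, and everything else is an assumption made for illustration:

```ruby
# Hypothetical helper, NOT part of Ruby's Encoding API: true when
# String#codepoints on a string in this encoding yields Unicode code points.
def unicode_codepoints?(encoding)
  # Encodings listed in the proposal above.
  unicode_based = [
    Encoding::US_ASCII, Encoding::ISO_8859_1, Encoding::UTF_8,
    Encoding::UTF_16BE, Encoding::UTF_16LE,
    Encoding::UTF_32BE, Encoding::UTF_32LE
  ]
  # Accept either an Encoding object or an encoding name.
  enc = encoding.is_a?(Encoding) ? encoding : Encoding.find(encoding)
  unicode_based.include?(enc)
end

puts unicode_codepoints?("UTF-8")         # true
puts unicode_codepoints?("Windows-1252")  # false
```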

Updated by stomar (Marcus Stollsteimer) about 9 years ago
#6 [ruby-core:80212]

@duerst (Martin Dürst), @normal

r58000 accidentally reverts r57997 ("deduplicate static rb_str_format format strings") for string.c.

Updated by duerst (Martin Dürst) about 9 years ago
#7 [ruby-core:80215]

InfraRuby (InfraRuby Vision) wrote:

> Please update the documentation for String#codepoints too.

That says "This is a shorthand for str.each_codepoint.to_a".

> String#codepoints does return (Unicode) codepoints for US-ASCII and ISO-8859-1 as those encodings are the basis of Unicode.

Well, yes, and for almost all encodings, the returned values are Unicode code points for the ASCII characters, and for some other encodings, there is a bit more overlap. I don't think we need to go into too much detail.

> Maybe add Encoding#unicode_codepoints? which returns true for these encodings: US-ASCII, ISO-8859-1, UTF-8, UTF-16(BE|LE), UTF-32(BE|LE).

There are quite a few other cases where the behavior of String methods changes depending on the string's Encoding. I think it would be good to have access to this information, but methods with more general names may be needed.

Anyway, to get Unicode codepoints out of an arbitrary string, string.encode('UTF-8').codepoints will always do the job.
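
As a small sketch of that approach, using a multi-byte encoding for contrast: the helper name `unicode_codepoints` is hypothetical and simply wraps the encode call suggested above.

```ruby
# Hypothetical helper: Unicode code points regardless of the source encoding,
# obtained by transcoding to UTF-8 first.
def unicode_codepoints(str)
  str.encode(Encoding::UTF_8).codepoints
end

# HIRAGANA LETTER A (U+3042), transcoded to Shift_JIS (bytes 0x82 0xA0).
sjis = "\u3042".encode("Shift_JIS")

native  = sjis.codepoints           # code point in Shift_JIS: [0x82A0]
unicode = unicode_codepoints(sjis)  # Unicode code point:      [0x3042]

puts format("Shift_JIS: 0x%X, Unicode: 0x%X", native.first, unicode.first)
```

Note that String#encode can raise (for example Encoding::UndefinedConversionError) when the source bytes have no mapping in the target encoding, so callers handling untrusted input may want to rescue those errors.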

> (Also, there's an unrelated change in that revision.)

Yes, thanks for noticing, fixed.

Updated by InfraRuby (InfraRuby Vision) about 9 years ago
#8 [ruby-core:80254]

Thanks!
