Feature #11814: String#valid_encoding? without force_encoding - Ruby - Ruby Issue Tracking System

Feature #11814

closed

String#valid_encoding? without force_encoding

Added by usa (Usaku NAKAMURA) over 10 years ago. Updated over 10 years ago.

Status: Rejected

[ruby-core:72098]

Description

Currently we have to set an encoding on a string in order to validate it, like this:

str.force_encoding('euc-jp').valid_encoding?  # => true or false

But modifying the string just to validate it is not very smart.
knu-san asked for a way to validate a string without modifying it [*1].
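
To make the side effect concrete, here is a minimal sketch. The sample bytes (the EUC-JP encoding of "あ") and the variable names are illustrative assumptions, not taken from the report.

```ruby
# Bytes 0xA4 0xA2 are "あ" in EUC-JP; start with them tagged as binary.
str = "\xA4\xA2".b
encoding_before = str.encoding   # binary (ASCII-8BIT)

# The check itself works, but force_encoding retags the receiver in place.
valid = str.force_encoding('euc-jp').valid_encoding?

encoding_after = str.encoding    # the original object is now tagged EUC-JP
```

On a frozen string the same call raises an error instead of retagging, so the idiom cannot be used there at all.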

So I propose adding an optional encoding parameter to String#valid_encoding?:

str.valid_encoding?('euc-jp')  # => true or false

A patch is attached.

[*1] https://twitter.com/knu/status/676009662655934465 (in Japanese)

Files

valid_encoding.patch (4.4 KB) valid_encoding.patch

usa (Usaku NAKAMURA), 12/13/2015 12:38 PM

History

Updated by naruse (Yui NARUSE) over 10 years ago
#1 [ruby-core:72103]

Could you show the use case?

As far as I know, str.force_encoding('euc-jp').valid_encoding? is sufficient,
because if it returns false, the only thing the caller should do is raise an error.

Updated by naruse (Yui NARUSE) over 10 years ago
#2 [ruby-core:72104]

Status changed from Open to Rejected

knu says the purpose is to guess the encoding.

Ruby shouldn't help with such hard and often misused work.

Updated by knu (Akinori MUSHA) over 10 years ago
#3 [ruby-core:72105]

The first requirement for me was not to modify the original string object, so it should read:

str.dup.force_encoding('euc-jp').valid_encoding?

instead. This would cost one string object allocation just for testing, but the byte array would be shared while keeping the original object intact.
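
As a minimal sketch of that idiom, the check can be wrapped in a small helper. The name `valid_in_encoding?` and the sample bytes are hypothetical; this is not part of Ruby's String API.

```ruby
# Hypothetical helper: validate str's bytes against enc without
# retagging str itself. Only the duplicate is passed to force_encoding.
def valid_in_encoding?(str, enc)
  str.dup.force_encoding(enc).valid_encoding?
end

# 0xA4 0xA2 is "あ" in EUC-JP; frozen to show the original is never modified.
bytes = "\xA4\xA2".b.freeze

euc_ok  = valid_in_encoding?(bytes, Encoding::EUC_JP)  # well-formed EUC-JP
utf8_ok = valid_in_encoding?(bytes, Encoding::UTF_8)   # 0xA4 cannot start a UTF-8 sequence
original_encoding = bytes.encoding                     # still binary
```

Because `dup` returns an unfrozen copy, this also works on frozen strings, where calling `force_encoding` directly would raise.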

Updated by knu (Akinori MUSHA) over 10 years ago
#4 [ruby-core:72106]

For guessing the possible encodings of a byte stream, there are gems specialized for that purpose, like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.

Updated by knu (Akinori MUSHA) over 10 years ago
#5 [ruby-core:72108]

I gave up on this idea for now because I thought the use cases would not turn out to be as broad as expected, and just adding valid_encoding?(enc) would not be enough if you got serious about encoding detection. (Sorry usa-san!)

However, since this issue has been raised, let me share one good use case for future readers.

Suppose you have a list of byte arrays whose encoding you don't know, as when you want to guess the encoding of the file names stored in a zip file.

If you had String#valid_encoding?(enc), you could do it like this without modifying, copying or concatenating strings:

POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]

encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| b.valid_encoding?(enc) }
}.first
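
Since the proposal was rejected, `valid_encoding?(enc)` is not available. A rough equivalent that runs on current Ruby, at the cost of one `dup` per check, might look like the sketch below. The method name `guess_common_encoding`, the constant `CANDIDATE_ENCODINGS` and the sample file names are assumptions made for illustration.

```ruby
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8, Encoding::Windows_31J,
  Encoding::ISO_8859_1, Encoding::ASCII_8BIT
].freeze

# Keep only the candidates in which every byte string is well-formed,
# then take the first survivor. Each check works on a duplicate, so the
# inputs keep their original encoding tag.
def guess_common_encoding(byte_strings, candidates = CANDIDATE_ENCODINGS)
  byte_strings.inject(candidates) { |encs, b|
    encs.select { |enc| b.dup.force_encoding(enc).valid_encoding? }
  }.first
end

# "日本.txt" as Shift_JIS bytes, plus a plain ASCII name.
names = ["\x93\xFA\x96\x7B.txt".b, "readme.txt".b]
guessed = guess_common_encoding(names)
```

Here the first name is not well-formed UTF-8, so UTF_8 is eliminated and Windows_31J is the first candidate left. The result depends entirely on the order of the candidate list.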

Updated by duerst (Martin Dürst) over 10 years ago
#6 [ruby-core:72109]

I agree with Yui.

Akinori MUSHA wrote:

> For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.

They also should be faster on long strings, and may use byte/character frequency and other heuristics. And it's clear to the user that this is magic that may fail.

Updated by duerst (Martin Dürst) over 10 years ago
#7 [ruby-core:72110]

Akinori MUSHA wrote:

> Suppose you have a list of byte arrays which you don't know which encoding they are encoded in, like when you want to guess the encoding of the file names stored in a zip file.
>
> So, if you had String#valid_encoding?(enc) you could achieve it like this without modifying, copying or concatenating strings:
>
> POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
>
> encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
>   encs.select { |enc| b.valid_encoding?(enc) }
> }.first

A few comments on this program:

Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.
Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII_8BIT (or US_ASCII) never gets used.
There are many more encodings, but distinguishing them is difficult or impossible with this method.
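
The first two points can be checked directly. This is a small sketch using the `dup.force_encoding` idiom from earlier in the thread; the lambda and variable names are illustrative.

```ruby
# Validate on a duplicate so the input keeps its own encoding tag.
well_formed = ->(str, enc) { str.dup.force_encoding(enc).valid_encoding? }

# A binary string containing every byte value from 0x00 to 0xFF.
all_bytes = (0..255).to_a.pack('C*')

latin1_accepts_all = well_formed.(all_bytes, Encoding::ISO_8859_1)  # single-byte: every byte is valid
binary_accepts_all = well_formed.(all_bytes, Encoding::ASCII_8BIT)  # binary: every byte is valid
ascii_accepts_high = well_formed.("\xFF".b, Encoding::US_ASCII)     # bytes >= 0x80 are rejected
ascii_accepts_low  = well_formed.("abc".b, Encoding::US_ASCII)      # 7-bit input is fine
```

Since ISO_8859_1 accepts any input, any candidate listed after it in a first-match list can never be selected.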

Updated by knu (Akinori MUSHA) over 10 years ago
#8 [ruby-core:72116]

Martin Dürst wrote:

> A few comments on this program:

Thanks!

> Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.

Maybe not. You could choose to perform the CAP encoding when the encoding was unknown (ASCII_8BIT), or just use the binary garbage as is if the storage was capable of saving binary file names (like ZFS).

> Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII_8BIT (or US_ASCII) never gets used.

Ah, so true. My bad. Anyway, I put ASCII_8BIT there as a sentinel so that encoding would never be nil, so US_ASCII was not an option.

> There are many more encodings, but distinguishing them is difficult/impossible with this method.

I know, but in most cases you have some idea of what the possible encodings are, and then it is sufficient to try just a few of them. This example was meant to be one such case.

If you need more, a BOM-based encoding detector could be another use case for valid_encoding?(enc), I don't know.

I already named a few gems for serious use, so please don't be so strict about these casual use cases.
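
For the BOM idea mentioned above, a casual sketch could look like this. The `BOM_TABLE` constant and the method name `encoding_from_bom` are hypothetical, and validation again goes through `force_encoding` on a separate string because `valid_encoding?(enc)` does not exist.

```ruby
# Byte-order marks, longest first, so the UTF-32LE mark (FF FE 00 00)
# is tried before the UTF-16LE mark (FF FE) that it begins with.
BOM_TABLE = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
}.freeze

# Returns the encoding announced by a leading BOM, but only if the rest
# of the data is well-formed in that encoding; otherwise returns nil.
def encoding_from_bom(bytes)
  binary = bytes.b                                   # binary copy; input untouched
  bom, enc = BOM_TABLE.find { |prefix, _| binary.start_with?(prefix) }
  return nil unless bom

  body = binary.byteslice(bom.bytesize, binary.bytesize)
  body.force_encoding(enc).valid_encoding? ? enc : nil
end

with_bom    = encoding_from_bom("\xEF\xBB\xBFhello".b)
without_bom = encoding_from_bom("hello".b)
```

This only recognizes data that actually starts with a BOM. Anything else returns nil and would still need one of the detection gems named earlier.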
