Feature #11814

String#valid_encoding? without force_encoding

Added by usa (Usaku NAKAMURA) over 3 years ago. Updated over 3 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:72098]

Description

Currently we have to set an encoding on a string in order to validate it, like this:

str.force_encoding('euc-jp').valid_encoding?  # => true or false

But modifying the string just to validate it is not very elegant.
knu-san asked for a way to validate a string without modifying it [*1].

Therefore, I propose adding an optional encoding parameter to String#valid_encoding?.

str.valid_encoding?('euc-jp')  # => true or false

A patch is attached.

[*1] https://twitter.com/knu/status/676009662655934465 (in Japanese)


Files

valid_encoding.patch (4.4 KB), added by usa (Usaku NAKAMURA), 12/13/2015 12:38 PM

History

Updated by naruse (Yui NARUSE) over 3 years ago

Could you show the use case?

As far as I know, str.force_encoding('euc-jp').valid_encoding? is sufficient,
because if it returns false, the only thing the caller should do is raise an error.

Updated by naruse (Yui NARUSE) over 3 years ago

  • Status changed from Open to Rejected

knu says it is to guess the encoding.

Ruby shouldn't help with such hard and often misused work.

Updated by knu (Akinori MUSHA) over 3 years ago

The first requirement for me was not to modify the original string object, so it should read:

str.dup.force_encoding('euc-jp').valid_encoding?

instead. This would cost one string object allocation just for testing, but the byte array would be shared while keeping the original object intact.
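
A minimal sketch of the difference, using only core String methods. The sample bytes (the EUC-JP encoding of two kanji) and the variable names are assumptions chosen for illustration:

```ruby
# Four bytes that form valid EUC-JP, initially tagged as binary (ASCII-8BIT).
str = "\xC6\xFC\xCB\xDC".b

# Checking on a duplicate: the original keeps its encoding.
copy_valid = str.dup.force_encoding('euc-jp').valid_encoding?
encoding_after_dup = str.encoding

# Checking in place: the receiver's encoding is changed as a side effect.
in_place_valid = str.force_encoding('euc-jp').valid_encoding?
encoding_after_force = str.encoding
```

After the first check the original is still tagged ASCII-8BIT; after the second it is tagged EUC-JP, which is the side effect the proposal wanted to avoid.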

Updated by knu (Akinori MUSHA) over 3 years ago

For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of ICU4C or a port of Mozilla's encoding detector.

Updated by knu (Akinori MUSHA) over 3 years ago

I have given up on this idea for now because I thought the use cases would not be as broad as expected, and just adding valid_encoding?(enc) would not be enough if you got serious about encoding detection. (Sorry usa-san!)

However, since this issue has been raised, let me share one good use case for future viewers.

Suppose you have a list of byte arrays whose encoding you don't know, like when you want to guess the encoding of the file names stored in a zip file.

So, if you had String#valid_encoding?(enc) you could achieve it like this without modifying, copying or concatenating strings:

POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]

encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| b.valid_encoding?(enc) }
}.first
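
For comparison, roughly the same narrowing-down can be sketched in current Ruby with the dup-based check, at the cost of one temporary String object per test. The helper names `valid_in?` and `guess_common_encoding` and the sample file names are hypothetical, introduced here only for illustration:

```ruby
# Hypothetical helper: validates on a duplicate so the receiver is never modified.
def valid_in?(bytes, enc)
  bytes.dup.force_encoding(enc).valid_encoding?
end

# Hypothetical helper: keeps only the candidates that are valid for every
# byte array, then returns the first survivor (or nil if none remain).
def guess_common_encoding(byte_arrays, candidates)
  byte_arrays.inject(candidates) { |encs, b|
    encs.select { |enc| valid_in?(b, enc) }
  }.first
end

# Sample input (an assumption): one name in Shift_JIS bytes, one plain ASCII.
names = ["\x93\xFA\x96\x7B.txt".b, "readme.txt".b]
candidates = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ASCII_8BIT]

guess = guess_common_encoding(names, candidates)
```

With this input the first name is not valid UTF-8, so UTF-8 is dropped and Windows-31J is the first remaining candidate. The input strings keep their original encoding tag.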

Updated by duerst (Martin Dürst) over 3 years ago

I agree with Yui.

Akinori MUSHA wrote:

For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of ICU4C or a port of Mozilla's encoding detector.

They also should be faster on long strings, and may use byte/character frequency and other heuristics. And it's clear to the user that this is magic that may fail.

Updated by duerst (Martin Dürst) over 3 years ago

Akinori MUSHA wrote:

Suppose you have a list of byte arrays whose encoding you don't know, like when you want to guess the encoding of the file names stored in a zip file.

So, if you had String#valid_encoding?(enc) you could achieve it like this without modifying, copying or concatenating strings:

POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]

encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| b.valid_encoding?(enc) }
}.first

A few comments on this program:

  • Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.

  • Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII_8BIT (or US_ASCII) never gets used.

  • There are many more encodings, but distinguishing them is difficult/impossible with this method.
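
The first two points can be checked directly in current Ruby. This is a minimal sketch; the all-bytes test string and the variable names are assumptions made for illustration:

```ruby
# A binary string containing every possible byte value once.
all_bytes = (0..255).to_a.pack('C*')

candidates = [
  Encoding::ISO_8859_1,
  Encoding::ASCII_8BIT,
  Encoding::US_ASCII,
  Encoding::UTF_8
]

# Map each candidate encoding to whether the byte string is valid in it.
# The check runs on a duplicate, so all_bytes itself is not retagged.
validity = candidates.map { |enc|
  [enc, all_bytes.dup.force_encoding(enc).valid_encoding?]
}.to_h
```

ISO_8859_1 and ASCII_8BIT accept every byte, so a candidate list that places ISO_8859_1 first never reaches the entries after it, while US_ASCII and UTF_8 reject this input.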

Updated by knu (Akinori MUSHA) over 3 years ago

Martin Dürst wrote:

A few comments on this program:

Thanks!

  • Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.

Maybe not. You could choose to perform the CAP encoding when the encoding was unknown (ASCII_8BIT), or just use the binary garbage as is if the storage was capable of saving binary file names (like ZFS).

  • Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII_8BIT (or US_ASCII) never gets used.

Ah, so true. My bad. Anyway, I put ASCII_8BIT there as a sentinel so that encoding would never be nil, so US_ASCII was not an option.

  • There are many more encodings, but distinguishing them is difficult/impossible with this method.

I know, but in most cases you have some idea as to what the possible encodings are and it is sufficient to try just a few encodings in such cases. This example was meant to be one of them.

If you need more, a BOM-based encoding detector could be another use case for valid_encoding?(enc), I don't know.

I already named a few gems for serious use, so please don't be so strict about these casual use cases.
