Feature #11814
String#valid_encoding? without force_encoding
Description
Currently we have to set an encoding on a string in order to validate it, like this:
str.force_encoding('euc-jp').valid_encoding? # => true or false
But modifying the string just to validate it is not very smart.
knu-san needs a way to validate a string without modifying it [*1].
So I propose adding an optional encoding parameter to String#valid_encoding?:
str.valid_encoding?('euc-jp') # => true or false
A patch is attached.
[*1] https://twitter.com/knu/status/676009662655934465 (in Japanese)
Updated by naruse (Yui NARUSE) over 8 years ago
Could you show the use case?
As far as I know, str.force_encoding('euc-jp').valid_encoding? is sufficient,
because if it returns false, the only thing the caller should do is raise an error.
Updated by naruse (Yui NARUSE) over 8 years ago
- Status changed from Open to Rejected
knu says it is to guess the encoding.
Ruby shouldn't assist with such a hard and often misused task.
Updated by knu (Akinori MUSHA) over 8 years ago
The first requirement for me was not to modify the original string object, so it should read:
str.dup.force_encoding('euc-jp').valid_encoding?
instead. This would cost one string object allocation just for testing, but the byte array would be shared while keeping the original object intact.
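As a minimal sketch of the non-mutating check described above: the helper name valid_in_encoding? and the sample bytes are assumptions made for illustration, not part of Ruby or of the attached patch.

```ruby
# Hypothetical helper: validates str against enc without changing str itself.
# dup allocates one new String object; force_encoding then retags only the copy.
def valid_in_encoding?(str, enc)
  str.dup.force_encoding(enc).valid_encoding?
end

# The two bytes 0xA4 0xA2 form one hiragana character in EUC-JP.
# String#b returns a copy tagged as ASCII-8BIT (binary).
bytes = "\xA4\xA2".b
original_encoding = bytes.encoding

euc_ok  = valid_in_encoding?(bytes, Encoding::EUC_JP) # => true
utf8_ok = valid_in_encoding?(bytes, Encoding::UTF_8)  # => false

# The original object still carries its original encoding tag.
unchanged = (bytes.encoding == original_encoding)     # => true
```

The cost is the one extra object per check that the comment above mentions; the original string is left untouched.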
Updated by knu (Akinori MUSHA) over 8 years ago
For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.
Updated by knu (Akinori MUSHA) over 8 years ago
I gave up on this idea for now because I thought the use cases would not expand as widely as expected, and just adding valid_encoding?(enc) would not be enough if you got serious about encoding detection. (Sorry usa-san!)
However, since this issue has been raised, let me share one good use case for future viewers.
Suppose you have a list of byte arrays whose encoding you don't know, such as when you want to guess the encoding of the file names stored in a zip file.
So, if you had String#valid_encoding?(enc) you could achieve it like this without modifying, copying or concatenating strings:
POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
encs.select { |enc| b.valid_encoding?(enc) }
}.first
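Since String#valid_encoding? does not accept an argument, the snippet above does not run as written. A rough runnable equivalent, assuming a hypothetical helper valid_as? built on dup.force_encoding and made-up sample file names, might look like this:

```ruby
POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]

# Hypothetical stand-in for the proposed valid_encoding?(enc).
# It costs one extra String object per check but leaves the input untouched.
def valid_as?(bytes, enc)
  bytes.dup.force_encoding(enc).valid_encoding?
end

# Illustrative data: one file name in Windows-31J and one plain ASCII name,
# both tagged as binary the way raw zip entries would be.
byte_arrays = [
  "\x83\x65\x83\x58\x83\x67.txt".b,
  "readme.txt".b
]

# Narrow the candidate list with every name; keep the first survivor.
encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| valid_as?(b, enc) }
}.first
# encoding is Encoding::Windows_31J for this sample data
```

The result depends on the order of POSSIBLE_ENCODINGS, because the first surviving candidate wins.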
Updated by duerst (Martin Dürst) over 8 years ago
I agree with Yui.
Akinori MUSHA wrote:
For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.
They also should be faster on long strings, and may use byte/character frequency and other heuristics. And it's clear to the user that this is magic that may fail.
Updated by duerst (Martin Dürst) over 8 years ago
Akinori MUSHA wrote:
Suppose you have a list of byte arrays whose encoding you don't know, such as when you want to guess the encoding of the file names stored in a zip file.
So, if you had String#valid_encoding?(enc) you could achieve it like this without modifying, copying or concatenating strings:
POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
encs.select { |enc| b.valid_encoding?(enc) }
}.first
A few comments on this program:
- Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.
- Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII8BIT (or US-ASCII) never get used.
- There are many more encodings, but distinguishing them is difficult/impossible with this method.
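The first two points can be checked with a short sketch. The lambda valid_as and the test string are assumptions for illustration only:

```ruby
# A binary string containing every possible byte value exactly once.
all_bytes = (0..255).to_a.pack("C*")

# Hypothetical non-mutating check built on dup.force_encoding.
valid_as = ->(bytes, enc) { bytes.dup.force_encoding(enc).valid_encoding? }

latin1_ok = valid_as.call(all_bytes, Encoding::ISO_8859_1) # => true
binary_ok = valid_as.call(all_bytes, Encoding::ASCII_8BIT) # => true
ascii_ok  = valid_as.call(all_bytes, Encoding::US_ASCII)   # => false
utf8_ok   = valid_as.call(all_bytes, Encoding::UTF_8)      # => false
```

Because ISO-8859-1 and ASCII-8BIT accept any byte sequence, a candidate placed after either of them in the list can never be selected; US-ASCII rejects any byte above 0x7F.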
Updated by knu (Akinori MUSHA) over 8 years ago
Martin Dürst wrote:
A few comments on this program:
Thanks!
- Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.
Maybe not. You could choose to perform the CAP encoding when the encoding was unknown (ASCII_8BIT), or just use the binary garbage as is if the storage was capable of saving binary file names (like ZFS).
- Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII8BIT (or US-ASCII) never get used.
Ah, so true. It’s my bad. Anyway I put ASCII_8BIT as a sentinel so encoding would never be nil, so US_ASCII was not an option.
- There are many more encodings, but distinguishing them is difficult/impossible with this method.
I know, but in most cases you have some idea as to what the possible encodings are and it is sufficient to try just a few encodings in such cases. This example was meant to be one of them.
If you need more, a BOM-based encoding detector could be another use case for valid_encoding?(enc), I don't know.
I already named a few gems for serious use, so please don't be so strict about these casual use cases.