Feature #11814
closed
String#valid_encoding? without force_encoding
Added by usa (Usaku NAKAMURA) over 8 years ago.
Updated over 8 years ago.
Description
Currently we have to set an encoding on a string in order to validate it, like this:
str.force_encoding('euc-jp').valid_encoding? # => true or false
But modifying the string just to check it is not very elegant.
knu-san asked for a way to validate a string without modifying it [*1].
So I propose adding an optional encoding parameter to String#valid_encoding?:
str.valid_encoding?('euc-jp') # => true or false
A patch is attached.
[*1] https://twitter.com/knu/status/676009662655934465 (in Japanese)
Could you show the use case?
As far as I know, str.force_encoding('euc-jp').valid_encoding? is sufficient,
because if it reports the string as invalid, the only thing the caller should do is raise an error.
- Status changed from Open to Rejected
knu says it is for guessing the encoding.
Ruby shouldn't help with such hard and often misused work.
The first requirement for me was not to modify the original string object, so it should read:
str.dup.force_encoding('euc-jp').valid_encoding?
instead. This would cost one string object allocation just for testing, but the byte array would be shared while keeping the original object intact.
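As a rough sketch of that workaround, the idiom can be wrapped in a small helper. The name valid_in_encoding? is made up for this illustration and is not part of Ruby; the sample bytes are the EUC-JP form of a single hiragana character.

```ruby
# Hypothetical helper (not a Ruby API): checks whether the bytes of `str`
# are valid in `enc` without changing the encoding of `str` itself.
def valid_in_encoding?(str, enc)
  # dup allocates a new String object; only the copy is re-tagged.
  str.dup.force_encoding(enc).valid_encoding?
end

bytes = "\xA4\xA2".b                   # hiragana "a" in EUC-JP, held as binary

valid_in_encoding?(bytes, 'euc-jp')    # => true
valid_in_encoding?(bytes, 'utf-8')     # => false
bytes.encoding == Encoding::ASCII_8BIT # => true, the original is untouched
```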
For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.
I gave up on this idea for now because I thought the use cases would not turn out to be as broad as expected, and adding valid_encoding?(enc) alone would not be enough if you got serious about encoding detection. (Sorry usa-san!)
However, since this issue has been raised, let me share one good use case for future readers.
Suppose you have a list of byte arrays whose encoding you don't know, for example when you want to guess the encoding of the file names stored in a zip file.
If you had String#valid_encoding?(enc), you could do it like this without modifying, copying or concatenating strings:
POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| b.valid_encoding?(enc) }
}.first
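For comparison, here is a sketch of the same narrowing-down loop that runs on Ruby as it is today, using dup.force_encoding in place of the proposed valid_encoding?(enc). The method name guess_common_encoding, the constant CANDIDATE_ENCODINGS and the sample file names are assumptions made for this illustration; unlike the proposed API, this version allocates one extra String object per check.

```ruby
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8, Encoding::Windows_31J,
  Encoding::ISO_8859_1, Encoding::ASCII_8BIT
].freeze

# Hypothetical helper: returns the first candidate encoding in which
# every byte string is valid, or nil if no candidate fits all of them.
def guess_common_encoding(byte_strings, candidates = CANDIDATE_ENCODINGS)
  byte_strings.inject(candidates) { |encs, b|
    # Keep only the encodings that are still valid for this string.
    encs.select { |enc| b.dup.force_encoding(enc).valid_encoding? }
  }.first
end

# "\x93\xFA\x96\x7B" is a two-character Japanese word in Windows-31J.
names = ["\x93\xFA\x96\x7B.txt".b, "readme.txt".b]
guess_common_encoding(names)  # => Encoding::Windows_31J
```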
I agree with Yui.
Akinori MUSHA wrote:
For guessing the possible encodings for a byte stream, there are gems specialized for that purpose like charlock_holmes, ucharset and rcharset. They are mostly either a wrapper of LibICU4C or a port of Mozilla's encoding detector.
They also should be faster on long strings, and may use byte/character frequency and other heuristics. And it's clear to the user that this is magic that may fail.
Akinori MUSHA wrote:
Suppose you have a list of byte arrays whose encoding you don't know, for example when you want to guess the encoding of the file names stored in a zip file.
If you had String#valid_encoding?(enc), you could do it like this without modifying, copying or concatenating strings:
POSSIBLE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_31J, Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
encoding = byte_arrays.inject(POSSIBLE_ENCODINGS) { |encs, b|
  encs.select { |enc| b.valid_encoding?(enc) }
}.first
A few comments on this program:
- Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.
- Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII-8BIT (or US-ASCII) never gets used.
- There are many more encodings, but distinguishing them is difficult/impossible with this method.
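The first two points can be checked with a short sketch that validates every possible single byte against each encoding. The helper name valid_in? is made up for this illustration and is not a Ruby API.

```ruby
# Hypothetical helper: validity check on a copy, leaving `str` untouched.
def valid_in?(str, enc)
  str.dup.force_encoding(enc).valid_encoding?
end

# All 256 possible one-byte strings, as binary.
all_bytes = (0..255).map { |byte| [byte].pack('C') }

# ISO-8859-1 and ASCII-8BIT accept every byte, so they can never be ruled out.
all_bytes.all? { |b| valid_in?(b, Encoding::ISO_8859_1) }  # => true
all_bytes.all? { |b| valid_in?(b, Encoding::ASCII_8BIT) }  # => true

# US-ASCII accepts only the 7-bit range, so it does reject garbage.
all_bytes.count { |b| valid_in?(b, Encoding::US_ASCII) }   # => 128
```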
Martin Dürst wrote:
A few comments on this program:
Thanks!
- Encoding::ASCII_8BIT will pick up garbage. Encoding::US_ASCII is much better.
Maybe not. You could choose to perform the CAP encoding when the encoding was unknown (ASCII_8BIT), or just use the binary garbage as is if the storage was capable of saving binary file names (like ZFS).
- Encoding::ISO_8859_1 is always valid, for all bytes, so ASCII-8BIT (or US-ASCII) never gets used.
Ah, so true. It’s my bad. Anyway, I put ASCII_8BIT there as a sentinel so that encoding would never be nil, so US_ASCII was not an option.
- There are many more encodings, but distinguishing them is difficult/impossible with this method.
I know, but in most cases you have some idea as to what the possible encodings are and it is sufficient to try just a few encodings in such cases. This example was meant to be one of them.
If you need more, a BOM-based encoding detector could be another use case for valid_encoding?(enc), I don't know.
I already named a few gems for serious use, so please don't be so strict about these casual use cases.