When implementing an IO-like object, I'd like to handle encoding correctly. To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).
I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.
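To make the proposal concrete, here is a pure-Ruby sketch of the API. The table values are hard-coded stand-ins covering only three encodings; the real implementation would read min_enc_len and max_enc_len from the OnigEncodingType struct:

```ruby
class Encoding
  # Hand-written stand-in for the data in OnigEncodingType (hypothetical
  # values for a few encodings only; a real implementation covers all):
  CHAR_SIZE_TABLE = {
    UTF_8    => [1, 4],
    UTF_16BE => [2, 4],
    UTF_32BE => [4, 4],
  }

  def min_char_size
    CHAR_SIZE_TABLE.fetch(self).first
  end

  def max_char_size
    CHAR_SIZE_TABLE.fetch(self).last
  end

  def minmax_char_size
    CHAR_SIZE_TABLE.fetch(self)
  end
end
```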
Seems sensible to me. Someone from the Ruby core team or Matz should chime in and comment, or this may come up at the next Ruby developer meeting. :)
When implementing an IO-like object, I'd like to handle encoding correctly.
This should be possible without knowing the minimum and maximum length of characters in the encoding.
To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).
I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.
It may be that there is indeed something that you need for implementing your IO-like object. But my guess is that it's most probably something at a higher abstraction level than min_enc_len and max_enc_len. It would be good to have more specific information about what you actually want to do or are doing. I'm changing the status to Feedback.
Encoding::UTF_8.max_char_size # => 6
The max length for UTF-8 should actually be 4. I have submitted a separate bug (#13590) to fix this.
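A quick check that 4 bytes suffice: UTF-8 was restricted to code points up to U+10FFFF (RFC 3629), and even the largest code point encodes in 4 bytes:

```ruby
# U+10FFFF, the largest Unicode code point, is 4 bytes in UTF-8:
p [0x10FFFF].pack("U").bytesize  # => 4
```

(The old 6-byte limit comes from the original UTF-8 definition, which allowed code points up to U+7FFFFFFF.)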
I'm implementing a tar archive reader that takes an arbitrary stream (StringIO, File, Zlib::GzipReader, ...) and yields the individual files in the archive. I'd like the yielded file reader to conform as closely as possible to the File interface.
I'd like to implement #getc without necessarily being able to modify the external_encoding of the underlying stream. My strategy so far is to keep reading bytes into a buffer and force_encoding it to the target encoding until I have valid_encoding?. If I know the character length limits, I can bail out when I still don't have a valid character after reading the maximum number of bytes: return a string containing only the minimum number of bytes, and hold the extras back for the next invocation of #getc (this seems to be the behaviour of IO#getc).
This is how that would look with the proposed methods:
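A sketch of that strategy, assuming the proposed methods exist. Since they don't yet, CHAR_SIZES below is a hand-written stand-in for minmax_char_size (hypothetical values for three encodings), and CharReader is an illustrative name, not code from the gem:

```ruby
require 'stringio'

# Stand-in for the proposed Encoding#minmax_char_size (hypothetical):
CHAR_SIZES = {
  Encoding::UTF_8    => [1, 4],
  Encoding::UTF_16BE => [2, 4],
  Encoding::UTF_32BE => [4, 4],
}

class CharReader
  def initialize(io, encoding)
    @io = io
    @encoding = encoding
    @min, @max = CHAR_SIZES.fetch(encoding)
    @pending = "".b # bytes held back from a previous failed decode
  end

  # Like IO#getc: returns the next character, or nil at end of stream.
  def getc
    buf = @pending
    @pending = "".b
    loop do
      # A candidate is complete once it is a whole number of minimal
      # units and decodes to exactly one valid character.
      if buf.bytesize >= @min && (buf.bytesize % @min).zero?
        candidate = buf.dup.force_encoding(@encoding)
        return candidate if candidate.valid_encoding? && candidate.size == 1
      end
      break if buf.bytesize >= @max
      byte = @io.getbyte
      break if byte.nil?
      buf << byte
    end
    return nil if buf.empty?
    # Still no valid character after @max bytes (or at EOF): return the
    # minimum number of bytes and hold the rest back, like IO#getc.
    @pending = buf.byteslice(@min, buf.bytesize) || "".b
    buf.byteslice(0, @min).force_encoding(@encoding)
  end
end
```

For example, reading UTF-8 one character at a time from a StringIO yields "h", then the two-byte "é", and so on.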
I hope there are no encodings where valid characters might not be a multiple of the minimum size.
Me too :) it works for now... the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32; UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.
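A quick empirical check of that, by transcoding a few characters ("𐐷" is U+10437, outside the BMP, so it needs a surrogate pair):

```ruby
# "a" and "€" fit in one UTF-16 code unit (2 bytes); "𐐷" needs two
# code units (4 bytes); UTF-32 is always four bytes per character.
["a", "€", "𐐷"].each do |ch|
  p [ch, ch.encode(Encoding::UTF_16BE).bytesize,
         ch.encode(Encoding::UTF_32BE).bytesize]
end
```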
I hope there are no encodings where valid characters might not be a multiple of the minimum size.
Me too :) it works for now... the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32; UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.
Not true. There are quite a few East Asian encodings with a max length of 2, 3, or 4, e.g. Shift_JIS, EUC_JP, GB18030, ... But it's still true that the maximum size is a multiple of the minimum size.
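For illustration, the byte lengths of a kanji in two of those encodings ("漢" is in JIS X 0208, so it takes two bytes in both, while ASCII stays single-byte, i.e. min_enc_len is 1):

```ruby
# "漢" (U+6F22) is a double-byte character in both Shift_JIS and EUC-JP:
p "漢".encode(Encoding::Shift_JIS).bytesize  # => 2
p "漢".encode(Encoding::EUC_JP).bytesize     # => 2
# ...while plain ASCII remains a single byte:
p "a".encode(Encoding::EUC_JP).bytesize      # => 1
```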