Feature #13588
Closed: Add Encoding#min_char_size, #max_char_size, #minmax_char_size
Description
When implementing an IO-like object, I'd like to handle encoding correctly. To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).
I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.
Encoding::UTF_8.min_char_size # => 1
Encoding::UTF_8.max_char_size # => 6
Encoding::UTF_8.minmax_char_size # => [1, 6]
Updated by shevegen (Robert A. Heiler) almost 8 years ago
Seems sensible to me. Guess someone from the ruby core team or matz should chime in and perhaps comment - or someone may do this in the next internal ruby dev team meeting. :)
Updated by duerst (Martin Dürst) almost 8 years ago
- Related to Feature #11094: Remove traces of 6-byte UTF-8 added
Updated by duerst (Martin Dürst) almost 8 years ago
- Related to Bug #13590: Change max byte length of UTF-8 to 4 bytes to conform to definition of UTF-8 added
Updated by duerst (Martin Dürst) almost 8 years ago
- Status changed from Open to Feedback
haines (Andrew Haines) wrote:
When implementing an IO-like object, I'd like to handle encoding correctly.
This should be possible without knowing the minimum and maximum length of characters in the encoding.
To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).
I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.
It may be that there is indeed something that you need for implementing your IO-like object. But my guess is that it's most probably something on a higher abstraction level than min_enc_len and max_enc_len. It would be good to have more specific information about what you actually want to do or are doing. I'm changing the status to feedback.
Encoding::UTF_8.max_char_size # => 6
The max length for UTF-8 should actually be 4. I have submitted a separate bug (#13590) to fix this.
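To illustrate the 4-byte limit, here is a small sketch that checks how many bytes a few characters occupy in UTF-8, including U+10FFFF, the highest Unicode code point. The helper name utf8_byte_sizes is made up for this example and is not part of Ruby.

```ruby
# Hypothetical helper (not a Ruby API): byte size of each character
# when encoded as UTF-8.
def utf8_byte_sizes(chars)
  chars.map { |char| char.encode(Encoding::UTF_8).bytesize }
end

samples = [
  "A",          # U+0041, ASCII range
  "\u00E9",     # U+00E9, Latin small letter e with acute
  "\u3042",     # U+3042, Hiragana letter A
  "\u{1F600}",  # U+1F600, outside the Basic Multilingual Plane
  "\u{10FFFF}"  # highest Unicode code point
]

puts utf8_byte_sizes(samples).inspect # => [1, 2, 3, 4, 4]
```

This only samples a handful of characters, so it demonstrates the limit rather than proving it.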
Updated by haines (Andrew Haines) almost 8 years ago
I'm implementing a tar archive reader that takes an arbitrary stream (StringIO, File, Zlib::GzipReader, ...) and yields the individual files in the archive. I'd like the yielded file reader to conform as closely as possible to the File interface.
I'd like to implement #getc without necessarily being able to modify the external_encoding of the underlying stream. My strategy so far is to keep reading bytes into a buffer and force_encoding to the target encoding, until I have valid_encoding?. If I know the character length limits, then I can bail out if I still don't have a valid character after I've read the maximum number of bytes, return a string containing only the minimum number of bytes, and hold the extras back for the next invocation of #getc (this seems to be the behaviour of IO#getc).
This is how that would look with the proposed methods:
def getc
  check_not_closed!
  return nil if eof?
  char = String.new(encoding: Encoding::BINARY)
  min_char_size, max_char_size = external_encoding.minmax_char_size
  until char.size == max_char_size || eof?
    char << read(min_char_size)
    char.force_encoding external_encoding
    return encode(char) if char.valid_encoding?
    char.force_encoding Encoding::BINARY
  end
  char.slice!(min_char_size..-1).bytes.reverse_each do |byte|
    ungetbyte byte
  end
  encode(char)
end
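For readers who want to try the strategy without the proposed methods, the following is a minimal standalone sketch that runs against a StringIO. The method name read_char and its keyword arguments are invented for this example, and the size limits are passed in by the caller because Encoding does not expose them. It leaves out the check_not_closed! and encode steps of the version above.

```ruby
require "stringio"

# Hypothetical helper (not a Ruby API): reads one character from io,
# interpreting the bytes as `encoding`. The caller supplies the size limits.
def read_char(io, encoding, min_char_size:, max_char_size:)
  return nil if io.eof?

  buffer = String.new(encoding: Encoding::BINARY)
  until buffer.bytesize >= max_char_size || io.eof?
    buffer << io.read(min_char_size)
    candidate = buffer.dup.force_encoding(encoding)
    return candidate if candidate.valid_encoding?
  end

  # No valid character found: keep the first min_char_size bytes and push
  # the rest back so the next call starts from them.
  extra = buffer.slice!(min_char_size..-1)
  extra.bytes.reverse_each { |byte| io.ungetbyte(byte) } if extra
  buffer.force_encoding(encoding)
end

# Usage: "h", e with acute (2 bytes), and U+1F600 (4 bytes) in UTF-8.
io = StringIO.new("h\u00E9\u{1F600}".b)
while (char = read_char(io, Encoding::UTF_8, min_char_size: 1, max_char_size: 4))
  puts "#{char.bytesize} byte(s), valid: #{char.valid_encoding?}"
end
```

When the input is invalid, the returned string carries the target encoding but fails valid_encoding?, so callers can tell the two cases apart.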
Updated by phluid61 (Matthew Kerwin) almost 8 years ago
haines (Andrew Haines) wrote:
until char.size == max_char_size || eof?
  char << read(min_char_size)
I hope there are no encodings where valid characters might not be a multiple of the minimum size.
Updated by haines (Andrew Haines) almost 8 years ago
phluid61 (Matthew Kerwin) wrote:
I hope there are no encodings where valid characters might not be a multiple of the minimum size.
Me too :) it works for now... the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32; UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.
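The code-unit sizes described here can be checked from Ruby with a short sketch. The helper name byte_sizes is made up for this example. It uses the little-endian variants because encoding to plain UTF-16 or UTF-32 prepends a byte order mark, which would inflate the counts.

```ruby
# Hypothetical helper (not a Ruby API): byte size of one character in each
# of the named encodings.
def byte_sizes(char, encoding_names)
  encoding_names.map { |name| [name, char.encode(name).bytesize] }.to_h
end

encodings = %w[UTF-16LE UTF-32LE]

bmp   = byte_sizes("a", encodings)          # one 16-bit code unit in UTF-16
astral = byte_sizes("\u{1F600}", encodings) # surrogate pair in UTF-16

puts bmp["UTF-16LE"]    # => 2
puts astral["UTF-16LE"] # => 4
puts bmp["UTF-32LE"]    # => 4
puts astral["UTF-32LE"] # => 4
```

In both encodings, every size seen here is a multiple of the smallest one.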
Updated by duerst (Martin Dürst) almost 8 years ago
haines (Andrew Haines) wrote:
phluid61 (Matthew Kerwin) wrote:
I hope there are no encodings where valid characters might not be a multiple of the minimum size.
Me too :) it works for now... the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32; UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.
Not true. There are quite a few East Asian encodings with max length of 2, 3, or 4. E.g. Shift_JIS, EUC_JP, GB18030,... But it's still true that the maximum size is a multiple of the minimum size.
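As a small illustration of the variable-length East Asian encodings mentioned here, this sketch compares an ASCII letter with a Hiragana letter in Shift_JIS and EUC-JP. The helper name encoded_bytesize is made up for this example. It samples only two characters, so it does not establish the maximum length of either encoding.

```ruby
# Hypothetical helper (not a Ruby API): byte size of a character after
# transcoding it to the named encoding.
def encoded_bytesize(char, encoding_name)
  char.encode(encoding_name).bytesize
end

ascii    = "a"
hiragana = "\u3042" # U+3042, Hiragana letter A

puts encoded_bytesize(ascii, "Shift_JIS")    # => 1
puts encoded_bytesize(hiragana, "Shift_JIS") # => 2
puts encoded_bytesize(ascii, "EUC-JP")       # => 1
puts encoded_bytesize(hiragana, "EUC-JP")    # => 2
```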