Feature #13588

Add Encoding#min_char_size, #max_char_size, #minmax_char_size

Added by haines (Andrew Haines) over 2 years ago. Updated about 2 years ago.

Status:
Feedback
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:81335]

Description

When implementing an IO-like object, I'd like to handle encoding correctly. To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).

I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.

Encoding::UTF_8.min_char_size     # => 1
Encoding::UTF_8.max_char_size     # => 6
Encoding::UTF_8.minmax_char_size  # => [1, 6]

[1] https://github.com/haines/char_size


Related issues

Related to Ruby master - Feature #11094: Remove traces of 6-byte UTF-8 (Closed, 04/24/2015)
Related to Ruby master - Bug #13590: Change max byte length of UTF-8 to 4 bytes to conform to definition of UTF-8 (Closed)

History

#1

Updated by shevegen (Robert A. Heiler) over 2 years ago

Seems sensible to me. I guess someone from the ruby core team or matz should chime in and comment, or this may be discussed at the next internal ruby developer meeting. :)

#2

Updated by duerst (Martin Dürst) over 2 years ago

  • Related to Feature #11094: Remove traces of 6-byte UTF-8 added

#3

Updated by duerst (Martin Dürst) over 2 years ago

  • Related to Bug #13590: Change max byte length of UTF-8 to 4 bytes to conform to definition of UTF-8 added

#4

Updated by duerst (Martin Dürst) over 2 years ago

  • Status changed from Open to Feedback

haines (Andrew Haines) wrote:

> When implementing an IO-like object, I'd like to handle encoding correctly.

This should be possible without knowing the minimum and maximum length of characters in the encoding.

> To do so, I need to know the minimum and maximum character sizes for the encoding of the stream I'm reading. However, I can't find a way to access this information from Ruby (I ended up writing a gem with a native extension [1] to do so).
>
> I'd like to propose adding instance methods min_char_size, max_char_size, and minmax_char_size to the Encoding class to expose the information stored in the OnigEncodingType struct's min_enc_len and max_enc_len fields.

It may be that there is indeed something that you need for implementing your IO-like object. But my guess is that it's most probably something on a higher abstraction level than min_enc_len and max_enc_len. It would be good to have more specific information about what you actually want to do or are doing. I'm changing the status to Feedback.

> Encoding::UTF_8.max_char_size # => 6

The max length for UTF-8 should actually be 4. I have submitted a separate bug (#13590) to fix this.
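
A quick check in current Ruby (illustrative only, not part of any patch): the validator already enforces the 4-byte limit even where max_enc_len reports 6.

```ruby
# U+1F48E encoded as four UTF-8 bytes: accepted as valid.
four = "\xF0\x9F\x92\x8E".b.force_encoding(Encoding::UTF_8)
p four.valid_encoding?  # => true

# A would-be five-byte sequence (lead byte 0xF8, historical UTF-8):
# rejected by Ruby's UTF-8 validation.
five = "\xF8\x88\x80\x80\x80".b.force_encoding(Encoding::UTF_8)
p five.valid_encoding?  # => false
```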

#5

Updated by haines (Andrew Haines) over 2 years ago

I'm implementing a tar archive reader that takes an arbitrary stream (StringIO, File, Zlib::GzipReader, ...) and yields the individual files in the archive. I'd like the yielded file reader to conform as closely as possible to the File interface.

I'd like to implement #getc without necessarily being able to modify the external_encoding of the underlying stream. My strategy so far is to keep reading bytes into a buffer and force_encoding to the target encoding, until I have valid_encoding?. If I know the character length limits, then I can bail out if I still don't have a valid character after I've read the maximum number of bytes, return a string containing only the minimum number of bytes, and hold the extras back for the next invocation of #getc (this seems to be the behaviour of IO#getc).

This is how that would look with the proposed methods:

def getc
  check_not_closed!
  return nil if eof?

  char = String.new(encoding: Encoding::BINARY)
  min_char_size, max_char_size = external_encoding.minmax_char_size

  # Accumulate bytes until they form a valid character in the target
  # encoding, or until no single character could possibly be that long.
  until char.size == max_char_size || eof?
    char << read(min_char_size)

    char.force_encoding external_encoding
    return encode(char) if char.valid_encoding?
    char.force_encoding Encoding::BINARY
  end

  # No valid character found: keep the first min_char_size bytes and
  # push the rest back onto the stream for the next call.
  char.slice!(min_char_size..-1).bytes.reverse_each do |byte|
    ungetbyte byte
  end

  encode(char)
end
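
For reference, the same strategy can be exercised today by hardcoding the limits. In this sketch, CharReader and the [1, 4] pair (standing in for Encoding::UTF_8.minmax_char_size, which doesn't exist yet) are illustrative assumptions, not part of the proposal:

```ruby
require 'stringio'

# Minimal stand-in for the reader above; works on any IO-like object
# that responds to read, eof?, and ungetbyte (e.g. StringIO).
class CharReader
  def initialize(io, encoding)
    @io = io
    @encoding = encoding
  end

  def getc
    return nil if @io.eof?
    min, max = 1, 4  # what @encoding.minmax_char_size would return for UTF-8
    char = String.new(encoding: Encoding::BINARY)
    until char.size == max || @io.eof?
      char << @io.read(min)
      # Probe validity on a copy so char stays binary while accumulating.
      return char.force_encoding(@encoding) if char.dup.force_encoding(@encoding).valid_encoding?
    end
    # No valid character: keep min bytes, push the extras back.
    char.slice!(min..-1).to_s.bytes.reverse_each { |b| @io.ungetbyte(b) }
    char.force_encoding(@encoding)
  end
end

reader = CharReader.new(StringIO.new("héllo"), Encoding::UTF_8)
p reader.getc  # => "h"
p reader.getc  # => "é"  (assembled from two bytes)
```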

#6

Updated by phluid61 (Matthew Kerwin) over 2 years ago

haines (Andrew Haines) wrote:

>   until char.size == max_char_size || eof?
>     char << read(min_char_size)

I hope there are no encodings where valid characters might not be a multiple of the minimum size.

#7

Updated by haines (Andrew Haines) over 2 years ago

phluid61 (Matthew Kerwin) wrote:

> I hope there are no encodings where valid characters might not be a multiple of the minimum size.

Me too :) It works for now; the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32: UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.
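
A quick illustration of those code-unit sizes (using the BE variants, which add no byte-order mark):

```ruby
# UTF-16: characters in the BMP take one 16-bit unit, astral characters two.
p "A".encode(Encoding::UTF_16BE).bytesize   # => 2
p "😀".encode(Encoding::UTF_16BE).bytesize  # => 4

# UTF-32 is fixed-length: every character takes four bytes.
p "A".encode(Encoding::UTF_32BE).bytesize   # => 4
p "😀".encode(Encoding::UTF_32BE).bytesize  # => 4
```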

#8

Updated by duerst (Martin Dürst) about 2 years ago

haines (Andrew Haines) wrote:

> phluid61 (Matthew Kerwin) wrote:
>
> > I hope there are no encodings where valid characters might not be a multiple of the minimum size.
>
> Me too :) it works for now... the only encodings on Ruby 2.4.1 with min_enc_len > 1 are UTF-16 and UTF-32; UTF-16 is variable-length with either 1 or 2 16-bit code units, and UTF-32 is fixed-length.

Not true. There are quite a few East Asian encodings with a maximum length of 2, 3, or 4 bytes, e.g. Shift_JIS, EUC_JP, GB18030, ... But it's still true that the maximum size is a multiple of the minimum size.
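
For illustration, the byte lengths of a few transcoded sample characters (these encodings all have a one-byte minimum, since ASCII fits in a single byte; the last line assumes GB18030's property of mapping every non-BMP character to four bytes):

```ruby
# ASCII stays single-byte; common kanji/kana take two bytes.
p "a".encode("Shift_JIS").bytesize   # => 1
p "あ".encode("Shift_JIS").bytesize  # => 2
p "あ".encode("EUC-JP").bytesize     # => 2

# GB18030 covers all of Unicode; characters outside the BMP take four bytes.
p "😀".encode("GB18030").bytesize    # => 4
```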
