Bug #19108
Closed
Format routines like pack blindly treat a string as ASCII-encoded
Description
Format routines like pack and unpack blindly treat a format string as ASCII-encoded, even if its encoding isn't ASCII or ASCII-compatible.
I tried to construct misleading code using ASCII-incompatible encodings but couldn't do it in practice (no ASCII-incompatible encoding has a pack directive's ASCII byte encoded as a printable character).
But I could demonstrate at least some strange behaviour:
p ['foo'].pack('u').encoding # => #<Encoding:US-ASCII>
p ['foo'].pack('u'.encode('UTF-32BE')).encoding # => #<Encoding:ASCII-8BIT>
This is because the NUL characters in the second one (which aren't really NUL characters - they're part of the directive characters) explicitly trigger the encoding to change to binary.
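The byte layout behind this can be inspected directly. A minimal sketch using only core String methods (the variable name fmt is just for illustration):

```ruby
# fmt is an illustrative name for the format string from the report above.
fmt = 'u'.encode('UTF-32BE')

p fmt.bytes                        # => [0, 0, 0, 117]
p fmt.encoding.ascii_compatible?   # => false
p fmt.b.count("\0")                # => 3 (NUL bytes seen when read byte by byte)
```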
There is a warning, but the warning is only for unexpected directives. How about disallowing or warning for format strings in non-ASCII-compatible encodings?
Updated by chrisseaton (Chris Seaton) about 2 years ago
Possibly we should raise an exception if the string is not ascii_only?
Updated by byroot (Jean Boussier) about 2 years ago
I agree that at the very least the unknown pack directive warning should be made non-verbose (displayed even with $VERBOSE=false), and would make sense as ArgumentError.
Updated by Eregon (Benoit Daloze) almost 2 years ago
Agreed, I think it should be ArgumentError since it's otherwise silently ignoring characters in the pack format string.
A non-verbose warning is better than the current state if ArgumentError is deemed too incompatible.
Here is a real case where the silent warning caused confusion for [1].pack('<L'): https://github.com/oracle/truffleruby/issues/2791
EDIT: extracted to #19150
Updated by nobu (Nobuyoshi Nakada) almost 2 years ago
chrisseaton (Chris Seaton) wrote in #note-1:
Possibly we should raise an exception if the string is not ascii_only?
I think you want to mean "if the string is not ASCII-compatible".
Updated by chrisseaton (Chris Seaton) almost 2 years ago
I think you want to mean "if the string is not ASCII-compatible".
Can you explain why?
I think a string is only a valid pack format string if it is ascii_only? - if it isn't ascii_only? then there is a silent warning and the output encoding is changed. We're proposing raising an error up front if the string is not ascii_only?.
Updated by alanwu (Alan Wu) almost 2 years ago
Checking ascii_only? would reject non-ascii comments which are fine:
p [2, 89].pack(<<~PACK)
C # 🚗
c
PACK
p [2, 89].pack('Cc')
# Same output
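The difference between the two predicates discussed in this thread can be seen with a small sketch. The variable names are illustrative; the first string mirrors the commented template above, and the second is the same directives re-encoded as UTF-32BE:

```ruby
# Illustrative names; "\u{1F697}" is the car emoji used in the comment above.
commented = "C # \u{1F697}\nc\n"
utf32     = 'Cc'.encode('UTF-32BE')

p commented.ascii_only?                  # => false (the comment is non-ASCII)
p commented.encoding.ascii_compatible?   # => true  (UTF-8 is ASCII-compatible)
p utf32.ascii_only?                      # => false
p utf32.encoding.ascii_compatible?       # => false
```

So an ascii_only? check would reject both strings, while an ASCII-compatibility check on the encoding rejects only the UTF-32BE one.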
Updated by Eregon (Benoit Daloze) almost 2 years ago
- Related to Bug #19150: pack/unpack silently ignores unknown directives added
Updated by matz (Yukihiro Matsumoto) almost 2 years ago
Template strings should be ASCII compatible, exceptions otherwise.
Matz.
Updated by nobu (Nobuyoshi Nakada) almost 2 years ago
- Status changed from Open to Closed
Applied in changeset git|9869bd1d612b489df806cf95bcb56965a02424e0.
[Bug #19108] Check for the encoding of pack/unpack format
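The changeset itself is not reproduced in this thread. As a rough user-level illustration of the kind of check described above, here is a sketch assuming a hypothetical helper checked_pack (not part of Ruby, and not the code from the changeset); the choice of ArgumentError and the message text are also assumptions:

```ruby
# Hypothetical helper for illustration only; not Ruby's actual implementation.
def checked_pack(values, template)
  # Reject templates whose encoding is not ASCII-compatible (e.g. UTF-16, UTF-32).
  unless template.encoding.ascii_compatible?
    raise ArgumentError, "ASCII-incompatible template encoding: #{template.encoding}"
  end
  values.pack(template)
end

p checked_pack([2, 89], 'Cc')   # => "\x02Y"

begin
  checked_pack([2, 89], 'Cc'.encode('UTF-32BE'))
rescue ArgumentError => e
  puts e.message                # prints the assumed message with the encoding name
end
```

Checking the encoding rather than ascii_only? keeps templates with non-ASCII comments working, which matches the objection raised earlier in the thread.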