Bug #18973
Closed
Kernel#sprintf: %c allows codepoints above 127 for 7-bits ASCII encoding
Description
I've noticed the following behavior:
sprintf("%c".encode("US-ASCII"), 128)
=> "\x80"
sprintf("%c".encode("US-ASCII"), 128).valid_encoding?
=> false
Specifying codepoints 128-255 with a US-ASCII-encoded format string produces a broken string (one whose encoding is invalid).
sprintf("%c".encode("US-ASCII"), 255)
=> "\xFF"
sprintf("%c".encode("US-ASCII"), 256)
(irb):17:in `sprintf': 256 out of char range (RangeError)
Specifying a codepoint greater than 255 raises the expected "out of char range" exception.
I suppose this exception should be raised for codepoints 128-255 as well (for ASCII encoding).
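To make the suggestion concrete, here is a minimal sketch of the stricter check, written as a standalone helper rather than a change to sprintf itself. The name strict_ascii_char is hypothetical and is not part of Ruby.

```ruby
# Hypothetical helper (not a Ruby API): build a one-character US-ASCII
# string, rejecting every codepoint outside the 7-bit range 0..127.
def strict_ascii_char(code)
  unless code.between?(0, 127)
    raise RangeError, "#{code} out of char range"
  end
  code.chr(Encoding::US_ASCII)
end

p strict_ascii_char(65)                 # a valid US-ASCII character
p strict_ascii_char(65).valid_encoding?

begin
  strict_ascii_char(128)
rescue RangeError => e
  puts "RangeError: #{e.message}"
end
```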
Updated by Eregon (Benoit Daloze) over 2 years ago
I noticed https://github.com/ruby/ruby/blob/master/benchmark/app_aobench.rb seems to rely on this behavior.
But that is easily fixed by using # coding: BINARY instead of # coding: US-ASCII.
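A rough sketch of why a binary format string avoids the problem. As an assumption for illustration, String#b is used here to get a binary (ASCII-8BIT) copy of the format string, standing in for a literal that comes from a file with a BINARY magic comment. The helper name binary_char is hypothetical.

```ruby
# Hypothetical helper: format a byte value with a binary (ASCII-8BIT)
# format string. String#b is an assumed stand-in for a literal from a
# file that declares "# coding: BINARY".
def binary_char(code)
  sprintf("%c".b, code)
end

s = binary_char(128)
p s.bytes            # the single byte that was written
p s.encoding         # encoding inherited from the format string
p s.valid_encoding?  # every byte is valid in a binary string
```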
I think it would be good to fix this issue, so that sprintf("%c".encode("US-ASCII"), 128) raises out of char range (RangeError), just like the exception raised for:
> 128.chr(Encoding::US_ASCII)
(irb):2:in `chr': invalid codepoint 0x80 in US-ASCII (RangeError)
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/6276
Updated by nobu (Nobuyoshi Nakada) over 2 years ago
diff --git a/regenc.c b/regenc.c
index 16d62fdf409..5cc3b778351 100644
--- a/regenc.c
+++ b/regenc.c
@@ -627,6 +627,10 @@ onigenc_single_byte_mbc_to_code(const UChar* p, const UChar* end ARG_UNUSED,
extern int
onigenc_single_byte_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
{
+#ifdef RUBY
+ if (code > 0xff)
+ return 0;
+#endif
return 1;
}
Updated by nobu (Nobuyoshi Nakada) over 2 years ago
Sorry, I meant this one.
diff --git a/enc/us_ascii.c b/enc/us_ascii.c
index 08f9072c435..9d854b12245 100644
--- a/enc/us_ascii.c
+++ b/enc/us_ascii.c
@@ -7,6 +7,12 @@
# define ENCINDEX_US_ASCII 0
#endif
+static int
+us_ascii_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
+{
+ return !(code & 0x80);
+}
+
static int
us_ascii_mbc_enc_len(const UChar* p, const UChar* e, OnigEncoding enc)
{
@@ -22,7 +28,7 @@ OnigEncodingDefine(us_ascii, US_ASCII) = {
1, /* min byte length */
onigenc_is_mbc_newline_0x0a,
onigenc_single_byte_mbc_to_code,
- onigenc_single_byte_code_to_mbclen,
+ us_ascii_code_to_mbclen,
onigenc_single_byte_code_to_mbc,
onigenc_ascii_mbc_case_fold,
onigenc_ascii_apply_all_case_fold,
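For readers less familiar with the C idiom, the bit test in us_ascii_code_to_mbclen above can be mirrored in Ruby. This is only an illustration of the expression !(code & 0x80); the method name us_ascii_mbclen is hypothetical.

```ruby
# Hypothetical Ruby mirror of the C expression !(code & 0x80):
# returns 1 when bit 7 is clear and 0 when it is set.
def us_ascii_mbclen(code)
  (code & 0x80).zero? ? 1 : 0
end

[0x41, 0x7f, 0x80, 0xff].each do |code|
  puts format("0x%02X -> %d", code, us_ascii_mbclen(code))
end
```

Note that the mask inspects only bit 7, so for values up to 0xff it is equivalent to code < 128. A larger value such as 0x100 also yields 1 here; the sketch makes no claim about where such codepoints are rejected.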
Updated by Eregon (Benoit Daloze) over 2 years ago
@nobu (Nobuyoshi Nakada) Looks good to me, could you commit it?
(ARG_UNUSED is not needed on code, I think.)
Updated by mame (Yusuke Endoh) over 2 years ago
I am concerned about the compatibility impact of changing this. @naruse (Yui NARUSE) proposed returning an ASCII-8BIT string instead of raising an exception.
Updated by mame (Yusuke Endoh) over 2 years ago
We briefly discussed this issue at the dev meeting. @naruse (Yui NARUSE) said it should behave like String#<<, as follows. @nobu (Nobuyoshi Nakada) said he would try to implement this.
s = "".force_encoding("US-ASCII")
s << 128
p s #=> "\x80"
p s.encoding #=> #<Encoding:ASCII-8BIT>
s = "".force_encoding("US-ASCII")
s << 256 #=> 256 out of char range (RangeError)
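The promotion rule described above can be spelled out explicitly. The following is a sketch only, written so that it does not depend on how String#<< or sprintf behave in any particular Ruby version. The helper name append_promoting is hypothetical, and restricting it to US-ASCII input is an assumption made to keep the example small.

```ruby
# Hypothetical helper: append one byte value to a copy of a US-ASCII
# string, promoting the copy to ASCII-8BIT when the value needs 8 bits.
def append_promoting(str, code)
  unless str.encoding == Encoding::US_ASCII
    raise ArgumentError, "expected a US-ASCII string"
  end
  raise RangeError, "#{code} out of char range" unless code.between?(0, 255)

  result = str.dup
  # Values 128..255 cannot be represented in US-ASCII, so switch the
  # copy to binary before appending.
  result.force_encoding(Encoding::BINARY) if code > 127
  result << code.chr(result.encoding)
end

ascii = String.new("", encoding: Encoding::US_ASCII)
p append_promoting(ascii, 65).encoding   # stays US-ASCII
p append_promoting(ascii, 128).encoding  # promoted to ASCII-8BIT
```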
Updated by nobu (Nobuyoshi Nakada) about 2 years ago
- Status changed from Open to Closed
Applied in changeset git|576bdec03f0d58847690a0607c788ada433ce60f.
[Bug #18973] Promote US-ASCII to ASCII-8BIT when adding 8-bit char
Updated by nobu (Nobuyoshi Nakada) 6 months ago
- Related to Bug #20566: string << 0xC2 should raise a RangeError if the string encoding is Encoding::ASCII added