Project

General

Profile

Actions

Bug #18973

closed

Kernel#sprintf: %c allows codepoints above 127 for 7-bits ASCII encoding

Added by andrykonchin (Andrew Konchin) over 1 year ago. Updated over 1 year ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:109645]

Description

I've noticed the following behavior:

sprintf("%c".encode("US-ASCII"), 128)
=> "\x80"

sprintf("%c".encode("US-ASCII"), 128).valid_encoding?
=> false

Specifying codepoints 128-255 for ASCII encoded formatting sequence leads to a broken string.

sprintf("%c".encode("US-ASCII"), 255)
=> "\xFF"
sprintf("%c".encode("US-ASCII"), 256)
(irb):17:in `sprintf': 256 out of char range (RangeError)

Specifying codepoint greater that 255 causes the expected exception out of char range.

I suppose this exception should be raised for codepoints 128-255 as well (for ASCII encoding).

Updated by Eregon (Benoit Daloze) over 1 year ago

I noticed https://github.com/ruby/ruby/blob/master/benchmark/app_aobench.rb seems to rely on this behavior.
But that is easily fixed by using # coding: BINARY instead of # coding: US-ASCII.

I think it would be good to fix this issue, so sprintf("%c".encode("US-ASCII"), 128) is out of char range (RangeError), just like it is an exception for:

> 128.chr(Encoding::US_ASCII)
(irb):2:in `chr': invalid codepoint 0x80 in US-ASCII (RangeError)

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

diff --git a/regenc.c b/regenc.c
index 16d62fdf409..5cc3b778351 100644
--- a/regenc.c
+++ b/regenc.c
@@ -627,6 +627,10 @@ onigenc_single_byte_mbc_to_code(const UChar* p, const UChar* end ARG_UNUSED,
 extern int
 onigenc_single_byte_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
 {
+#ifdef RUBY
+  if (code > 0xff)
+    return 0;
+#endif
   return 1;
 }
 

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

Sorry, this.

diff --git a/enc/us_ascii.c b/enc/us_ascii.c
index 08f9072c435..9d854b12245 100644
--- a/enc/us_ascii.c
+++ b/enc/us_ascii.c
@@ -7,6 +7,12 @@
 # define ENCINDEX_US_ASCII 0
 #endif
 
+static int
+us_ascii_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
+{
+  return !(code & 0x80);
+}
+
 static int
 us_ascii_mbc_enc_len(const UChar* p, const UChar* e, OnigEncoding enc)
 {
@@ -22,7 +28,7 @@ OnigEncodingDefine(us_ascii, US_ASCII) = {
   1,           /* min byte length */
   onigenc_is_mbc_newline_0x0a,
   onigenc_single_byte_mbc_to_code,
-  onigenc_single_byte_code_to_mbclen,
+  us_ascii_code_to_mbclen,
   onigenc_single_byte_code_to_mbc,
   onigenc_ascii_mbc_case_fold,
   onigenc_ascii_apply_all_case_fold,

Updated by Eregon (Benoit Daloze) over 1 year ago

@nobu (Nobuyoshi Nakada) Looks good to me, could you commit it?
(ARG_UNUSED is not needed on code I think)

Actions #6

Updated by mame (Yusuke Endoh) over 1 year ago

I am concerned about compatibility to change this. @naruse (Yui NARUSE) proposed to return an ASCII-8BIT string instead of raising an exception.

Updated by mame (Yusuke Endoh) over 1 year ago

We brielfly discussed this issue at the dev meeting. @naruse (Yui NARUSE) said it should behave like String#<< as follows. @nobu (Nobuyoshi Nakada) said he would try to implement this.

s = "".force_encoding("US-ASCII")
s << 128
p s          #=> "\x80"
p s.encoding #=> #<Encoding:ASCII-8BIT>

s = "".force_encoding("US-ASCII")
s << 256 #=> 256 out of char range (RangeError)
Actions #8

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Status changed from Open to Closed

Applied in changeset git|576bdec03f0d58847690a0607c788ada433ce60f.


[Bug #18973] Promote US-ASCII to ASCII-8BIT when adding 8-bit char

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0