Bug #18973: Kernel#sprintf: %c allows codepoints above 127 for 7-bits ASCII encoding - Ruby master - Ruby Issue Tracking System

Bug #18973

closed

Kernel#sprintf: %c allows codepoints above 127 for 7-bits ASCII encoding

Added by andrykonchin (Andrew Konchin) over 1 year ago. Updated over 1 year ago.

Status: Closed
Assignee:
Target version:
ruby -v: 3.0.3
Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN

[ruby-core:109645]

Description

I've noticed the following behavior:

sprintf("%c".encode("US-ASCII"), 128)
=> "\x80"

sprintf("%c".encode("US-ASCII"), 128).valid_encoding?
=> false

Specifying a codepoint in the range 128-255 with a US-ASCII-encoded format string produces a broken string.

sprintf("%c".encode("US-ASCII"), 255)
=> "\xFF"
sprintf("%c".encode("US-ASCII"), 256)
(irb):17:in `sprintf': 256 out of char range (RangeError)

Specifying a codepoint greater than 255 raises the expected "out of char range" exception.

I suppose this exception should be raised for codepoints 128-255 as well (for ASCII encoding).
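
The "broken string" part can be reproduced without sprintf, which helps separate the encoding question from the formatting one. The following is a small sketch that is not part of the original report: it only tags the single byte 0x80 with two different encodings and asks Ruby whether each result is valid.

```ruby
# Tag the single byte 0x80 with two encodings and compare validity.
byte = "\x80".b # one byte, ASCII-8BIT (binary)

as_ascii  = byte.dup.force_encoding(Encoding::US_ASCII)
as_binary = byte.dup.force_encoding(Encoding::ASCII_8BIT)

p as_ascii.valid_encoding?  # false: US-ASCII is 7-bit, so 0x80 is not a character
p as_binary.valid_encoding? # true: every byte is valid in ASCII-8BIT
```

The same byte is therefore only a problem when the string claims to be US-ASCII.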

#1 [ruby-core:109646]

Updated by Eregon (Benoit Daloze) over 1 year ago

I noticed https://github.com/ruby/ruby/blob/master/benchmark/app_aobench.rb seems to rely on this behavior.
But that is easily fixed by using # coding: BINARY instead of # coding: US-ASCII.

I think it would be good to fix this issue so that sprintf("%c".encode("US-ASCII"), 128) raises an "out of char range" RangeError, just as the following already raises an exception:

> 128.chr(Encoding::US_ASCII)
(irb):2:in `chr': invalid codepoint 0x80 in US-ASCII (RangeError)
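
The Integer#chr comparison can be turned into a quick check of which codepoints an encoding accepts. This is only an illustrative sketch; the helper name codepoint_valid_in? is made up for this example and is not a Ruby API.

```ruby
# Hypothetical helper: true if +code+ can be turned into a character in
# +enc+, decided purely by whether Integer#chr raises RangeError.
def codepoint_valid_in?(code, enc)
  code.chr(enc)
  true
rescue RangeError
  false
end

p codepoint_valid_in?(127, Encoding::US_ASCII)   # true: last 7-bit codepoint
p codepoint_valid_in?(128, Encoding::US_ASCII)   # false: the RangeError shown above
p codepoint_valid_in?(255, Encoding::ASCII_8BIT) # true: any single byte is accepted
```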

#2 [ruby-core:109650]

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/6276

#3 [ruby-core:109654]

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

diff --git a/regenc.c b/regenc.c
index 16d62fdf409..5cc3b778351 100644
--- a/regenc.c
+++ b/regenc.c
@@ -627,6 +627,10 @@ onigenc_single_byte_mbc_to_code(const UChar* p, const UChar* end ARG_UNUSED,
 extern int
 onigenc_single_byte_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
 {
+#ifdef RUBY
+  if (code > 0xff)
+    return 0;
+#endif
   return 1;
 }

#4 [ruby-core:109655]

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

Sorry, I meant this:

diff --git a/enc/us_ascii.c b/enc/us_ascii.c
index 08f9072c435..9d854b12245 100644
--- a/enc/us_ascii.c
+++ b/enc/us_ascii.c
@@ -7,6 +7,12 @@
 # define ENCINDEX_US_ASCII 0
 #endif
 
+static int
+us_ascii_code_to_mbclen(OnigCodePoint code ARG_UNUSED, OnigEncoding enc ARG_UNUSED)
+{
+  return !(code & 0x80);
+}
+
 static int
 us_ascii_mbc_enc_len(const UChar* p, const UChar* e, OnigEncoding enc)
 {
@@ -22,7 +28,7 @@ OnigEncodingDefine(us_ascii, US_ASCII) = {
   1,           /* min byte length */
   onigenc_is_mbc_newline_0x0a,
   onigenc_single_byte_mbc_to_code,
-  onigenc_single_byte_code_to_mbclen,
+  us_ascii_code_to_mbclen,
   onigenc_single_byte_code_to_mbc,
   onigenc_ascii_mbc_case_fold,
   onigenc_ascii_apply_all_case_fold,
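
For readers less used to C, the bit test in the patch above can be mirrored in Ruby. This is a rough model only: the Ruby method below merely reuses the C function's name for illustration, and it assumes that a return value of 0 is how the patch signals "no encoding for this codepoint".

```ruby
# Ruby mirror of the C expression !(code & 0x80): 1 when bit 7 is clear,
# 0 when it is set. Hypothetical method, named after the C function.
def us_ascii_code_to_mbclen(code)
  (code & 0x80).zero? ? 1 : 0
end

[0x41, 0x7f, 0x80, 0xff].each do |code|
  printf("0x%02x -> %d\n", code, us_ascii_code_to_mbclen(code))
end
# prints:
# 0x41 -> 1
# 0x7f -> 1
# 0x80 -> 0
# 0xff -> 0
```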

#5 [ruby-core:109657]

Updated by Eregon (Benoit Daloze) over 1 year ago

@nobu (Nobuyoshi Nakada) Looks good to me, could you commit it?
(I don't think ARG_UNUSED is needed on the code parameter.)

Updated by mame (Yusuke Endoh) over 1 year ago

I am concerned about the compatibility impact of changing this. @naruse (Yui NARUSE) proposed returning an ASCII-8BIT string instead of raising an exception.

#7 [ruby-core:109688]

Updated by mame (Yusuke Endoh) over 1 year ago

We briefly discussed this issue at the dev meeting. @naruse (Yui NARUSE) said it should behave like String#<<, as shown below. @nobu (Nobuyoshi Nakada) said he would try to implement this.

s = "".force_encoding("US-ASCII")
s << 128
p s          #=> "\x80"
p s.encoding #=> #<Encoding:ASCII-8BIT>

s = "".force_encoding("US-ASCII")
s << 256 #=> 256 out of char range (RangeError)
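
As a rough illustration of the String#<< semantics described above, the sketch below builds a one-character string by appending an integer to an empty string. The helper name char_like_append is hypothetical; it is not how sprintf is implemented, only a way to observe the three cases side by side.

```ruby
# Hypothetical helper: build a one-character string from +code+ via
# String#<<, so the result follows String#<<'s promotion rules.
def char_like_append(code, enc)
  String.new(encoding: enc) << code
end

a = char_like_append(65, Encoding::US_ASCII)
p a, a.encoding # 7-bit value: the string stays US-ASCII

b = char_like_append(128, Encoding::US_ASCII)
p b, b.encoding # 8-bit value: promoted to ASCII-8BIT, as in the example above

begin
  char_like_append(256, Encoding::US_ASCII)
rescue RangeError => e
  p e.class     # values above 255 still raise RangeError
end
```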

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

Status changed from Open to Closed

Applied in changeset git|576bdec03f0d58847690a0607c788ada433ce60f.

[Bug #18973] Promote US-ASCII to ASCII-8BIT when adding 8-bit char
