Bug #21842: Encoding of rb_interned_str

Added by herwin (Herwin W) about 1 month ago. Updated 14 days ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux], but seen on 3.0 - 4.1-dev
[ruby-core:124579]

Description

This is one of the API methods to get an fstring. The documentation in the source says the following:

/**
 * Identical to rb_str_new(), except it returns an infamous "f"string.  What is
 * a  fstring?  Well  it is  a special  subkind of  strings that  is immutable,
 * deduped globally, and managed by our GC.   It is much like a Symbol (in fact
 * Symbols  are dynamic  these days  and are  backended using  fstrings).  This
 * concept has been  silently introduced at some point in  2.x era.  Since then
 * it  gained  wider acceptance  in  the  core.   Starting from  3.x  extension
 * libraries can also generate ones.
 *
 * @param[in]  ptr           A memory region of `len` bytes length.
 * @param[in]  len           Length  of  `ptr`,  in bytes,  not  including  the
 *                           terminating NUL character.
 * @exception  rb_eArgError  `len` is negative.
 * @return     A  found or  created instance  of ::rb_cString,  of `len`  bytes
 *             length, of  "binary" encoding,  whose contents are  identical to
 *             that of `ptr`.
 * @pre        At  least  `len` bytes  of  continuous  memory region  shall  be
 *             accessible via `ptr`.
 */
VALUE rb_interned_str(const char *ptr, long len);

I tried to create some specs for them (https://github.com/ruby/spec/pull/1327), but instead of binary encoding, the string is actually encoded as US-ASCII. This may result in some weird behaviour if the input contains bytes that are not valid in US-ASCII (the following is more an observation of the current behaviour):

it "support binary strings that are invalid in ASCII encoding" do
  str = "foo\x81bar\x82baz".b
  result = @s.rb_interned_str(str, str.bytesize)
  result.encoding.should == Encoding::US_ASCII
  result.should == str.dup.force_encoding(Encoding::US_ASCII)
  result.should_not.valid_encoding?
end

So it seems to me that either the implementation or the documentation is incorrect.

(rb_interned_str_cstr has the same behaviour; it's pretty much the same thing, except that it uses a null terminator instead of an explicit length argument.)
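To illustrate why the US-ASCII tag is surprising for this input, here is a minimal pure-Ruby sketch. It does not call rb_interned_str at all; it only relabels the same bytes with String#force_encoding, on the assumption that this mirrors what the spec above observes.

```ruby
# Pure-Ruby illustration only; no C extension is involved here.
raw = "foo\x81bar\x82baz".b # source bytes, tagged BINARY (ASCII-8BIT)

as_binary = raw.dup
as_ascii  = raw.dup.force_encoding(Encoding::US_ASCII)

# The bytes are identical; only the encoding tag differs.
p as_ascii.bytes == as_binary.bytes

# Any byte sequence is valid as BINARY ...
p as_binary.valid_encoding?

# ... but bytes >= 0x80 are not valid US-ASCII.
p as_ascii.valid_encoding?
```

The last check is the `result.should_not.valid_encoding?` expectation from the spec: the content is untouched, but the US-ASCII label makes the string report a broken encoding.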


Related issues 1 (0 open, 1 closed)

Related to Ruby - Feature #13381: [PATCH] Expose rb_fstring and its family to C extensions (Closed)

Updated by herwin (Herwin W) about 1 month ago Actions #1

  • ruby -v set to ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux], but seen on 3.0 - 4.1-dev

Updated by byroot (Jean Boussier) about 1 month ago Actions #2

  • Related to Feature #13381: [PATCH] Expose rb_fstring and its family to C extensions added

Updated by byroot (Jean Boussier) about 1 month ago Actions #3 [ruby-core:124580]

Hum, good find. So the function was exposed as a result of [Feature #13381], before that the function was internal.

In that ticket we didn't discuss the default encoding, but it might be fair to assume it should have been BINARY (aka ASCII-8BIT) like rb_str_new*.

The function was later documented in https://github.com/ruby/ruby/commit/091faca99ca and assumed to default to ASCII-8BIT.

At first glance I'd say it makes sense to treat this as a bug and change the default encoding.

On the other hand, one could argue that interned binary strings don't make that much sense.

I don't have a strong opinion either way.

Updated by Eregon (Benoit Daloze) about 1 month ago Actions #4 [ruby-core:124581]

From https://github.com/truffleruby/truffleruby/issues/4018#issuecomment-3549329873, it seems everyone's expectation is that it returns a BINARY String, like rb_str_new().
@byroot (Jean Boussier) Could you make a PR to fix it?

Updated by byroot (Jean Boussier) about 1 month ago Actions #6

  • Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN to 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED

Updated by byroot (Jean Boussier) about 1 month ago Actions #7 [ruby-core:124584]

Fix merged (Redmine seems to be lagging behind, but will probably pick it up).

Backport PRs:

Updated by nobu (Nobuyoshi Nakada) about 1 month ago · Edited Actions #8 [ruby-core:124585]

I think it should be US-ASCII for 7bit only strings, as well as Symbols.
GH-15894
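The rule proposed here can be modelled in plain Ruby. This is only a sketch of the decision as described in this comment, not the C implementation; `expected_interned_encoding` is a hypothetical helper name invented for illustration.

```ruby
# Hypothetical helper, not part of Ruby's API: models the rule
# "US-ASCII for 7-bit-only content, BINARY otherwise".
def expected_interned_encoding(bytes)
  bytes.b.ascii_only? ? Encoding::US_ASCII : Encoding::BINARY
end

p expected_interned_encoding("foo")        # 7-bit only
p expected_interned_encoding("foo\x81bar") # contains a byte >= 0x80

# For comparison, a Symbol made from 7-bit-only content is US-ASCII too.
p :foo.encoding
```

Under this rule, the string from the spec in the description would be tagged BINARY, so it would no longer report an invalid encoding.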

Updated by herwin (Herwin W) about 1 month ago Actions #9 [ruby-core:124588]

I've made a short update of the documentation in https://github.com/ruby/ruby/pull/15897, mostly to explain what information is used to determine the encoding of the result.

I've tried to keep the line width usage similar to the original, which meant doubling some random spaces until it lined up. I would not mind dropping this dependency, since it makes updating these texts a whole lot easier.

Updated by herwin (Herwin W) about 1 month ago Actions #10

  • Status changed from Open to Closed

Applied in changeset git|b4a62a1ca949d93332ad8bce0fcc273581160cc5.


[DOC] Update docs for rb_interned_str and related functions (#15897)

Related to [Bug #21842].

  • rb_interned_str: document what decides whether the returned string is
    in US-ASCII or BINARY encoding.
  • rb_interned_str_cstr: include the same description as rb_interned_str
    for the encoding. This one was still missing the update for US-ASCII
    and erroneously said the returned string was always in BINARY encoding.
  • rb_str_to_interned_str: document how the encoding of the result is
    defined.

Co-authored-by: Herwin

Updated by k0kubun (Takashi Kokubun) 14 days ago Actions #11 [ruby-core:124745]

  • Backport changed from 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: DONE