Bug #21842
Open
Encoding of rb_interned_str
Description
This is one of the API methods to get an fstring. The documentation in the source says the following:
/**
* Identical to rb_str_new(), except it returns an infamous "f"string. What is
* a fstring? Well it is a special subkind of strings that is immutable,
* deduped globally, and managed by our GC. It is much like a Symbol (in fact
* Symbols are dynamic these days and are backended using fstrings). This
* concept has been silently introduced at some point in 2.x era. Since then
* it gained wider acceptance in the core. Starting from 3.x extension
* libraries can also generate ones.
*
* @param[in] ptr A memory region of `len` bytes length.
* @param[in] len Length of `ptr`, in bytes, not including the
* terminating NUL character.
* @exception rb_eArgError `len` is negative.
* @return A found or created instance of ::rb_cString, of `len` bytes
* length, of "binary" encoding, whose contents are identical to
* that of `ptr`.
* @pre At least `len` bytes of continuous memory region shall be
* accessible via `ptr`.
*/
VALUE rb_interned_str(const char *ptr, long len);
I tried to create some specs for them (https://github.com/ruby/spec/pull/1327), but instead of binary encoding, the string is actually encoded as US-ASCII. This may result in some weird behaviour if the input contains bytes that are not valid in US-ASCII (the following is more of an observation of the current behaviour):
  it "support binary strings that are invalid in ASCII encoding" do
    str = "foo\x81bar\x82baz".b
    result = @s.rb_interned_str(str, str.bytesize)
    result.encoding.should == Encoding::US_ASCII
    result.should == str.dup.force_encoding(Encoding::US_ASCII)
    result.should_not.valid_encoding?
  end
So it seems to me that either the implementation or the documentation is incorrect.
(rb_interned_str_cstr has the same behaviour; it's pretty much the same thing, except that it relies on a NUL terminator instead of an explicit length argument.)
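To make the difference concrete without a C extension, here is a minimal pure-Ruby sketch. It does not call rb_interned_str; the variables `documented` and `observed` are hypothetical stand-ins for "what the doc comment promises" and "what the spec above observes", built with String#b and String#force_encoding.

```ruby
# Same bytes as in the spec above, tagged as BINARY (ASCII-8BIT).
bytes = "foo\x81bar\x82baz".b

# Hypothetical stand-in for the documented result: a BINARY string.
documented = bytes.dup

# Hypothetical stand-in for the observed result: the same bytes tagged US-ASCII.
observed = bytes.dup.force_encoding(Encoding::US_ASCII)

# The raw bytes are identical; only the encoding tag differs.
same_bytes = documented.bytes == observed.bytes   # => true

# Every byte sequence is valid in BINARY, but 0x81 and 0x82 are not valid US-ASCII.
documented_valid = documented.valid_encoding?     # => true
observed_valid   = observed.valid_encoding?       # => false
```

So with the current behaviour, any caller passing bytes above 0x7F gets back a string that reports a broken encoding, even though the input was perfectly valid as binary data.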
Updated by herwin (Herwin W) about 24 hours ago
- ruby -v set to ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux], but seen on 3.0 - 4.1-dev
Updated by byroot (Jean Boussier) about 22 hours ago
- Related to Feature #13381: [PATCH] Expose rb_fstring and its family to C extensions added
Updated by byroot (Jean Boussier) about 22 hours ago
Hum, good find. The function was exposed as a result of [Feature #13381]; before that it was internal.
In that ticket we didn't discuss the default encoding, but it might be fair to assume it should have been BINARY (aka ASCII-8BIT) like rb_str_new*.
The function was later documented in https://github.com/ruby/ruby/commit/091faca99ca and assumed to default to ASCII-8BIT.
At first glance I'd say it makes sense to treat this as a bug and change the default encoding.
On the other hand, one could argue that interned binary strings don't make that much sense.
I don't have a strong opinion either way.
Updated by Eregon (Benoit Daloze) about 21 hours ago
From https://github.com/truffleruby/truffleruby/issues/4018#issuecomment-3549329873, it seems everyone's expectation is that it returns a BINARY String, like rb_str_new().
@byroot (Jean Boussier) Could you make a PR to fix it?
Updated by byroot (Jean Boussier) about 20 hours ago
- Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN to 3.2: WONTFIX, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED
Updated by byroot (Jean Boussier) about 18 hours ago
Fix merged (Redmine seems to be lagging behind, but will probably pick it up).
Backport PRs:
Updated by nobu (Nobuyoshi Nakada) about 17 hours ago
I think it should be US-ASCII for 7-bit-only strings, the same as for Symbols.
GH-15894
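For reference, the Symbol behaviour mentioned here can be observed from plain Ruby. The sketch below also includes `proposed_encoding`, a hypothetical helper (not part of Ruby or of any patch) that merely spells out the suggested rule: US-ASCII when every byte is 7-bit, BINARY otherwise.

```ruby
# Symbols built from binary strings: 7-bit-only content is tagged US-ASCII,
# anything with high bytes keeps the BINARY (ASCII-8BIT) encoding.
ascii_sym  = "foo".b.to_sym
binary_sym = "foo\x81".b.to_sym

ascii_sym_encoding  = ascii_sym.encoding    # Encoding::US_ASCII
binary_sym_encoding = binary_sym.encoding   # Encoding::BINARY

# Hypothetical helper illustrating the suggested rule for interned strings.
# This is an assumption about what "like Symbols" would mean, not actual core code.
def proposed_encoding(bytes)
  bytes.b.ascii_only? ? Encoding::US_ASCII : Encoding::BINARY
end

proposed_for_ascii  = proposed_encoding("foo")       # Encoding::US_ASCII
proposed_for_binary = proposed_encoding("foo\x81")   # Encoding::BINARY
```

Under that rule, pure-ASCII input would keep returning US-ASCII strings as it does today, and only input with high bytes would change to BINARY.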