Feature #14919

open

Add String#byteinsert

Added by aycabta (aycabta .) over 5 years ago. Updated about 1 year ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:87975]

Description

This is important for editing multibyte Strings. A Unicode grapheme cluster sometimes consists of multiple code points. In text editing, software sometimes needs to add a new code point to an existing grapheme cluster. String#byteinsert is important for that.

I implemented it in pure Ruby in my own code:
https://github.com/aycabta/reline/blob/b17e5fd61092adfd7e87d576301e4e19a4d9e6d8/lib/reline/line_editor.rb#L255-L260
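
For readers who do not follow the link, here is a minimal sketch of what a byte-indexed insert could look like in pure Ruby. It is not the linked reline code and not an existing String method; the helper name `byteinsert` and its argument order are assumptions made for illustration only.

```ruby
# Hypothetical helper, written only to illustrate the proposal.
# Inserts `other` into `str` at a byte offset and returns a new String.
def byteinsert(str, byte_index, other)
  if byte_index < 0 || byte_index > str.bytesize
    raise IndexError, "byte index #{byte_index} out of range"
  end

  # Work on binary copies so slicing is done per byte, not per character.
  bytes = str.b
  head  = bytes.byteslice(0, byte_index)
  tail  = bytes.byteslice(byte_index, bytes.bytesize - byte_index)

  # Reassemble and restore the original encoding tag.
  (head + other.b + tail).force_encoding(str.encoding)
end

s = "I \u2665 Ruby"             # the heart occupies bytes 2..4 in UTF-8
t = byteinsert(s, 5, "\uFE0F")  # add a variation selector right after it
broken = byteinsert(s, 3, "x")  # offset inside the heart: invalid UTF-8
```

Note that nothing stops the caller from passing an offset in the middle of a character, in which case the result is no longer validly encoded.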

Updated by aycabta (aycabta .) over 5 years ago

  • Tracker changed from Bug to Feature
  • Backport deleted (2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN)

Updated by duerst (Martin Dürst) over 5 years ago

aycabta (aycabta .) wrote:

This is important for editing multibyte Strings. A Unicode grapheme cluster sometimes consists of multiple code points. In text editing, software sometimes needs to add a new code point to an existing grapheme cluster. String#byteinsert is important for that.

Can you explain this a bit more? Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.

Updated by aycabta (aycabta .) over 5 years ago

duerst (Martin Dürst) wrote:

Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.

Input from CLI

In a CLI tool, every character arrives as individual bytes, so all multibyte characters are split. In the middle of a line, the software should insert the new character rather than replace an existing one.

Yank

In the middle of a line, a yank operation needs #byteinsert for multibyte editing.

Updated by duerst (Martin Dürst) over 5 years ago

aycabta (aycabta .) wrote:

duerst (Martin Dürst) wrote:

Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.

Input from CLI

In a CLI tool, every character arrives as individual bytes, so all multibyte characters are split.

On the lowest level, characters indeed come in as a string of bytes. But it would be wrong to insert individual bytes into a string unless these bytes are also characters. It would just lead to mojibake.

The right thing to do is to collect a (small) number of bytes, check how many bytes are needed to form one or more characters, insert these characters into the string, and keep the remaining bytes for further processing (wait until more bytes arrive so that we get more complete codepoints/characters).
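
The buffering described above can be sketched roughly as follows. This is only an illustration under assumptions: the input is assumed to be UTF-8, the helper name `take_complete_chars` is hypothetical, and malformed input is not validated (only a possibly unfinished character at the end of the buffer is held back).

```ruby
# Hypothetical helper: splits a byte buffer into
# [complete UTF-8 text, bytes of a trailing unfinished character].
def take_complete_chars(buffer)
  data = buffer.b
  cut  = data.bytesize

  # Look back at most 3 bytes for the lead byte of the last character.
  1.upto([3, data.bytesize].min) do |back|
    byte = data.getbyte(data.bytesize - back)
    next if byte & 0xC0 == 0x80 # continuation byte, keep looking back

    needed =
      if byte >= 0xF0 then 4
      elsif byte >= 0xE0 then 3
      elsif byte >= 0xC0 then 2
      else 1
      end
    # Hold the sequence back if it does not have all of its bytes yet.
    cut = data.bytesize - back if needed > back
    break
  end

  complete = data.byteslice(0, cut).force_encoding(Encoding::UTF_8)
  pending  = data.byteslice(cut, data.bytesize - cut)
  [complete, pending]
end

line    = +"I  Ruby"
pending = "".b
[0xE2, 0x99, 0xA5].each do |byte| # the three UTF-8 bytes of U+2665
  text, pending = take_complete_chars(pending + byte.chr)
  line.insert(2, text) unless text.empty?
end
```

The string is only touched once a whole character is available, so `line` stays validly encoded after every step.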

In the middle of a line, the software should insert the new character rather than replace an existing one.

Insertion of characters can be done with String#[]=.
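
As a small illustration of that point (the sample strings are made up for this sketch): assigning to a zero-length slice inserts without replacing anything, and String#insert does the same with a character index.

```ruby
line = +"I  Ruby!"

# A zero-length slice at character index 2 replaces nothing,
# so the assignment is a pure insertion.
line[2, 0] = "\u2665"

# Add a variation selector right after the heart (character index 3).
line[3, 0] = "\uFE0F"

# String#insert is the equivalent method form.
other = +"I  Ruby!"
other.insert(2, "\u2665\uFE0F")
```

Both strings end up identical, and both are validly encoded at every step.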

Yank

In the middle of a line, a yank operation needs #byteinsert for multibyte editing.

I still don't see why. You don't want to insert bytes, you want to insert characters, so that the String is correctly encoded at all times.

Updated by shevegen (Robert A. Heiler) over 5 years ago

I don't have a specific opinion on the suggestion itself; Martin raised some valid
points, in my opinion. But I wanted to comment on something else.

There have been some suggestions to the developer meeting, as recently as 8 hours
ago; so probably just shortly before the developer meeting started:

https://bugs.ruby-lang.org/issues/14861

This is a very short time frame. I would like to suggest to give a little bit more
time before the developer meeting, so that other people can also comment on the
suggestions. Something like +24 hours or so if it has not yet been discussed; I feel
that ~8 hours without any real possibility for a discussion is very, very short.

Updated by noraj (Alexandre ZANNI) about 1 year ago

Yes, a grapheme can be composed of several code points.

An example is a variation selector:

irb(main):001:0> a = "\u2665\n\u2764\n\u2665\ufe0f\n\u2764\ufe0f"
=> "♥\n❤\n♥️\n❤️"
irb(main):002:0> puts a
♥
❤
♥️
❤️
=> nil
irb(main):003:0> a.chars
=> ["♥", "\n", "❤", "\n", "♥", "️", "\n", "❤", "️"]

But fortunately, in Ruby, string indices already map to characters (code points) and not to graphemes. So, as Martin highlighted, String#[]= already covers all the use cases I can think of.

irb(main):007:0> r = "I \u2665 Ruby!"
=> "I ♥ Ruby!"
irb(main):009:0> r[2] = "\u2764\ufe0f"
=> "❤️"
irb(main):010:0> r
=> "I ❤️ Ruby!"
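
To make the three levels explicit, here is a short sketch of the same heart counted as bytes, code points, and grapheme clusters. It assumes UTF-8 strings and a Ruby version that provides String#grapheme_clusters (2.5 or later).

```ruby
heart = "\u2764\uFE0F" # HEAVY BLACK HEART + VARIATION SELECTOR-16

byte_count      = heart.bytesize                 # 6 bytes in UTF-8
codepoint_count = heart.chars.length             # 2 code points
grapheme_count  = heart.grapheme_clusters.length # 1 grapheme cluster

# Indices count code points, so replacing the whole grapheme
# needs a slice of length 2.
r = +"I \u2764\uFE0F Ruby!"
r[2, 2] = "\u2665"
```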

The only use I can think of for String#byteinsert would be to manipulate the UTF-8 bytes directly in order to forge an invalid encoding on purpose. But such a use case is rare and advanced, so maybe it can be handled with pack and unpack rather than by adding a new byteinsert method?

irb(main):014:0> r.unpack1('a*')
=> "I \xE2\x9D\xA4\xEF\xB8\x8F Ruby!"
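
A rough sketch of that pack/unpack route, in case it helps the discussion. The inserted byte 0xFF and the offset 3 are arbitrary choices for illustration; 0xFF never appears in valid UTF-8.

```ruby
r = "I \u2764\uFE0F Ruby!"

# Unpack into an array of byte values, splice a byte into the
# middle of the heart's 3-byte sequence, then pack it back.
bytes = r.unpack("C*")
bytes.insert(3, 0xFF)

# pack("C*") returns a binary string; retag it as UTF-8.
forged = bytes.pack("C*").force_encoding(Encoding::UTF_8)

forged_is_valid = forged.valid_encoding?
```

Here `forged_is_valid` is false, which is the whole point of this kind of manipulation.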

@aycabta (aycabta .) Maybe you could give a concrete example of a use of String#byteinsert that I haven't thought of?
