Bug #14400

IO#ungetc and IO#ungetbyte documentation is inconsistent with the behavior

Added by Eregon (Benoit Daloze) 10 months ago. Updated 9 months ago.

Status:
Feedback
Priority:
Normal
Target version:
-
ruby -v:
ruby 2.6.0dev (2018-01-25 trunk 62035) [x86_64-linux]
[ruby-core:85108]

Description

The documentation of IO#ungetc states:

Pushes back one character (passed as a parameter) onto ios, such that a
subsequent buffered character read will return it. Only one character may be
pushed back before a subsequent read operation (that is, you will be able to
read only the last of several characters that have been pushed back). Has no
effect with unbuffered reads (such as IO#sysread).

And similarly for IO#ungetbyte:

Pushes back bytes (passed as a parameter) onto ios, such that a
subsequent buffered read will return it. Only one byte may be pushed back
before a subsequent read operation (that is, you will be able to read only the
last of several bytes that have been pushed back). Has no effect with
unbuffered reads (such as IO#sysread).

The part about only one byte/character is inconsistent with the actual behavior,
most notably because both of these methods accept a String with multiple characters as an argument.

STDIN.ungetc "Hello World!"
STDIN.read 12 #=> "Hello World!"

STDIN.ungetbyte "Foo Bar"
STDIN.read 7 #=> "Foo Bar"
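The same behavior can be reproduced non-interactively, for example with a pipe reader in place of STDIN (a sketch; any readable IO behaves the same way):

```ruby
# Reproducing the documented-vs-actual discrepancy with a pipe:
# both methods happily accept multi-character strings.
r, w = IO.pipe

r.ungetc "Hello World!"
p r.read(12)    # => "Hello World!"

r.ungetbyte "Foo Bar"
p r.read(7)     # => "Foo Bar"
```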

(There are even specs for it:
https://github.com/ruby/spec/blob/7fa22023d69620ea3ff4d0ed2eb71fd7b02dd950/core/io/ungetc_spec.rb#L98
https://github.com/ruby/spec/blob/7fa22023d69620ea3ff4d0ed2eb71fd7b02dd950/core/io/ungetbyte_spec.rb#L21)

that is, you will be able to read only the last of several characters that have been pushed back

contradicts what actually happens.

The behavior with large Strings is confusing: arbitrarily large strings seem to be accepted, but only when there was no prior ungetbyte (i.e., when the buffer was empty?).

$ pry
[1] pry(main)> STDIN.ungetbyte "a"*10_000
=> nil
[2] pry(main)> STDIN.ungetbyte "a"*10_000
IOError: ungetbyte failed

$ pry
[1] pry(main)> STDIN.ungetbyte "a"*100_000
=> nil
[2] pry(main)> STDIN.ungetbyte "a"*100_000
IOError: ungetbyte failed
from (pry):2:in `ungetbyte'

$ pry
[1] pry(main)> STDIN.ungetbyte "a"*100_000
=> nil
[2] pry(main)> STDIN.read(100_000).size
=> 100000
[3] pry(main)> STDIN.ungetbyte "a"*100_000
=> nil
[4] pry(main)> STDIN.read(100_000).size
=> 100000

And it's not as simple as two consecutive ungetbyte calls being forbidden:

$ pry
[1] pry(main)> STDIN.ungetbyte "a"*10_000_000
=> nil
[2] pry(main)> STDIN.ungetbyte "a"
IOError: ungetbyte failed
from (pry):2:in `ungetbyte'

$ pry
[1] pry(main)> STDIN.ungetbyte "a"
=> nil
[2] pry(main)> STDIN.ungetbyte "a"
=> nil

So how are those methods supposed to behave?
Can the documentation be updated to match the behavior and/or the behavior be fixed to be simpler?

I also wonder when those methods are useful.
There seem to be very few usages in the stdlib.
Maybe they should just be removed?
It seems easy to make a custom IO wrapper/buffer supporting pushing characters/bytes back.
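For instance, a wrapper along those lines might look like the following. `PushbackIO` is a hypothetical name and this is only a sketch: it keeps its own pushback buffer in a binary String, so pushback is unlimited and independent of the fixed-size internal IO read buffer.

```ruby
# Hypothetical sketch: an IO wrapper with unlimited pushback, buffered
# in a plain Ruby String instead of the fixed-size internal IO buffer.
class PushbackIO
  def initialize(io)
    @io = io
    @buf = String.new(encoding: Encoding::BINARY) # pushback buffer
  end

  # Push a String (or a byte given as an Integer) back onto the stream;
  # subsequent reads return it before any data from the wrapped IO.
  def ungetbyte(data)
    data = data.chr if data.is_a?(Integer)
    @buf.prepend(data.b)
    nil
  end
  alias ungetc ungetbyte

  # Read up to +length+ bytes, draining the pushback buffer first;
  # returns nil at EOF, like IO#read with a length argument.
  def read(length)
    out = @buf.slice!(0, length)
    if out.bytesize < length
      rest = @io.read(length - out.bytesize)
      out << rest if rest
    end
    out.empty? && length > 0 ? nil : out
  end
end

r, w = IO.pipe
w.write("stream")
w.close
io = PushbackIO.new(r)
io.ungetbyte("Hello ")
p io.read(12) # => "Hello stream"
```

Since the pushed-back data lives in an ordinary String, there is no hidden capacity limit and no interaction with the IO buffer internals.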

History

#1 [ruby-core:85672] Updated by mame (Yusuke Endoh) 9 months ago

  • Assignee set to akr (Akira Tanaka)
  • Status changed from Open to Assigned

#2 [ruby-core:85676] Updated by akr (Akira Tanaka) 9 months ago

  • Status changed from Assigned to Feedback

Basically, ungetbyte/ungetc work if there is enough space in the IO buffer.

Since the IO buffer is not dynamically extended and
the buffer size is an internal detail,
we can only guarantee a very small amount of space:
(1) there is 1 byte of space at the beginning, and
(2) there is 1 byte of space just after reading some data.

Note that the current implementation of ungetbyte allocates
a buffer large enough to hold the given string if the IO buffer is not allocated yet.
(This is why we can ungetbyte large data at the beginning.)
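Guarantee (2) can be exercised non-interactively; a minimal sketch using a pipe (any buffered IO would do):

```ruby
r, w = IO.pipe
w.write("hello")
w.close

b = r.getbyte     # buffered read: fills the internal buffer, consumes one byte
r.ungetbyte(b)    # guarantee (2): one byte of space right after a read
p r.read          # => "hello"
```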

Some ideas:

  • describe the above guarantee in the documentation.
  • refine the error message from "ungetbyte failed" to something like "ungetbyte: insufficient IO buffer space".

Also, I found a situation where we currently cannot push back even one (multibyte) character.

% ./ruby -e 'print "a" * 1024*32'|./ruby -e 'p STDIN.getc; STDIN.ungetc "\u3042"'
"a"
Traceback (most recent call last):
    1: from -e:1:in `<main>'
-e:1:in `ungetc': ungetbyte failed (IOError)

I'm not sure what we can do here.

#3 [ruby-core:85692] Updated by akr (Akira Tanaka) 9 months ago

Even worse, I found an example where ungetc cannot push back
a character that was just returned by getc.

% ./ruby -e 'print "a" * (1024*8-2) + "\n\u0080" + "b" * (1024*8-1)'|
./ruby -e 'STDIN.gets; c = STDIN.getc; p c; STDIN.ungetc(c)'
"\u0080"
Traceback (most recent call last):
    1: from -e:1:in `<main>'
-e:1:in `ungetc': ungetbyte failed (IOError)

So the current implementation guarantees only that a single "byte"
can be pushed back after reading data.
That is not enough to push back a multibyte character.

% ./ruby -v
ruby 2.6.0dev (2018-02-20 trunk 62490) [x86_64-linux]

#4 [ruby-core:85693] Updated by Eregon (Benoit Daloze) 9 months ago

Could the buffer grow automatically to allow pushing multiple bytes/characters back?
That's the current implementation in TruffleRuby and Rubinius: the buffer grows on ungetbyte/ungetc if there is not enough space.
Then the semantics for the user would be much simpler.

#5 [ruby-core:85702] Updated by akr (Akira Tanaka) 9 months ago

I think it is possible to grow the IO read buffer (rbuf) if it is properly locked.

Since the IO buffer is modified without the GVL
(to avoid blocking the whole process),
growing (realloc'ing) the buffer without a lock is too dangerous.

It seems rb_io_t currently has no lock for rbuf.
(I guess write_lock is not a lock for rbuf.)
