Bug #20869
closedIO buffer handling is inconsistent when seeking
Description
When performing any of the seek based operations on IO (IO#seek, IO#pos=, or IO#rewind), the read buffer is inconsistently cleared:
require 'tempfile'
Tempfile.open do |f|
f.write('0123456789')
f.rewind
# Calling #ungetbyte as the first read buffer
# operation uses a buffer that is preserved during
# seek operations
f.ungetbyte(97)
# Byte buffer will not be cleared
f.seek(2, :SET)
f.getbyte # => 97
end
Tempfile.open do |f|
f.write('0123456789')
f.rewind
# Calling #getbyte before #ungetbyte uses a
# buffer that is not preserved when seeking
f.getbyte
f.ungetbyte(97)
# Byte buffer will be cleared
f.seek(2, :SET)
f.getbyte # => 50
end
Similar behavior happens when reading characters:
require 'tempfile'
Tempfile.open do |f|
f.write('0123456789')
f.rewind
# Calling #ungetc as the first read buffer
# operation uses a buffer that is preserved during
# seek operations
f.ungetc('a')
# Character buffer will not be cleared
f.seek(2, :SET)
f.getc # => 'a'
end
Tempfile.open do |f|
f.write('0123456789')
f.rewind
# Calling #getc before #ungetc uses a
# buffer that is not preserved when seeking
f.getc
f.ungetc('a')
# Character buffer will be cleared
f.seek(2, :SET)
f.getc # => '2'
end
When transcoding, however, the character buffer is never cleared when seeking:
require 'tempfile'
Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
f.write('0123456789')
f.rewind
f.ungetc('a'.encode('utf-16le'))
# Character buffer will not be cleared
f.seek(2, :SET)
f.getc # => 'a'.encode('utf-16le')
end
Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
f.write('0123456789')
f.rewind
f.getc
f.ungetc('a'.encode('utf-16le'))
# Character buffer will not be cleared
f.seek(2, :SET)
f.getc # => 'a'.encode('utf-16le')
end
I would expect the buffers to be cleared in all cases except possibly when the seek operation doesn't actually move the file pointer such as when calling IO#pos or IO#seek(0, :CUR). The inconsistent behavior demonstrated here is a problem regardless though.
Updated by byroot (Jean Boussier) 8 days ago ยท Edited
I just looked into this a bit, I'm not quite familiar enough with the code to really propose a fix, but I get what is happening:
ungetbyte just shift the buffer offset, but the FD offset in unchanged.
static void
io_ungetbyte(VALUE str, rb_io_t *fptr)
{
// snip...
// ungetbyte just shift the buffer offset, but the FD offset in unchanged
fptr->rbuf.off-=(int)len;
fptr->rbuf.len+=(int)len;
MEMMOVE(fptr->rbuf.ptr+fptr->rbuf.off, RSTRING_PTR(str), char, len);
}
fptr->rbuf.len == 1
, but real FD offset is 0
So we're doing lseek(-1)
which fail with EINVAL
static void
io_unread(rb_io_t *fptr)
{
rb_off_t r;
rb_io_check_closed(fptr);
if (fptr->rbuf.len == 0 || fptr->mode & FMODE_DUPLEX)
return;
/* xxx: target position may be negative if buffer is filled by ungetc */
errno = 0;
// fptr->rbuf.len == 1, but real FD offset is 0
// So we're doing lseek(-1) which fail with EINVAL
r = lseek(fptr->fd, -fptr->rbuf.len, SEEK_CUR);
if (r < 0 && errno) {
if (errno == ESPIPE)
fptr->mode |= FMODE_DUPLEX;
return;
}
fptr->rbuf.off = 0;
fptr->rbuf.len = 0;
return;
}
So I suppose some more tracking info is needed to know that the real FD position and the buffer offset are desynced.
Updated by byroot (Jean Boussier) 8 days ago
Just a quick proof of concept that fixes the first case: https://github.com/ruby/ruby/commit/7481a12fef3df934ab0d9db7f8f2d36341a1562e
But I think a lot more codepath would need to consider and update that new offset for the entire class of bug to be fixed.
Updated by javanthropus (Jeremy Bopp) 7 days ago
I think your code change highlights another bug caused by the current behavior where IO#pos
can report negative values. Oddly, IO#seek(0, :CUR)
still returns 0:
require 'tempfile'
Tempfile.open do |f|
f.write('0123456789')
f.rewind
f.ungetbyte(97)
f.pos # => -1
f.seek(0, :CUR) # => 0
end
Note that IO#pos
works correctly when used with IO#ungetc
while transcoding since that cases causes an entirely different buffer to be used which is currently ignored by the seeking functions. As demonstrated in the issue description though, that buffer isn't ever cleared when using the seeking functions.
Conceptually, it makes sense to me that the seeking functions should only care about bytes from the underlying stream since that's what they operate on. They should ignore read buffer manipulation by IO#ungetbyte
and IO#ungetc
since the data pushed by those methods have no relationship to the bytes of the stream. What I don't see anywhere I've looked is a statement regarding how IO#ungetbyte
and IO#ungetc
should interact with seeking operations. The existing specs and docs don't seem to cover those cases.
It would be great to get clarification here before working on solutions. While I think the best solution would be to disregard the bytes added by IO#ungetbyte
and IO#ungetc
and to clear the relevant buffers when seeking, I can imagine others may prefer to preserve the buffers. Maybe the solution is to leave the behavior deliberately undefined and just warn people against mixing these methods via documentation.
Updated by byroot (Jean Boussier) 7 days ago
I'll add this to the next developer meeting.
Updated by javanthropus (Jeremy Bopp) 7 days ago
Considering a parallel, the manpage for the fseek(3) function clearly states that the effects of ungetc(3) are undone on successful calls to fseek(3).
Updated by nobu (Nobuyoshi Nakada) 6 days ago
The buffers and Encoding::Converter
s should be discarded at positioning, we think.
Updated by byroot (Jean Boussier) 6 days ago
Right, but just discarding the buffers isn't enough, because in the case of SEEK_CUR
, you need to know exactly how much ahead of the expected cursor you really are.
The only solution I see is to keep a count of how many bytes were added through ungetc
etc, which is what I outlined in https://github.com/ruby/ruby/commit/7481a12fef3df934ab0d9db7f8f2d36341a1562e
Updated by nobu (Nobuyoshi Nakada) 6 days ago
- Status changed from Open to Closed
Applied in changeset git|ee29aade1a432cbd5e4d5ec6ea6ccb4fa87361be.
[Bug #20869] Discard read buffer and encoding converters at seeking
Updated by javanthropus (Jeremy Bopp) 6 days ago
The documentation that was added says that IO#tell
and IO#pos
would clear the buffers, but they appear to have a special code path now to avoid it. IO#seek(0, :CUR)
doesn't share this behavior, and that's a curious inconsistency.
Updated by nobu (Nobuyoshi Nakada) 5 days ago
The documentation is outdated, I'll fix it.
Since io.pos
(not assignment) looks mere attribute, differentiated from seek
.
Doesn't this make sense?
Updated by javanthropus (Jeremy Bopp) 5 days ago
Since io.pos (not assignment) looks mere attribute, differentiated from seek.
If not for the fact that IO#seek
always returns 0 regardless of its arguments (something I've never understood), IO#pos
could be implemented as IO#seek(0, :CUR)
. Why not avoid busting the buffer in that case? On the other hand, why not simplify the implementation and bust the buffer in all cases? Maybe I'm too hung up on implementation and am unable to see IO#pos
as merely an attribute.
I also just installed the latest master branch build to check the changes, and there are still a couple of issues:
- It's still possible for
IO#pos
to return negative values:require 'tempfile' Tempfile.open do |f| f.write('0123456789') f.rewind f.ungetbyte(97) f.pos # => -1 end
- The character buffer isn't cleared when transcoding and seeking without first calling
IO#getc
:require 'tempfile' Tempfile.open(encoding: 'utf-8:utf-16le') do |f| f.write('0123456789') f.rewind f.ungetc('a'.encode('utf-16le')) # Character buffer will not be cleared f.seek(2, :SET) f.getc # => 'a'.encode('utf-16le'); should be '2'.encode('utf-16le') end
Updated by javanthropus (Jeremy Bopp) about 20 hours ago
I found another issue while looking for more sharp edges on this, and I've opened #20889 as a result.