Bug #21765
Open
Stop using the C runtime _read() on Windows
Description
When an IO instance is created on Windows, the default data mode is text mode.
In practice, when no encoding conversion is performed, the IO encoding conversion mechanism is bypassed and the CRLF conversion provided by the C runtime's _read() is used instead.
This is explicitly for speed.
https://bugs.ruby-lang.org/issues/6401#note-4
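As a rough illustration of what this text-mode conversion means for the data an IO hands back, here is a minimal Ruby model. It is a simplified sketch, not the C runtime's code: `text_mode_translate` is a hypothetical helper, and it assumes only the commonly described rules (a CRLF pair becomes LF, and Ctrl-Z marks the end of input).

```ruby
# Simplified model of a text-mode read; NOT the actual C runtime code.
CTRL_Z = "\x1A".b

# Hypothetical helper: translate raw file bytes the way a text-mode read
# is commonly described. CRLF becomes LF, and Ctrl-Z ends the input.
def text_mode_translate(raw)
  bytes = raw.b
  eof_at = bytes.index(CTRL_Z)
  bytes = bytes[0, eof_at] if eof_at   # drop Ctrl-Z and everything after it
  bytes.gsub("\r\n".b, "\n".b)         # a lone CR is left untouched
end

raw = "line1\r\nline2\r\n" + "\x1A" + "ignored"
p text_mode_translate(raw)   # => "line1\nline2\n"
```

The translated result is shorter than the raw input, which is the root of the buffer and position mismatches discussed in this issue.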
As a trade-off, SET_BINARY_MODE(fptr) and SET_BINARY_MODE_WITH_SEEK_CUR(fptr) are used in various places within io.c, altering the state of the file descriptor.
This makes the flow of operations difficult to understand and changes hard to implement, especially for developers on other platforms.
Additionally, I discovered the issues I recently reported while verifying the impact of changing the CRLF conversion to use the encoding conversion mechanism:
#21691 On Windows some of binary read functions of IO are not functional
#21687 IO#pos goes wrong after EOF character(ctrl-z) met
#21634 Combining read(1) with eof? causes dropout of results unexpectedly on Windows.
These issues arise because data read into the rbuf does not match the stream due to newline conversion, or because the buffer end and file position do not align when CTRLZ is detected.
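The position mismatch can be shown with simple arithmetic. The sketch below is illustrative only: `positions_after_partial_read` is a hypothetical helper, not a function in io.c, and it assumes the whole file has been read into the buffer and that every LF in the buffer came from a CRLF pair.

```ruby
# Hypothetical helper for illustration; names are not taken from io.c.
# Assumes every "\n" in the translated buffer came from a CRLF pair.
def positions_after_partial_read(raw, consumed)
  rbuf   = raw.b.gsub("\r\n".b, "\n".b)  # translated bytes sitting in the buffer
  fd_pos = raw.bytesize                   # the descriptor has consumed all raw bytes
  unread = rbuf.bytesize - consumed       # translated bytes not yet handed out

  naive = fd_pos - unread                 # treats buffered bytes as file bytes
  # Each LF already handed out stands for two bytes (CR LF) in the file.
  actual = consumed + rbuf.byteslice(0, consumed).count("\n")
  [naive, actual]
end

naive, actual = positions_after_partial_read("a\r\nb\r\nc", 2)
puts "naive=#{naive} actual=#{actual}"   # prints "naive=4 actual=3"
```

After the caller has read "a\n", subtracting the unread buffer length from the descriptor position gives 4, while the true file offset is 3. Any code that seeks back by the buffered length has to compensate for this difference.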
As a fix for Bug #21687, I created PR #15216. However, this relies on the internal behavior of the C runtime's _read() function, and it seems there is no way to avoid this dependency.
I propose removing the use of C runtime _read().
Reason for Proposal
- The mismatch between rbuf and stream contents complicates io_unread() and makes maintenance difficult.
- Changing the O_BINARY/O_TEXT state of the file descriptor in various places hinders understanding of the behavior and makes modifications difficult.
Two methods to remove C runtime _read() while maintaining current behavior
- Interpret CRLF and CTRLZ when reading rbuf within io.c.
- Interpret CRLF and CTRLZ within the encoding conversion framework.
My initial idea was to implement the second, using encoding conversion.
However, this internally changes the read operation from rbuf to cbuf, resulting in a change to the behavior of ungetc.
The proposal in Bug #21682 attempted to generalize this change to minimize its impact.
https://bugs.ruby-lang.org/issues/21682
This issue proposes the first method: CRLF conversion during the rbuf read.
Problems caused by inconsistencies between the rbuf and stream contents are avoided, and io_unread() becomes the same as on other platforms.
Compared to implementing it as an encoding conversion, the advantage is that there is no change in behavior.
On the other hand, this method requires individual handling in each read method in io.c, whereas implementing it as an encoding conversion would keep the changes more localized.
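The idea behind the first method can be sketched in Ruby as follows. This is a minimal model under stated assumptions, not the proposed io.c patch: `RawBufferedReader`, `next_byte`, and `unread` are hypothetical names, a StringIO stands in for the file descriptor, and leaving the Ctrl-Z byte unconsumed is a choice made for this sketch only.

```ruby
require "stringio"

# Hypothetical sketch; class and method names are illustrative, not from io.c.
class RawBufferedReader
  CR, LF, CTRL_Z = "\r".b, "\n".b, "\x1A".b

  def initialize(io, chunk_size = 4096)
    @io = io                  # underlying binary stream
    @chunk_size = chunk_size
    @rbuf = "".b              # untranslated bytes, exactly as in the stream
  end

  # Return the next translated byte as a 1-byte String, or nil at end of input.
  def next_byte
    return nil unless ensure_bytes(1)
    return nil if @rbuf.start_with?(CTRL_Z)   # Ctrl-Z acts as EOF and stays buffered
    if @rbuf.start_with?(CR) && ensure_bytes(2) && @rbuf.byteslice(1, 1) == LF
      @rbuf.slice!(0, 1)                      # drop the CR of a CRLF pair
    end
    @rbuf.slice!(0, 1)
  end

  # The buffer holds raw bytes, so the position needs no correction.
  def pos
    @io.pos - @rbuf.bytesize
  end

  # Give buffered bytes back to the stream by seeking over them.
  def unread
    @io.seek(-@rbuf.bytesize, IO::SEEK_CUR)
    @rbuf.clear
  end

  private

  # Make sure at least n raw bytes are buffered; false if the stream ends first.
  def ensure_bytes(n)
    while @rbuf.bytesize < n
      chunk = @io.read(@chunk_size)
      return false if chunk.nil? || chunk.empty?
      @rbuf << chunk
    end
    true
  end
end

reader = RawBufferedReader.new(StringIO.new("a\r\nb" + "\x1A" + "z"))
p reader.next_byte   # => "a"
p reader.next_byte   # => "\n"
p reader.pos         # => 3
```

Because translation happens only when bytes leave the buffer, `pos` and `unread` use the same arithmetic as on platforms without newline conversion. The cost, as noted above, is that every read path has to apply the translation itself.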