Bug #16997

IO#gets converts some \r\n to \n with universal_newline: false

Added by scivola20 (sciv ola) 7 months ago. Updated 5 months ago.

Target version:
ruby -v:
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin17]


Reproduction code:

IO.binwrite "t.csv", ("a" * 100 + "\r\n") * 100
File.open("t.csv", encoding: "BOM|UTF-8", universal_newline: false) do |input|
  p input.gets(nil, 32 * 1024) # => "a...a\n...\na...a\r\n...\r\n"
end

This causes a MalformedCSVError when opening a CSV file with `encoding: "BOM|UTF-8"`.

Updated by jeremyevans0 (Jeremy Evans) 5 months ago

I'm able to reproduce this issue on Windows (ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x64-mingw32]), but not on OpenBSD (probably expected).

On Windows, this affects not just IO#gets but also IO#read, and likely other IO methods for reading. From some testing, it appears that the first 8KB read has the \r\n -> \n newline translation performed. It is specific to BOM|UTF-8; it doesn't happen with plain UTF-8. 8KB happens to be IO_RBUF_CAPA_MIN. My guess is that the initial 8KB gets buffered before the universal newline setting is applied, because the BOM detection runs first. Assuming that is the issue, there are a couple of possible solutions:
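To check how far the translation reaches on a given platform, a small probe along these lines might help. It is only a sketch: the helper names newline_counts and probe are made up for illustration, and the file contents mirror the reproduction code above. On an unaffected platform the bare-LF count should be 0. On an affected one, the offset of the first surviving \r\n gives a rough idea of where the translation stops.

```ruby
require "tmpdir"

# Hypothetical helper: count CRLF pairs and bare LFs (an LF not preceded by CR).
def newline_counts(str)
  {
    crlf: str.scan(/\r\n/).size,
    bare_lf: str.scan(/(?<!\r)\n/).size,
  }
end

# Hypothetical helper: write the file from the reproduction code, read it back
# with the given encoding, and report which newline styles survived.
def probe(encoding)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "t.csv")
    IO.binwrite(path, ("a" * 100 + "\r\n") * 100)
    File.open(path, encoding: encoding, universal_newline: false) do |input|
      data = input.read
      newline_counts(data).merge(first_crlf_at: data.index("\r\n"))
    end
  end
end

# Compare plain UTF-8 with BOM|UTF-8 on the current platform.
["UTF-8", "BOM|UTF-8"].each do |enc|
  puts "#{enc}: #{probe(enc).inspect}"
end
```

If the guess about the 8KB buffer is right, the BOM|UTF-8 line on Windows would show a nonzero bare_lf count, while the plain UTF-8 line would not.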

  • Apply the universal newline setting before the BOM detection (seems best).
  • Clear the buffer after the BOM detection and set the current file position to directly after the BOM. The next read would then fill the buffer and hopefully work correctly.
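Until this is fixed in the IO layer, a script-level workaround loosely modeled on the second option could look like the sketch below. It is not the proposed C-level fix. It assumes that opening the file in binary mode avoids the translation, and the helper name open_skipping_bom is made up for illustration. The helper reads the first three bytes itself, rewinds if they are not a UTF-8 BOM, and only then tags the stream as UTF-8.

```ruby
require "tmpdir"

# The three bytes of a UTF-8 byte order mark, as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: open in binary mode (no newline translation), skip a
# leading UTF-8 BOM by hand, then yield the IO with UTF-8 as its encoding.
def open_skipping_bom(path)
  File.open(path, "rb") do |io|
    head = io.read(3)                  # nil or shorter than 3 bytes for tiny files
    io.rewind unless head == UTF8_BOM  # no BOM: start again from byte 0
    io.set_encoding(Encoding::UTF_8)
    yield io
  end
end

# Usage with the file from the reproduction code, here with a BOM prepended.
Dir.mktmpdir do |dir|
  path = File.join(dir, "t.csv")
  IO.binwrite(path, UTF8_BOM + (("a" * 100 + "\r\n") * 100).b)
  open_skipping_bom(path) do |input|
    data = input.gets(nil, 32 * 1024)
    puts data.scan(/\r\n/).size
  end
end
```

The trade-off is that BOM detection for other encodings (UTF-16, UTF-32) is not handled. This covers only the BOM|UTF-8 case from the report.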

Unfortunately, I don't have a Windows development environment for Ruby, so I can't currently do more than speculate.
