Bug #16997

IO#gets converts some \r\n to \n with universal_newline: false

Added by scivola20 (sciv ola) over 4 years ago. Updated over 4 years ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin17]
[ruby-core:98989]

Description

Reproduction code:

IO.binwrite "t.csv", ("a" * 100 + "\r\n") * 100
File.open("t.csv", encoding: "BOM|UTF-8", universal_newline: false) do |input|
  p input.gets(nil, 32 * 1024) # => "a...a\n...\na...a\r\n...\r\n"
end

It causes MalformedCSVError when opening a CSV file with `encoding: "BOM|UTF-8"`:
https://github.com/ruby/csv/issues/147
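
To see which line endings are affected, the reproduction can be extended to record the byte offset of every "\n" that is not preceded by "\r". This is only a diagnostic sketch: `bare_newline_offsets` and `read_with` are hypothetical helper names, not part of Ruby or the csv gem, and the plain "UTF-8" run is included purely for comparison. What it prints depends on the platform and Ruby version.

```ruby
require "tmpdir"

# Hypothetical helper: byte offsets of every "\n" that is not preceded by "\r".
def bare_newline_offsets(data)
  offsets = []
  # Work on a binary copy so that match positions are byte offsets.
  data.b.scan(/(?<!\r)\n/) { offsets << Regexp.last_match.begin(0) }
  offsets
end

# Hypothetical helper: reads the file the same way the reproduction code does.
def read_with(path, encoding)
  File.open(path, encoding: encoding, universal_newline: false) do |input|
    input.gets(nil, 32 * 1024)
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "t.csv")
  IO.binwrite(path, ("a" * 100 + "\r\n") * 100)

  ["BOM|UTF-8", "UTF-8"].each do |encoding|
    offsets = bare_newline_offsets(read_with(path, encoding))
    puts "#{encoding}: #{offsets.size} bare \\n, last at byte #{offsets.last.inspect}"
  end
end
```

If no conversion takes place, every "\n" follows a "\r", so the count is 0 and the last offset is reported as nil.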

Updated by jeremyevans0 (Jeremy Evans) over 4 years ago

I'm able to reproduce this issue on Windows (ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x64-mingw32]), but not on OpenBSD (probably expected).

On Windows, this doesn't just affect IO#gets; it also affects IO#read and likely other IO methods for reading. From some testing, it appears that the \r\n -> \n newline translation is performed on the first 8KB read, and that it is specific to BOM|UTF-8: it doesn't happen with just UTF-8. 8KB happens to be IO_RBUF_CAPA_MIN. My guess is that, because of the BOM detection, the initial 8KB gets buffered before the universal newline setting is applied. Assuming that is the issue, there may be a couple of possible solutions:

  • Apply the universal newline setting before the BOM detection (seems best).
  • Clear the buffer after the BOM detection and set the current file position to directly after the BOM. The next read would then fill the buffer and hopefully work correctly.

Unfortunately, I don't have a Windows development environment for Ruby, so I can't currently do more than speculate.
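
Until the cause is confirmed, the idea behind the second option (skip the BOM, then read the rest untouched) can be approximated in user code. The sketch below is not the proposed change to Ruby's IO internals. It is a workaround that assumes the file fits in memory and avoids text-mode reading entirely, so no newline conversion can occur. `UTF8_BOM` and `read_utf8_skipping_bom` are hypothetical names introduced for illustration.

```ruby
require "tmpdir"

# The three bytes of a UTF-8 byte order mark, as a binary string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: reads the whole file in binary mode, drops a leading
# UTF-8 BOM if there is one, and tags the remaining bytes as UTF-8.
# Line endings are never touched because no text-mode read happens.
def read_utf8_skipping_bom(path)
  data = File.binread(path)
  data = data.byteslice(UTF8_BOM.bytesize..) if data.start_with?(UTF8_BOM)
  data.force_encoding(Encoding::UTF_8)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "t.csv")
  body = ("a" * 100 + "\r\n") * 100
  IO.binwrite(path, UTF8_BOM + body)

  text = read_utf8_skipping_bom(path)
  puts "encoding: #{text.encoding}"
  puts "content preserved: #{text == body}"
end
```

The returned string could then be handed to a parser directly, at the cost of giving up streaming reads.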
