Bug #14127
Closed
(CSV) generating UTF-16LE encoded file without BOM
Description
This file should contain BOM information so that it is properly detected as a UTF-16LE file.
How to generate such file:
file = CSV.generate(encoding: 'UTF-16LE') do |csv|
  csv << ['something', 'ľščťžýáíé']
end
According to file -I file.csv this file is recognized as application/octet-stream; charset=binary because it is missing the BOM information.
According to Wikipedia https://en.wikipedia.org/wiki/UTF-16 it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.
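As a quick check of the byte sequence mentioned above, here is a minimal sketch (the variable names are only illustrative) that encodes U+FEFF in both UTF-16 byte orders and prints the resulting bytes:

```ruby
# U+FEFF is the byte order mark; its byte sequence depends on the encoding.
bom_le = "\uFEFF".encode("UTF-16LE")
bom_be = "\uFEFF".encode("UTF-16BE")

p bom_le.bytes # => [255, 254], i.e. "\xFF\xFE"
p bom_be.bytes # => [254, 255], i.e. "\xFE\xFF"
```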
Here is someone trying to fix this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did it the same way: manually adding that BOM information.
## Adds BOM, albeit in a somewhat hacky way.
new_html_file = File.open("foo.txt", "w:UTF-8")
new_html_file << "\xFF\xFE".force_encoding('utf-16le') + some_text.force_encoding('utf-8').encode('utf-16le')
Updated by nobu (Nobuyoshi Nakada) almost 8 years ago
laykou (Ladislav Gallay) wrote:
This file should contain BOM information so that it is properly detected as a UTF-16LE file.
How to generate such file:
file = CSV.generate(encoding: 'UTF-16LE') do |csv|
  csv << ['something', 'ľščťžýáíé']
end
csv.rb seems to have bugs in its support for ASCII-incompatible encodings.
According to file -I file.csv this file is recognized as application/octet-stream; charset=binary because it is missing the BOM information.
According to Wikipedia https://en.wikipedia.org/wiki/UTF-16 it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.
CSV.generate just builds a CSV string; it doesn't create a file.
Writing the result to a file with a BOM is the application's responsibility.
CSV.open("utf16.csv", "w:UTF-16LE:utf-8") do |csv|
  csv.to_io.write "\uFEFF"
  csv << ['something', 'ľščťžýáíé']
end
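For the CSV.generate case itself, one possible approach is sketched below. It assumes the CSV is first built as a UTF-8 string and only then transcoded; the file name and the temporary directory are hypothetical, chosen just to keep the example self-contained.

```ruby
require "csv"
require "tmpdir"

# Build the CSV as a UTF-8 string; no file is involved yet.
csv_text = CSV.generate(encoding: "UTF-8") do |csv|
  csv << ["something", "ľščťžýáíé"]
end

# Prepend U+FEFF, then transcode the whole string to UTF-16LE in one step.
payload = ("\uFEFF" + csv_text).encode("UTF-16LE")

# Write the bytes unchanged; binwrite avoids any further transcoding.
written = Dir.mktmpdir do |dir|
  path = File.join(dir, "utf16le.csv") # hypothetical file name
  File.binwrite(path, payload)
  File.binread(path)
end

p written.bytes.first(2) # first two bytes of the file, expected to be the BOM
```

Here the application, not CSV.generate, decides whether a BOM is written, which matches the division of responsibility described above.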
Here is someone trying to fix this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did it the same way: manually adding that BOM information.
new_html_file = File.open("foo.txt", "w:UTF-16LE")
new_html_file << "\uFEFF" << some_text
Updated by hsbt (Hiroshi SHIBATA) over 7 years ago
- Status changed from Open to Assigned
- Assignee set to kou (Kouhei Sutou)
Updated by kou (Kouhei Sutou) over 7 years ago
- Status changed from Assigned to Rejected
nobu has already said most of it.
You should write the BOM yourself when you use CSV.generate.
If you don't want to write the BOM yourself, you should use CSV.open(..., "w:UTF-16"):
CSV.open("utf16.csv", "w:UTF-16:utf-8") do |csv|
  csv << ['something', 'ľščťžýáíé']
end
But it generates big-endian UTF-16.
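The byte order can be checked without touching a file. This small sketch (variable names are illustrative) compares Ruby's "UTF-16" encoding, which adds a BOM and uses big-endian order, with "UTF-16LE", which adds no BOM:

```ruby
text = "a"

utf16   = text.encode("UTF-16")   # BOM added, big-endian byte order
utf16le = text.encode("UTF-16LE") # no BOM, little-endian byte order

p utf16.bytes   # => [254, 255, 0, 97]
p utf16le.bytes # => [97, 0]
```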
Updated by printercu (Max Melentiev) about 7 years ago
WDYT about adding a file_header option or something like this?
It's quite tricky to add this in streaming mode:
CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  bom_written = false
  for_each_row do |row|
    unless bom_written
      csv.to_io.write(BOM)
      bom_written = true
    end
    csv << row
  end
end
Updated by kou (Kouhei Sutou) about 7 years ago
Why do you need to use bom_written?
CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  csv.to_io.write(BOM)
  for_each_row do |row|
    csv << row
  end
end
Updated by printercu (Max Melentiev) about 7 years ago
It behaves differently. In my example the file is empty if csv.<< is never called; in the suggested example it contains the BOM anyway.
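The difference can be reproduced with a small self-contained sketch. The two helper methods and the file names are hypothetical, the headers options are left out for brevity, and the open mode follows the "w:UTF-16LE:utf-8" form used earlier in this thread rather than the 'wb' plus encoding: form:

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: writes the BOM up front, even when there are no rows.
def write_csv_eager_bom(path, rows)
  CSV.open(path, "w:UTF-16LE:UTF-8") do |csv|
    csv.to_io.write("\uFEFF")
    rows.each { |row| csv << row }
  end
end

# Hypothetical helper: writes the BOM only just before the first row.
def write_csv_lazy_bom(path, rows)
  CSV.open(path, "w:UTF-16LE:UTF-8") do |csv|
    bom_written = false
    rows.each do |row|
      unless bom_written
        csv.to_io.write("\uFEFF")
        bom_written = true
      end
      csv << row
    end
  end
end

# Compare the resulting file sizes when there are no rows at all.
sizes = Dir.mktmpdir do |dir|
  eager_path = File.join(dir, "eager.csv")
  lazy_path  = File.join(dir, "lazy.csv")
  write_csv_eager_bom(eager_path, [])
  write_csv_lazy_bom(lazy_path, [])
  { eager: File.size(eager_path), lazy: File.size(lazy_path) }
end

p sizes
```

With no rows, the eager variant should leave a file containing only the BOM, while the lazy variant should leave an empty file.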