Bug #11410
closed
Win32 Registry enumeration performs unnecessary string re-encoding, which causes UndefinedConversionError exceptions
Description
When enumerating keys with Win32::Registry#each_key / Win32::Registry#keys, or values with Win32::Registry#each_value / Win32::Registry#values, Ruby takes a UTF-16LE string returned from the Windows API and converts it to the local codepage. In the case of each_value, the string is then immediately converted back to UTF-16LE before being used in subsequent Windows API calls. Not only is this conversion unnecessary, but it may also raise encoding exceptions when the local codepage does not support all of the characters present in the original Unicode string.
One such example is a string containing a Unicode en-dash (U+2013) when the local codepage is IBM437, which has no equivalent character. This is just one of many cases that may trigger this behavior.
[1] pry(main)> RUBY_VERSION
=> "2.1.5"
[2] pry(main)> ENDASH_UTF_16 = [0x2013]
=> [8211]
[3] pry(main)> utf_16_str = ENDASH_UTF_16.pack('s*').force_encoding(Encoding::UTF_16LE)
=> "\u2013"
[4] pry(main)> utf_16_str.encode(Encoding::IBM437)
Encoding::UndefinedConversionError: U+2013 to IBM437 in conversion from UTF-16LE to UTF-8 to IBM437
from (pry):4:in `encode'
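The contrast between the two conversions can be sketched as a small script. This is only an illustration: the helper names en_dash_wide and convertible? are hypothetical and not part of Win32::Registry. It uses the 'v*' pack directive, which is explicitly little-endian, so the result does not depend on host byte order.

```ruby
# Build an en-dash (U+2013) as UTF-16LE; 'v' packs 16-bit little-endian values.
# (Hypothetical helper, not part of Win32::Registry.)
def en_dash_wide
  [0x2013].pack('v*').force_encoding(Encoding::UTF_16LE)
end

# Returns true when str can be converted to the given encoding without
# hitting a character that the target encoding cannot represent.
def convertible?(str, encoding)
  str.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

wide = en_dash_wide
puts convertible?(wide, Encoding::IBM437) # legacy codepage conversion
puts convertible?(wide, Encoding::UTF_8)  # Unicode-to-Unicode conversion
# Going through UTF-8 and back reproduces the original UTF-16LE string.
puts wide.encode(Encoding::UTF_8).encode(Encoding::UTF_16LE) == wide
```

The IBM437 conversion fails for the same reason as in the pry session above, while the UTF-8 round trip preserves the string.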
NOTE: Normal registry reads of a value at a particular key are not problematic - the bad behavior is triggered specifically during enumeration.
This is primarily a result of the export_string function, which re-encodes strings:
https://github.com/ruby/ruby/blob/ruby_2_1/ext/win32/lib/win32/registry.rb#L894-L896
It is used by each_value and each_key, which obtain UTF-16LE strings from the Windows API:
https://github.com/ruby/ruby/blob/ruby_2_1/ext/win32/lib/win32/registry.rb#L561
https://github.com/ruby/ruby/blob/ruby_2_1/ext/win32/lib/win32/registry.rb#L598
In the each_value method, this LOCALE-encoded string is then passed to the read method, where it is converted back into a UTF-16LE string to be passed to RegQueryValueExW:
https://github.com/ruby/ruby/blob/v2_1_5/ext/win32/lib/win32/registry.rb#L563
https://github.com/ruby/ruby/blob/v2_1_5/ext/win32/lib/win32/registry.rb#L631
https://github.com/ruby/ruby/blob/v2_1_5/ext/win32/lib/win32/registry.rb#L307
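As a rough model of the flow described above (not the actual registry.rb code), the following sketch compares the round trip through a locale codepage with simply keeping the wide string. All three method names are hypothetical, and enumerated_name merely stands in for a name that a call such as RegEnumValueW might return.

```ruby
# Hypothetical stand-in for a value name returned by the Windows API:
# a UTF-16LE string containing an en-dash.
def enumerated_name
  "Install \u2013 Path".encode(Encoding::UTF_16LE)
end

# Round-trip path: UTF-16LE -> locale codepage -> UTF-16LE.
# Raises Encoding::UndefinedConversionError when the codepage lacks a character.
def name_via_locale(wide, locale)
  wide.encode(locale).encode(Encoding::UTF_16LE)
end

# Direct path: hand the wide string to the follow-up API call unchanged.
def name_direct(wide)
  wide
end

wide = enumerated_name
begin
  name_via_locale(wide, Encoding::IBM437)
rescue Encoding::UndefinedConversionError => e
  puts "round trip failed: #{e.message}"
end
puts name_direct(wide) == wide
```

Under these assumptions the direct path never touches the locale, so it cannot fail regardless of which codepage is active.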
Inside Puppet, we employed a workaround that avoids Ruby's Win32::Registry when performing enumeration and relies on internal helpers instead (avoiding the unnecessary string re-encoding). This was unfortunate, but necessary:
https://github.com/puppetlabs/puppet/commit/c610cd01eeef3fafa7aa2761a3435dd6c1b0d8d4
Note also that we typically convert UTF-16LE strings to UTF-8 internally (since this is almost always a lossless conversion) until we reach an end-user boundary where a specific encoding must be rendered. For instance, our version of read converts to UTF-8:
https://github.com/puppetlabs/puppet/blob/c610cd01eeef3fafa7aa2761a3435dd6c1b0d8d4/lib/puppet/util/windows/registry.rb#L211-L214
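A minimal sketch of that approach, assuming two hypothetical helpers (to_internal and render_for) that are not taken from the Puppet code base: strings stay in UTF-8 internally, and a legacy encoding is applied only at the output boundary. Substituting '?' for unrepresentable characters is an assumption made here for illustration; a real boundary might prefer to raise or to choose a different replacement.

```ruby
# Convert a UTF-16LE string from the Windows API to UTF-8 for internal use.
# (Hypothetical helper name.)
def to_internal(wide)
  wide.encode(Encoding::UTF_8)
end

# Render internal text in a specific target encoding at the end-user boundary.
# Assumption: characters the target cannot represent are replaced with '?'.
def render_for(text, target)
  text.encode(target, invalid: :replace, undef: :replace, replace: '?')
end

wide = "a\u2013b".encode(Encoding::UTF_16LE)
internal = to_internal(wide)
puts internal.encoding
puts render_for(internal, Encoding::IBM437).encoding
```

Deferring the lossy step keeps every intermediate operation safe, and only the final rendering has to decide how to handle characters the target codepage lacks.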
I suggest that other locations where strings are re-encoded be examined for potential issues, as locale codepage conversions are generally considered dangerous given that Win32 APIs use UTF-16LE.