Feature #6321

Find and repair bad bytes in encodings, without transcoding

Added by jonathan rochkind almost 2 years ago. Updated 12 months ago.

[webmaster@ruby-lang.org:<unknown>]
Status:Closed
Priority:Normal
Assignee:Yui NARUSE
Category:-
Target version:next minor

Description

If I use the String#encode feature to transcode from one encoding to another, then bad (invalid) bytes in the source encoding will raise, or else I can pass in :invalid and :replace options to tell it to do something different with bad bytes in the source encoding.

Sometimes I do not want to transcode to a new encoding. I have a string which, ought to be, say, UTF-8

string = something.force_encoding("UTF-8")

However, like all input from an external source that I don't have complete control over, it's possible that it contains invalid bytes. I'd like to check it right away, sometimes raising right away, sometimes using :invalid/:replace functionality similar to String#encode.

As far as I can tell, ruby gives me no way to do it. This does not work, it's a no-op even when there are invalid bytes:

string.encoding => UTF-8
string.encode("UTF-8") # Does NOT raise even if there are bad bytes
string.encode("UTF-8", :invalid => :replace) # Does NOT replace bad bytes

So this is a feature request for a built-in way to do this. It is actually a pretty common thing to want to do, sometimes strings come from external sources that are not want they claim they are; it's very useful to be able to check/validate them, and possibly repair them, right away, rather than waiting for an "invalid byte sequence" error to crop up at some indeterminate point in the future.

I don't know if this functionality should be provided by String#encode as above, even when the target encoding is the same as the destination encoding. Or if it needs to be a new method name, say #validate_encoding. Either way is fine with em.

Here's a pure-ruby partial implementation showing what I need, but it's not as full-featured as the relevant functions in #encode for trans-coding, and it's probably much much slower too. This ought to be built-in, and in C.

https://gist.github.com/2416043


Related issues

Related to ruby-trunk - Feature #6752: Replacing ill-formed subsequencce Closed 07/19/2012
Related to ruby-trunk - Bug #7967: String#encode invalid: :replace doesn't replace invalid c... Rejected 02/26/2013

Associated revisions

Revision 40391
Added by Yui NARUSE 12 months ago

Add example for String#scrub

[Feature #6321] [Feature #6752] [Bug #7967]

History

#1 Updated by Hiroshi Nakamura almost 2 years ago

Administrator, can you move this ticket to ruby-trunk?

#2 Updated by Shyouhei Urabe almost 2 years ago

  • Project changed from 9 to ruby-trunk
  • Priority changed from Normal to High

moved.

#3 Updated by Yusuke Endoh almost 2 years ago

  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE
  • ruby -v set to -

#4 Updated by Yui NARUSE almost 2 years ago

  • Tracker changed from Bug to Feature
  • Status changed from Assigned to Feedback

What is the use case?

I have an idea like String#validate which behaves as you said.
But I doubt such method introduces a vulnerability with unexpected behavior or some attacks.
Its correct behavior seems more difficult than we think.
Until the use case really needs such method, I think an application should simply raise error.

#5 Updated by jonathan rochkind almost 2 years ago

I think the use case is very common -- it is for me anyway, I think I"m not unique!

I am taking in input from an external source which I believe is UTF-8 (for example, could be any encoding). It is advertised as UTF8, as far as I know it is.

But, like all input, it can not be guaranteed reliable. It may have errors in it, or corrupt bytes. It may have been mis-entered at some point in history, or have had it's encoding mis-represented. Bad data exists.

Right now, I take in this data, and call force_encoding on it. There is no immediate exception. But at some indeterminate point in the future, an exception will be raised, virtually at any point in my program's execution. This is difficult to work with.

Instead, I want to take action immediately upon the string entering my program. Sometimes I'll want an exception to be raised right away, true. But sometimes instead, I'll want to replace invalid bytes with a replacement char -- it depends on the context of my application.

I could explain the particular context I am personally wanting to use replacement chars -- but I think it is a general thing people will want to do in a variety of contexts.

This is why String#encode includes the ":replace=>:invaid" option, right? Note that option in String#encode is for invalid bytes in the source, not for inability to transcode (:undef => :replace) is for that. If you might want to do this when transcoding (and indeed I think people often do), then why wouldn't you sometimes want to do it without transcoding too? (I think it is quite common).

Likewise, I am not familiar with the internal implementation of String#encode with :invalid => :replace. But if it's possible for String encode to identify invalid bytes in the source encoding when transcoding, and do it sufficiently correctly, it seems like it should be possible to do it without transcoding too. Step through all the bytes, do exactly whatever String#encode is doing to identify illegal bytes in source encoding, and just omit the subsequent transcode step.

The implemetnation in String#encode, but applied without transcoding, is exactly what would be appropriate for consistency.

When considering API, In addition to a validate method, perhaps these could be options on force_encoding instead, since the point you'll want to do this is almost always at the point you are force_encoding:

string.force_encoding(:invalid => :raise)   # since :ignore or `nil` are the defaults
string.force_encoding(:invalid => :replace)  # consistent arg with String#encode
string.force_encoding(:invalid => :replace, :replace => "*") # consistent arg with String#encode

#6 Updated by Yui NARUSE almost 2 years ago

jrochkind (jonathan rochkind) wrote:

I think the use case is very common -- it is for me anyway, I think I"m not unique!

I am taking in input from an external source which I believe is UTF-8 (for example, could be any encoding). It is advertised as UTF8, as far as I know it is.

But, like all input, it can not be guaranteed reliable. It may have errors in it, or corrupt bytes. It may have been mis-entered at some point in history, or have had it's encoding mis-represented. Bad data exists.

Right now, I take in this data, and call force_encoding on it. There is no immediate exception. But at some indeterminate point in the future, an exception will be raised, virtually at any point in my program's execution. This is difficult to work with.

Instead, I want to take action immediately upon the string entering my program. Sometimes I'll want an exception to be raised right away, true. But sometimes instead, I'll want to replace invalid bytes with a replacement char -- it depends on the context of my application.

Raising exception as soon as possible sounds reasonable.
But replacing is not.
In principle, such bad data should cause error.
I want the use case when you want not an error but replacing.

#7 Updated by Yui NARUSE almost 2 years ago

  • Priority changed from High to Normal

#8 Updated by jonathan rochkind almost 2 years ago

Raising exception as soon as possible sounds reasonable.
But replacing is not.
In principle, such bad data should cause error.
I want the use case when you want not an error but replacing.

Why do you support :invalid => :replace on a String#encode operation? It's exactly the same use cases. Note that :invalid => :replace on String#encode applies to bytes that were invalid in the original source encoding, not bytes that could not be transcoded to the destination encoding. Your argument would apply there too, and say this functionality should be removed there too.

But that is not what I'm suggesting! I think it's useful there, and it's useful here too.

What does your unix terminal under bash do, if it believes it's in UTF8 mode, and you ask it to cat a file to screen which includes bytes which are illegal UTF8? It does not refuse to display teh file -- it displays the file, replacing bad bytes with a replacement char.

What does vim do if you open up a file in UTF8 mode, which includes some bytes which are invalid in UTF8? It does not refuse to show you the file at all. It displays the file, replacing bad bytes with a replacement char.

This is a very common thing to do.

I have a lot of cases where I'd want to do it. In one of them, I am accessing a third party API. Everything the third party API returns is supposed to be UTF8. But sometimes it has illegal bytes in it anyway, due to various errors on their end. Yes, I want to log the error and let the third party provider know about it. But in the meantime, for my application, it's a much better failure mode to be able to use most of the API response, then to throw it out entirely because it had a couple bad bytes in it.

In another one, I'm reading files in from a legacy library (the kind with books) file format. The files I am reading in are supposed to have their textual payloads in UTF8 (after being translated from a legacy encoding not supported in ruby), but again, due to various problems upstream sometimes they have bad bytes in them. In these occasional failure cases, much of the file still has valid useful data in it -- it's just got a few bad bytes. Yes, I could check #validencoding? and simply throw out the file entirely. But it's much better for my application to recover from the bad bytes by replacing them with a replacement char, instead of refusing to process the file at all. At _least to them be able to tell the operator something about the file that had an error, displaying what can be displayed.

#9 Updated by jonathan rochkind almost 2 years ago

PS: I am working on simple not-quite-yet-released gem to do this, as best as I could figure out to do in pure ruby, for those who, like, me, do need it. https://github.com/jrochkind/ensure_valid_encoding

#10 Updated by jonathan rochkind almost 2 years ago

Another way to solve this, rather than add a new method, could be making String#encode with :invalid => :replace option work even when not changing encoding.

Right now:

# \xDF is not a valid byte in UTF-8
bad_bytes_in_utf8 = "abc\xDFf"

# if we are transcoding/converting to a different encoding, we can ask
# ruby to fix it:

fixed = bad_bytes_in_utf8.encode("iso-8859-1", :invalid => :replace)
# =>  "abc?f"  , invalid byte has been replaced with default replacement char
fixed.encoding
#=> #<Encoding:ISO-8859-1>

Okay, but what if we didn't want to change the encoding (which also would have transcoded other non-7-bit chars)?

bad_bytes_in_utf8.encoding   # => UTF-8
bad_bytes_in_utf8.encode("UTF-8", :invalid => :replace)
# => "abc\xDFf", it was a no-op, since we told ruby
# to convert from UTF-8 to UTF-8, it did nothing, the 
# :invalid => :replace option was irrelevant and ignored. 

You could instead fix encode from encoding to same encoding to
still respect :invalid => :replace, that would be another way
to provide API to accomplish this same thing.

However, I think the 'fail quick' option is also useful, but can
be done pretty easily in pure ruby right now. raise Whatever unless x.valid_encoding?

#11 Updated by Misty De Meo almost 2 years ago

I agree with jrochkind - I think this would be a very useful feature to have. It's not uncommon, when working with dirty source data, to have text which is almost but not quite correct. In circumstances where it's possible to be confident that the rest of the string are valid and useful, it would be good to have a simple method to replace invalid characters without changing encodings.

#12 Updated by Ravil Bayramgalin over 1 year ago

I have stumbled upon this issue too. The same use-case as above, I have untrusted utf-8 files which I need to fix. Currently there is a messy workaround with replacing invalid characters manually, but it would be so much better and more intuitive if #encode to the same encoding with method arguments would work instead of being no-op.

#13 Updated by Charles Nutter over 1 year ago

I ran into this and filed #7282. MRI allows bad byte sequences to parse as UTF-8, and then subsequent transcode to UTF-8 (no-op) does not catch them. As a result, they can propagate until they hit some method that actually does verify the bytes, like regexp matching or a non-no-op transcoding operation.

For the moment we will probably modify JRuby to also no-op same-encoding transcoding, but if MRI adds a mechanism for verifying a String's contents are properly encoded, we're happy to do the same.

#14 Updated by Yusuke Endoh over 1 year ago

  • Target version set to 2.0.0

#15 Updated by Erik Peterson over 1 year ago

naruse (Yui NARUSE) wrote:

What is the use case?

I have an idea like String#validate which behaves as you said.
But I doubt such method introduces a vulnerability with unexpected behavior or some attacks.
Its correct behavior seems more difficult than we think.
Until the use case really needs such method, I think an application should simply raise error.

Hopefully the given use-cases are sufficient to get this feature into Ruby 2.0, but here's one that I have:

In my applications, I process output that is produced by malware. This output is sometimes intentionally malformed to contain bad UTF-8 bytes. I still must be able to capture and analyze these strings, doing my best to work around the bad bytes. In my current applications there is a 10-line method which replaces badly-encoded bytes (using eachchar and validencoding? on that char), but this approach is fragile at best. It would be very nice to have the core language have an option to easily deal with malformed UTF-8 strings.

The solution to have ":invalid => :replace" perform the replacement, even if the source and destination encodings are the same, would be sufficient for me.

#16 Updated by Yui NARUSE about 1 year ago

  • Target version changed from 2.0.0 to next minor

#17 Updated by jonathan rochkind about 1 year ago

Turns out this is already built into stdlib, and has been in 1.9.3 too!

It took me nearly a year to realize it was, and hardly anyone seems aware of this! But it is.

# create a bad string, but in real life this would come in as
# input, you'd never intentionally create a bad string
a = "bad: \xc3\x28 okay".force_encoding("utf-8")  

# now replace bad bytes with replacement char
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

That's exactly what I needed. Or more generally:

 # replace bad bytes and make good, without trans-coding
 str.encode( str.encoding, "binary", :undef => :replace)  

You need to pass in "binary" as the "source encoding" to #encode, to get the stdlib to replace bad bytes without a trans-code. If you just do str.encode( str.encoding, :undef => :replace), it's always a no-op. You need to say the source encoding is "binary" to make it do what I want. Which actually doesn't make a whole lot of sense -- what does a transcode from 'binary' to an encoding mean? Well, it doesn't mean anything -- but in the stdlib, it means "okay, now i've been told to replace bad bytes".

So the functionality is already there, great!

I hope it's intentional and spec'd, and not accidental, and won't go away. It would be nice if it were documented, as this is a very little known feature.

#18 Updated by Yui NARUSE 12 months ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r40391.
jonathan, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


Add example for String#scrub

[Feature #6321] [Feature #6752] [Bug #7967]

Also available in: Atom PDF