Feature #10740

closed

Base64 urlsafe methods are not urlsafe

Added by dragonsinth (Scott Blum) almost 10 years ago. Updated over 9 years ago.

Status:
Closed
Target version:
-
[ruby-core:67570]

Description

Base64.urlsafe_decode64 is not to spec, because it currently REQUIRES appropriate trailing '=' characters.
Base64.urlsafe_encode64 produces trailing '=' characters.

'=' is not web safe and is not recommended for base64url. Some specs even disallow it.

Suggested fix:

  # Returns the Base64-encoded version of +bin+.
  # This method complies with ``Base 64 Encoding with URL and Filename Safe
  # Alphabet'' in RFC 4648.
  # The alphabet uses '-' instead of '+' and '_' instead of '/'
  # and has no trailing pad characters.
  def urlsafe_encode64(bin)
    strict_encode64(bin).tr("+/", "-_").tr('=', '')
  end

  # Returns the Base64-decoded version of +str+.
  # This method complies with ``Base 64 Encoding with URL and Filename Safe
  # Alphabet'' in RFC 4648.
  # The alphabet uses '-' instead of '+' and '_' instead of '/'.
  # Trailing pad characters are optional.
  def urlsafe_decode64(str)
    str = str.tr("-_", "+/")
    str = str.ljust((str.length + 3) & ~3, '=')
    strict_decode64(str)
  end
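
The ljust line in the suggested decoder rounds the string length up to the next multiple of four: adding 3 and then masking with ~3 clears the two low bits. The sketch below isolates that step. The helper name pad_base64 is hypothetical and used only for illustration; it is not part of lib/base64.

```ruby
require "base64"

# Hypothetical helper (not part of lib/base64): restore the "=" padding
# that an unpadded Base64 string is missing.
def pad_base64(str)
  # (n + 3) & ~3 rounds n up to the next multiple of 4.
  str.ljust((str.length + 3) & ~3, "=")
end

pad_base64("YQ")    # two pad characters are appended
pad_base64("YWI")   # one pad character is appended
pad_base64("YWJj")  # already a multiple of 4, returned unchanged

# The padded form is what the strict decoder expects.
Base64.strict_decode64(pad_base64("YWI"))
```

An input whose length is 1 modulo 4 cannot be valid Base64, so padding it this way still leaves a string that strict_decode64 rejects.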

Files

base64-urlsafe-encode64-search-result.txt (19.9 KB), akr (Akira Tanaka), 01/14/2015 12:44 AM
urlsafe_base64.patch (2.97 KB), mame (Yusuke Endoh), 01/16/2015 01:16 PM

Updated by dragonsinth (Scott Blum) almost 10 years ago

Note that SecureRandom.urlsafe_base64 does the right thing by default, with the note "By default, padding is not generated because "=" may be used as a URL delimiter."

Updated by bascule (Tony Arcieri) almost 10 years ago

I ran into this problem trying to implement RFC6920 in this program:

https://github.com/cryptosphere/cryptor/blob/master/lib/cryptor/encoding.rb#L20

RFC6920 says:

Digest Value:  The digest value MUST be encoded using the base64url
   [RFC4648] encoding, with no "=" padding characters.

RFC4648 (which defines URL-safe Base64) says the following:

Implementations MUST include appropriate pad characters at the end of
encoded data unless the specification referring to this document
explicitly states otherwise.

RFC6920 explicitly says that the padding characters should be removed from the URL-safe Base64 serialization. Per RFC4648, this is allowed, since it is explicitly specified that way in RFC6920.
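
To make the RFC6920 requirement concrete, here is a rough sketch of producing an unpadded base64url digest value. The helper name and the sample input are made up for illustration. The padding is stripped with String#delete so the sketch does not depend on any change to Base64.urlsafe_encode64.

```ruby
require "base64"
require "digest"

# Hypothetical helper: SHA-256 digest of +data+ in unpadded base64url form,
# the shape RFC 6920 asks for in its digest value.
def unpadded_sha256_base64url(data)
  Base64.urlsafe_encode64(Digest::SHA256.digest(data)).delete("=")
end

value = unpadded_sha256_base64url("Hello World!")
value.length         # a 32-byte digest needs 43 characters once the "=" is dropped
value.include?("=")  # no pad character remains
```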

Updated by akr (Akira Tanaka) almost 10 years ago

I like this feature.
(I think this issue is a feature, not a bug.)

However I think the current behavior should be choosable for compatibility.

I searched Base64.urlsafe_encode64 in gems: base64-urlsafe-encode64-search-result.txt
Not all uses remove "=".
I guess some will have problems if we change the behavior.

Updated by mame (Yusuke Endoh) almost 10 years ago

  • Tracker changed from Bug to Feature
  • Status changed from Open to Feedback
  • Assignee set to mame (Yusuke Endoh)

Hello, I'm a maintainer of lib/base64.

I don't think that this is a bug. RFC 4648 is still the latest standard of Base64. (Note that RFC 6920 does not obsolete RFC 4648.) Because lib/base64 is an implementation of Base64, it should comply with RFC 4648, at least, by default. Moving to the feature tracker.

I found Python's ticket about the same issue: http://bugs.python.org/issue1661108
They decided to follow the spec, as-is, even though it looks broken. I respect them.

That being said, I understand that the current behavior is not useful for some people. I don't think it is a good idea to change the behavior, because of the compatibility issue (as akr said), but I'm happy to add something like a "no padding" option. However, RFC 4648 also says:

The pad character "=" is typically percent-encoded when used in an
URI [9], but if the data length is known implicitly, this can be
avoided by skipping the padding; see section 3.2.

I have no idea what it is talking about; the data length is known with or without padding. But spec is spec. According to it, I think urlsafe_decode64 must receive the data length as an argument. I have no idea how the method should handle the argument, though ;-( I'm unsure if this is the right direction.

Related discussion: http://stackoverflow.com/questions/4080988/why-does-base64-encoding-requires-padding-if-the-input-length-is-not-divisible-b

So, I'm uncertain what to do. Any idea?

--
Yusuke Endoh

Updated by bascule (Tony Arcieri) almost 10 years ago

Hi Yusuke,

RFC6920 is just an example of an RFC which refers to RFC4648 and stipulates that something encoded in base64url MUST NOT be padded. According to RFC4648 this is allowed.

Specifically in the case of RFC6920, the data length is known implicitly because we are parsing the data out of a URI.

I don't think there is a need to pass the length in as a parameter. I just think that Base64.urlsafe_decode64 should tolerate unpadded inputs.

Updated by dragonsinth (Scott Blum) almost 10 years ago

I suspect the reason the spec is that way is that it's easier to calculate what the decoded length will be if the encoding is always divisible by 4, since it's just (encoded_len / 4) * 3. It makes more sense in the context of wire protocols such as email MIME, where base64 originally came from. In a language like Ruby, where strings have lengths, the data length is always known, so I suspect it's less relevant.
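
The length arithmetic can be checked with a short sketch. Note that (encoded_len / 4) * 3 is an upper bound; the exact byte count is that value minus the number of trailing pad characters. The helper name decoded_length is hypothetical.

```ruby
require "base64"

# Hypothetical helper: decoded byte count of a correctly padded Base64 string,
# computed from its length and its trailing "=" characters only.
def decoded_length(padded)
  pads = padded[/=*\z/].length
  (padded.length / 4) * 3 - pads
end

["", "a", "ab", "abc", "abcd"].each do |bin|
  encoded = Base64.strict_encode64(bin)
  puts "#{bin.bytesize} bytes -> #{encoded.length} chars, computed #{decoded_length(encoded)}"
end
```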

It is worth noting that SecureRandom.urlsafe_base64 has an optional padding parameter which defaults to false. I think ideally we should follow that example, and default to no padding on the encode side. But if that's too risky we could default padding to true to maintain the current behavior.

On the decoding side, it seems like a no-brainer to be lenient and fill in the proper padding. Otherwise, you have the bizarre situation where:

Base64.urlsafe_decode64(SecureRandom.urlsafe_base64(len)) # raises if len % 3 != 0
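
The situation can be reproduced with a sketch like the one below. It assumes that the decoder under discussion amounts to a strict decode after translating the alphabet; strict_urlsafe_decode64 is a stand-in written for this illustration, not the library method itself, and strict_round_trip? is likewise a hypothetical helper.

```ruby
require "base64"
require "securerandom"

# Assumption: stand-in for a urlsafe decoder with no tolerance for missing
# padding, i.e. a strict decode after mapping "-_" back to "+/".
def strict_urlsafe_decode64(str)
  Base64.strict_decode64(str.tr("-_", "+/"))
end

# Hypothetical helper: true when an unpadded token built from +len+ random
# bytes survives the strict decoder.
def strict_round_trip?(len)
  strict_urlsafe_decode64(SecureRandom.urlsafe_base64(len))
  true
rescue ArgumentError
  false
end

(1..6).each { |len| puts "len=#{len}: #{strict_round_trip?(len)}" }
```

Only lengths divisible by 3 produce a token that needs no padding, which is why the other lengths fail under a strict decoder.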

Updated by mame (Yusuke Endoh) almost 10 years ago

My point is simple: lib/base64 should comply with RFC 4648 as far as possible. Please explain your proposal based on RFC 4648 instead of RFC 6920 (which is NOT a spec of Base64), the behavior of other libraries, etc. If you think RFC 4648 is unreasonable, please take it up with the IETF.

Tony Arcieri wrote:

According to RFC4648 this is allowed.

I know. RFC 6920 makes such an exception. But there is no reason why THIS library should do so. Note that this library is general-purpose, not for a specific use case such as a URL.

Scott Blum wrote:

Otherwise, you have the bizarre situation where:

Base64.urlsafe_decode64(SecureRandom.urlsafe_base64(len)) # raises if len % 3 != 0

The situation itself is unfortunate.

I noticed that RFC 4648 does not mention the case where padding is missing. It only says that the library MAY ignore excess pad characters:

If more than the allowed number
of pad characters is found at the end of the string (e.g., a base 64
string terminated with "==="), the excess pad characters MAY also be
ignored.

So, it might be acceptable to tolerate unpadded input. Of course, we must still care about a compatibility issue.

--
Yusuke Endoh

Updated by bascule (Tony Arcieri) almost 10 years ago

Hi Yusuke,

Perhaps I have introduced confusion by talking about two different RFCs. RFC4648 is the only RFC I care about. I mentioned the other just as an example, because the language in RFC4648 is about how other RFCs can define standards in terms of RFC4648.

The specific text in RFC4648 is here:

"Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise."

RFC4648 makes a very specific allowance for unpadded base64url encoding: any RFC that describes part of its specification in terms of base64url encoding as defined in RFC4648 may choose the unpadded form instead of the padded one. These other RFCs are not defining their own version of base64url encoding. Rather, they are using it in a way that RFC4648 allows, and RFC4648 itself is what they refer to when they use the encoding in an unpadded way.

My interpretation of RFC4648 would suggest this behavior:

  • Base64.urlsafe_encode64(bin) should produce padded output like it does today
  • Base64.urlsafe_decode64(str) should work on both padded and unpadded inputs, because RFC4648 allows other RFCs that implement RFC4648-compliant base64url encoding to explicitly stipulate that there is no padding. RFC4648 specifically makes an exemption for this and it must be supported.

Updated by mame (Yusuke Endoh) almost 10 years ago

Tony Arcieri wrote:

My interpretation of RFC4648 would suggest this behavior:

Base64.urlsafe_encode64(bin) should produce padded output like it does today
Base64.urlsafe_decode64(str) should work on both padded and unpadded inputs,

Thank you, sounds reasonable. I like the behavior of Java's Base64.Decoder:

https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Decoder.html

The Base64 padding character '=' is accepted and interpreted as the end of the encoded byte data, but is not required. So if the final unit of the encoded byte data only has two or three Base64 characters (without the corresponding padding character(s) padded), they are decoded as if followed by padding character(s). If there is a padding character present in the final unit, the correct number of padding character(s) must be present, otherwise IllegalArgumentException ( IOException when reading from a Base64 stream) is thrown during decoding.

How about this?

   # This method complies with ``Base 64 Encoding with URL and Filename Safe
   # Alphabet'' in RFC 4648.
   # The alphabet uses '-' instead of '+' and '_' instead of '/'.
+  # Note that the result can still contain '='.
+  # You can remove the padding by setting "padding" as false.
+  def urlsafe_encode64(bin, padding: true)
+    str = strict_encode64(bin).tr("+/", "-_")
+    str = str.delete("=") unless padding
+    str
   end

   # Returns the Base64-decoded version of +str+.
   # This method complies with ``Base 64 Encoding with URL and Filename Safe
   # Alphabet'' in RFC 4648.
   # The alphabet uses '-' instead of '+' and '_' instead of '/'.
+  #
+  # The padding characters are optional.
+  # This method accepts both correctly-padded and unpadded input.
+  # Note that it still rejects incorrectly-padded input.
+  def urlsafe_decode64(str)
+    str = str.tr("-_", "+/")
+    if !str.end_with?("=") && str.length % 4 != 0
+      str = str.ljust((str.length + 3) & ~3, "=")
+    end
+    strict_decode64(str)
   end
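
A short usage sketch of the behavior this patch describes, assuming a Ruby whose lib/base64 includes it. The rejected? helper is hypothetical and exists only to show which inputs raise.

```ruby
require "base64"

# Hypothetical helper: true when urlsafe_decode64 raises ArgumentError.
def rejected?(str)
  Base64.urlsafe_decode64(str)
  false
rescue ArgumentError
  true
end

Base64.urlsafe_encode64("a")                  # padded by default
Base64.urlsafe_encode64("a", padding: false)  # padding suppressed

Base64.urlsafe_decode64("YQ==")  # correctly padded input is accepted
Base64.urlsafe_decode64("YQ")    # unpadded input is accepted
rejected?("YQ=")                 # incorrectly padded input is still refused
```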

Off topic:

because RFC4648 allows other RFCs that implement RFC4648-compliant base64url encoding to explicitly stipulate that there is no padding.

RFC 4648 says that the encoder MUST NOT add line feeds, unless bla bla:

Implementations MUST NOT add line feeds to base-encoded data unless
the specification referring to this document explicitly directs base
encoders to add line feeds after a specific number of characters.

Also, it says that the decoder MUST reject the input containing line feeds, unless bla bla:

Implementations MUST reject the encoded data if it contains
characters outside the base alphabet when interpreting base-encoded
data, unless the specification referring to this document explicitly
states otherwise.

An RFC4648-compliant encoder WITH the exemption emits data containing line feeds, and an RFC4648-compliant decoder WITHOUT the exemption rejects that data. Which is broken? IMO, RFC 4648 is broken ;-)

--
Yusuke Endoh

Updated by bascule (Tony Arcieri) almost 10 years ago

That looks good to me, thank you!

Updated by dragonsinth (Scott Blum) almost 10 years ago

That looks awesome. I'll update my PR.

Updated by dragonsinth (Scott Blum) almost 10 years ago

Updated https://github.com/ruby/ruby/pull/815 and merged in changes from Yusuke Endoh

Updated by nobu (Nobuyoshi Nakada) almost 10 years ago

Why does urlsafe_decode64 use strict_decode64, but not just unpack("m")?

Updated by mame (Yusuke Endoh) almost 10 years ago

Nobuyoshi Nakada wrote:

Why does urlsafe_decode64 use strict_decode64, but not just unpack("m")?

unpack("m") and Base64.decode64 are based on RFC 2045. unpack("m0"), Base64.strict_decode64, and Base64.urlsafe_decode64 (base64url) are based on RFC 4648.

RFC 2045 allows characters outside the base alphabet, such as CR and LF, and RFC 4648 does not (by default).

--
Yusuke Endoh
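
The difference between the RFC 2045 and RFC 4648 decoders described above can be seen with a small sketch. The strict_rejects? helper is hypothetical; the Base64 and unpack calls are the ones named in the comment.

```ruby
require "base64"

# Hypothetical helper: true when the strict (RFC 4648) decoder rejects +str+.
def strict_rejects?(str)
  Base64.strict_decode64(str)
  false
rescue ArgumentError
  true
end

with_newline = Base64.encode64("ab")  # RFC 2045 style: output ends with a line feed

Base64.decode64(with_newline)         # lenient decoder skips the line feed
with_newline.unpack("m").first        # same lenient behavior via unpack
strict_rejects?(with_newline)         # strict decoder refuses the line feed
strict_rejects?(with_newline.chomp)   # accepted once the line feed is removed
```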

Updated by mame (Yusuke Endoh) almost 10 years ago

  • Status changed from Feedback to Assigned

Thank you all. I'll commit the patch in a few days unless there is objection.

--
Yusuke Endoh

Updated by mame (Yusuke Endoh) almost 10 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

Applied in changeset r49585.


  • lib/base64.rb: make urlsafe mode user-friendly.

  • lib/base64.rb (Base64.urlsafe_encode64): a new option "padding" to
    suppress the padding character ("=").

  • lib/base64.rb (Base64.urlsafe_decode64): now it accepts not only
    correctly-padded input but also unpadded input.
    [Feature #10740][ruby-core:67570]

  • test/base64/test_base64.rb: Test for above

Updated by mame (Yusuke Endoh) almost 10 years ago

Sorry for the late action, I've committed the patch. Thank you!

--
Yusuke Endoh

Updated by headius (Charles Nutter) over 9 years ago

Will this be merged to 2.2?

JRuby issue blocked on merge to 2.2: https://github.com/jruby/jruby/issues/2815
