Feature #18822

open

Ruby lacks a proper method to percent-encode strings for URIs (RFC 3986)

Added by byroot (Jean Boussier) 2 months ago. Updated about 22 hours ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108822]

Description

Context

There are two fairly similar encoding methods that are often confused.

application/x-www-form-urlencoded which is how form data is encoded, and "percent-encoding" as defined by RFC 3986.

AFAIK, the only way they differ is that "form encoding" escapes space characters as +, while RFC 3986 escapes them as %20. Most of the time it doesn't matter, but sometimes it does.

Ruby form and URL escape methods

  • URI.escape(" ") # => "%20" but it was deprecated and then removed in Ruby 3.0.
  • ERB::Util.url_encode(" ") # => "%20" but it's implemented with a gsub and isn't very performant. It's also awkward to have to reach for ERB.
  • CGI.escape(" ") # => "+"
  • URI.encode_www_form_component(" ") # => "+"
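
To make the difference concrete, here is a minimal sketch comparing the three methods that are still available. The sample string "a b+c" is only an illustration; it contains both a space and a literal +.

```ruby
require "cgi"
require "erb"
require "uri"

sample = "a b+c" # illustrative input: one space, one literal "+"

# Form encoding: the space becomes "+", the literal "+" becomes "%2B".
puts CGI.escape(sample)                     # a+b%2Bc
puts URI.encode_www_form_component(sample)  # a+b%2Bc

# Percent-encoding: the space becomes "%20".
puts ERB::Util.url_encode(sample)           # a%20b%2Bc
```

All three agree on the literal +; only the space is treated differently.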

Unescape methods

For unescaping, it's even more clear-cut since URI.unescape was removed: none of the remaining methods leaves an unescaped + as a literal +.

e.g. in JavaScript: decodeURIComponent("foo+bar") #=> "foo+bar".

If one were to use CGI.unescape, the string might be improperly decoded: CGI.unescape("foo+bar") #=> "foo bar".
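
A possible workaround, sketched below, is to protect literal + characters before handing the string to the form decoder. The helper name percent_decode is hypothetical and not part of any library; this is only one way to approximate RFC 3986 decoding with what CGI offers today.

```ruby
require "cgi"

# Hypothetical helper (not a library API): rewrite every literal "+" as
# "%2B" first, so CGI.unescape has no "+" left to turn into a space.
def percent_decode(str)
  CGI.unescape(str.gsub("+", "%2B"))
end

puts CGI.unescape("foo+bar")      # foo bar  (form decoding)
puts percent_decode("foo+bar")    # foo+bar  (literal "+" preserved)
puts percent_decode("foo%20bar")  # foo bar
```

The extra gsub adds a string copy on each call, which is part of why a dedicated method seems preferable.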

Other languages

  • Javascript encodeURI and encodeURIComponent use %20.
  • PHP has urlencode using + and rawurlencode using %20.
  • Python has urllib.parse.quote using %20 and urllib.parse.quote_plus using +.

Proposal

Since CGI already has a very performant encoder for application/x-www-form-urlencoded, I think it would make sense for it to provide another method for RFC 3986.

I propose:

  • CGI.url_encode(" ") # => "%20"
  • Or CGI.encode_url.
  • Alias CGI.escape as CGI.encode_www_form_component
  • Clarify the documentation of CGI.escape.

Updated by byroot (Jean Boussier) 2 months ago

  • Description updated (diff)

I forgot to mention that we'd need the decode method as well.

Updated by byroot (Jean Boussier) 2 months ago

Proposed implementation: https://github.com/ruby/cgi/pull/26

Updated by ioquatix (Samuel Williams) 2 months ago

This looks good to me and I think it's a good addition.

In the past, I've referred to this operation as "URI.encode_path" as it seems specifically related to how paths are created and escaped. However, I'd be interested to know if there is a better interpretation of that RFC w.r.t. the naming of the operation.

The counterpoint is that "url_encode" seems a little confusing, because it sounds like it's encoding a "URL". One could imagine an interface like url_encode(scheme, user, password, host, port, path, query, fragment).

Updated by mame (Yusuke Endoh) 7 days ago

We discussed this issue at the dev meeting. How about the following?

  • Introduce CGI.escapeURIComponent(str) that behaves like CGI.escape, except that a space is encoded as %20 instead of + (as @byroot (Jean Boussier) proposed)
  • Introduce CGI.unescapeURIComponent(str) that is a reverse operation.
  • Introduce two aliases like CGI.escape_uri_component(str)
  • Do not introduce CGI.encode_www_form_component (but improvement of the rdoc of CGI.escape is welcome)
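
As a rough sketch of how the methods named above might be used, the wrapper below calls CGI.escapeURIComponent and CGI.unescapeURIComponent when the installed cgi provides them, and otherwise falls back to pure-Ruby approximations. The UriComponent module and its fallbacks are assumptions made for illustration; they are not part of the proposal or the patch.

```ruby
require "cgi"

# Hypothetical wrapper module; the name UriComponent is an assumption.
module UriComponent
  module_function

  def escape(str)
    if CGI.respond_to?(:escapeURIComponent)
      CGI.escapeURIComponent(str)
    else
      # CGI.escape turns a literal "+" into "%2B", so any "+" left in its
      # output can only have come from a space.
      CGI.escape(str).gsub("+", "%20")
    end
  end

  def unescape(str)
    if CGI.respond_to?(:unescapeURIComponent)
      CGI.unescapeURIComponent(str)
    else
      # Protect literal "+" so the form decoder does not turn it into a space.
      CGI.unescape(str.gsub("+", "%2B"))
    end
  end
end

puts UriComponent.escape("a b+c")      # a%20b%2Bc
puts UriComponent.unescape("a+b%20c")  # a+b c
```

Either branch should give the same results for these inputs, so callers would not need to care which one is active.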

(There was a very long discussion, but I didn't understand it due to my lack of knowledge. Please see the dev-meeting-log.)

Updated by byroot (Jean Boussier) 7 days ago

How about the following?

Sounds good to me at first sight. I can work on a patch for it soon, if I notice any issue when implementing it I'll report it back here.

And thank you for the meeting notes, they're incredibly useful for me.

Updated by byroot (Jean Boussier) 5 days ago

I forgot I already had a PR, I updated it to match the spec that was accepted during the meeting: https://github.com/ruby/cgi/pull/26

Updated by sam.saffron (Sam Saffron) about 22 hours ago

+1, for context a similar issue we recently hit:

https://github.com/sporkmonger/addressable/issues/472

Turns out tons of projects these days rely on addressable for a more complete API, it would be nice only to need to lean on it in extreme outlier cases.

I wonder if there are other surfaces of addressable which should be pulled into core. At Discourse we lean on this:

https://github.com/discourse/discourse/blob/0df1c4eab2e1a15cd2414e88265fb9be329ac00b/lib/url_helper.rb#L21-L66

I think it is important to have "enough" tools in core MRI to deal with URLs such as https://ko.wikipedia.org/wiki/위%20/?abc%3A and somehow get a canonical URL out of them, just like web browsers are able to "figure out the right thing to do".
