Project

General

Profile

Actions

Feature #18822

closed

Ruby lack a proper method to percent-encode strings for URIs (RFC 3986)

Added by byroot (Jean Boussier) over 2 years ago. Updated about 1 year ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:108822]

Description

Context

There are two fairly similar encoding methods that are often confused.

application/x-www-form-urlencoded which is how form data is encoded, and "percent-encoding" as defined by RFC 3986.

AFAIK, the only way they differ is that "form encoding" escape space characters as +, and RFC 3986 escape them as %20. Most of the time it doesn't matter, but sometimes it does.

Ruby form and URL escape methods

  • URI.escape(" ") # => "%20" but it was deprecated and removed (in 3.0 ?).
  • ERB::Util.url_encode(" ") # => "%20" but it's implemented with a gsub and isn't very performant. It's also awkward to have to reach for ERB
  • CGI.escape(" ") # => "+"
  • URI.encode_www_form_component(" ") # => "+"

Unescape methods

For unescaping, it's even more of a clear cut since URI.unescape was removed. So there's no available method that won't treat an unescaped + as simply +.

e.g. in Javascript: decodeURIComponent("foo+bar") #=> "foo+bar".

If one were to use CGI.unescape, the string might be improperly decoded: GI.unescape("foo+bar") #=> "foo bar".

Other languages

  • Javascript encodeURI and encodeURIComponent use %20.
  • PHP has urlencode using + and rawurlencode using %20.
  • Python has urllib.parse.quote using %20 and urllib.parse.quote_plus using +.

Proposal

Since CGI already have a very performant encoder for application/x-www-form-urlencoded, I think it would make sense that it would provide another method for RFC3986.

I propose:

  • CGI.url_encode(" ") # => "%20"
  • Or CGI.encode_url.
  • Alias CGI.escape as GCI.encode_www_form_component
  • Clarify the documentation of CGI.escape.

Updated by byroot (Jean Boussier) over 2 years ago

  • Description updated (diff)

I forgot to mention that we'd need the decode method as well.

Updated by ioquatix (Samuel Williams) over 2 years ago

This looks good to me and I think it's a good addition.

In the past, I've referred to this operation as "URI.encode_path" as it seems specifically related to how paths are created and escaped. However, I'd be interested to know if there is a better interpretation of that RFC w.r.t. the naming of the operation.

The counter point is, "url_encode" seems a little confusing, because it sounds like it's encoding a "URL". One could imagine an interface like url_encode(scheme, user, password, host, port, path, query, fragment).

Updated by mame (Yusuke Endoh) over 2 years ago

We discussed this issue at the dev meeting. How about the following?

  • Introduce CGI.escapeURIComponent(str) that behaves like CGI.escape, except that a space is encoded as %20 instead of + (as @byroot (Jean Boussier) proposed)
  • Introduce CGI.unescapeURIComponent(str) that is a reverse operation.
  • Introduce two aliases like CGI.escape_uri_component(str)
  • Do not introduce CGI.encode_www_form_component (but improvement of the rdoc of CGI.escape is welcome)

(There was a very long discussion, but I didn't understand it due to my lack of knowledge. Please see the dev-meeting-log.)

Updated by byroot (Jean Boussier) over 2 years ago

How about the following?

Sounds good to me at first sight. I can work on a patch for it soon, if I notice any issue when implementing it I'll report it back here.

And thank you for the meeting notes, it's incredibly useful or me.

Updated by byroot (Jean Boussier) over 2 years ago

I forgot I already had a PR, I updated it to match the spec that was accepted during the meeting: https://github.com/ruby/cgi/pull/26

Updated by sam.saffron (Sam Saffron) over 2 years ago

+1, for context a similar issue we recently hit:

https://github.com/sporkmonger/addressable/issues/472

Turns out tons of projects these days rely on addressable for a more complete API, it would be nice only to need to lean on it in extreme outlier cases.

I wonder if there are other surfaces of addressable which should be pulled into core, at Discourse we lean on this:

https://github.com/discourse/discourse/blob/0df1c4eab2e1a15cd2414e88265fb9be329ac00b/lib/url_helper.rb#L21-L66

I think it is important to have "enough" tools in core MRI to deal with URLS such as https://ko.wikipedia.org/wiki/위%20/?abc%3A and somehow get a canonical URL out of it, just like web browsers are able to "figure out the right thing to do"

Updated by sam.saffron (Sam Saffron) about 2 years ago

Since we just finished working around a nightmare scenario here @byroot (Jean Boussier), I think it is rather instructive to see a real world problem

The problem:

You get something, that is probably a URL from somewhere and need to be able to make requests to it.

  • It can have a unicode domain that needs to run through an IDN converter
  • It can have unicode chars that need percent encoding
  • It can be unescaped, or it can be escaped (and in weird cases part escaped)

Ideally you want to normalize as well, so caching is "stronger" and does not break for identical URLs. (following the general guidelines in the RFC)

So we ended up with this monster and travesty, partly powered by URI in MRI, partly powered by addressable, 100% hack.

https://github.com/discourse/discourse/blob/main/lib/url_helper.rb#L72-L105

Updated by byroot (Jean Boussier) about 2 years ago

It's mostly just waiting for review. I'm not certain who's the maintainer though. @hsbt (Hiroshi SHIBATA) @nobu (Nobuyoshi Nakada) I added you both for review, but if someone is more suitable please let me know.

Updated by ioquatix (Samuel Williams) about 2 years ago

I'm positive on this feature, but I'm negative on the naming convention being introduced. Are we going to add aliases?

Updated by byroot (Jean Boussier) about 2 years ago

Ah right, I forgot to add the alias that were agreed upon. I'll open a followup: GCI.escape_uri_component and GCI.unescape_uri_component.

Updated by ioquatix (Samuel Williams) about 2 years ago

Should we also add aliases for escape_html and so on?

Updated by byroot (Jean Boussier) about 2 years ago

Maybe, but that would need approval. If you feel strongly about it please open a dedicated issue.

Actions #15

Updated by byroot (Jean Boussier) about 2 years ago

  • Status changed from Open to Closed

Applied in changeset git|3850113e20b8c031529fc79de7202f61604425dd.


[ruby/cgi] Implement CGI.url_encode and CGI.url_decode

[Feature #18822]

Ruby is somewhat missing an RFC 3986 compliant escape method.

https://github.com/ruby/cgi/commit/c2729c7f33

Updated by sam.saffron (Sam Saffron) about 2 years ago

@byroot (Jean Boussier),

I am not sure the name is right here:

CGI.path_encode

with an alias of

CGI.params_encode

is far more correct.

Cause as it stands:

CGI.url_encode("https://i❤️.ws/❤️?test=❤️") will return an incorrect result.

CGI.url_encode("https://i❤️.ws") should return https://xn--i-7iq.ws/

Alternatively ... we "fix" url_encode?

Should I open a new ticket?

Updated by byroot (Jean Boussier) about 2 years ago

@sam.saffron (Sam Saffron) it's my fault for forgetting to update the commit message. CGI.url_encode was never implemented, what was is CGI.escapeURIComponent.

Updated by noraj (Alexandre ZANNI) about 1 year ago

I just want to complete what was said before.

URI.escape and URI.unescape were deprecated but they were replaced by URI::Parser.new.escape and URI::Parser.new.unescape that implements RFC 2396. In fact this is calling URI::RFC2396_Parser.escape and URI::RFC2396_Parser.unescape.

But it's not useless since RFC 2396 was a Draft Standard and was obsoleted and updated by RFC 3986 which is an Internet Standard as CGI.escapeURIComponent and CGI.unescapeURIComponent implements RFC 3986.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0