Feature #18822: Ruby lack a proper method to percent-encode strings for URIs (RFC 3986) - Ruby - Ruby Issue Tracking System

Feature #18822

closed

Ruby lack a proper method to percent-encode strings for URIs (RFC 3986)

Added by byroot (Jean Boussier) about 3 years ago. Updated almost 2 years ago.

Status:

Closed

Assignee:

Target version:

[ruby-core:108822]

Description

Context

There are two fairly similar encoding methods that are often confused: application/x-www-form-urlencoded, which is how form data is encoded, and "percent-encoding" as defined by RFC 3986.

AFAIK, the only way they differ is that "form encoding" escapes space characters as +, while RFC 3986 escapes them as %20. Most of the time it doesn't matter, but sometimes it does.

Ruby form and URL escape methods

URI.escape(" ") # => "%20", but it was deprecated and then removed in Ruby 3.0.
ERB::Util.url_encode(" ") # => "%20", but it's implemented with a gsub and isn't very performant. It's also awkward to have to reach for ERB.
CGI.escape(" ") # => "+"
URI.encode_www_form_component(" ") # => "+"
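
To make the difference concrete, here is a small sketch comparing the three methods that are still available. The sample input "a b+c" is just an illustration; it contains both a space and a literal plus sign, since that is where the two encodings diverge.

```ruby
require "cgi"
require "uri"
require "erb"

input = "a b+c" # illustrative input: one space, one literal "+"

# Form encoding (application/x-www-form-urlencoded): space becomes "+",
# and a literal "+" has to be escaped as %2B to stay distinguishable.
puts CGI.escape(input)
puts URI.encode_www_form_component(input)

# Percent-encoding in the RFC 3986 sense: space becomes %20.
puts ERB::Util.url_encode(input)
```

The first two calls print the same form-encoded string, while only the ERB helper emits %20 for the space.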

Unescape methods

For unescaping, it's even more clear-cut since URI.unescape was removed: there's no available method that treats an unescaped + as a literal +.

e.g. in Javascript: decodeURIComponent("foo+bar") #=> "foo+bar".

If one were to use CGI.unescape, the string might be improperly decoded: CGI.unescape("foo+bar") #=> "foo bar".
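
The decoding side can be checked the same way. This is only a sketch of the behaviour described above, using the two form decoders from the standard library:

```ruby
require "cgi"
require "uri"

# Both form decoders treat an unescaped "+" as an encoded space.
puts CGI.unescape("foo+bar").inspect
puts URI.decode_www_form_component("foo+bar").inspect

# A literal "+" only survives form decoding if it was escaped as %2B.
puts CGI.unescape("foo%2Bbar").inspect
```

So a string produced by an RFC 3986 style encoder, which is free to leave + unescaped in some URI components, can come back with spaces it never contained.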

Other languages

Javascript encodeURI and encodeURIComponent use %20.
PHP has urlencode using + and rawurlencode using %20.
Python has urllib.parse.quote using %20 and urllib.parse.quote_plus using +.

Proposal

Since CGI already has a very performant encoder for application/x-www-form-urlencoded, I think it would make sense for it to provide another method for RFC 3986.

I propose:

CGI.url_encode(" ") # => "%20"
Or CGI.encode_url.
Alias CGI.escape as CGI.encode_www_form_component
Clarify the documentation of CGI.escape.

#1 [ruby-core:108829]

Updated by byroot (Jean Boussier) about 3 years ago

Description updated (diff)

I forgot to mention that we'd need the decode method as well.

Updated by byroot (Jean Boussier) about 3 years ago

Proposed implementation: https://github.com/ruby/cgi/pull/26

#3 [ruby-core:108836]

Updated by ioquatix (Samuel Williams) about 3 years ago

This looks good to me and I think it's a good addition.

In the past, I've referred to this operation as "URI.encode_path" as it seems specifically related to how paths are created and escaped. However, I'd be interested to know if there is a better interpretation of that RFC w.r.t. the naming of the operation.

The counterpoint is that "url_encode" seems a little confusing, because it sounds like it's encoding a whole "URL". One could imagine an interface like url_encode(scheme, user, password, host, port, path, query, fragment).

#4 [ruby-core:109412]

Updated by mame (Yusuke Endoh) about 3 years ago

We discussed this issue at the dev meeting. How about the following?

Introduce CGI.escapeURIComponent(str) that behaves like CGI.escape, except that a space is encoded as %20 instead of + (as @byroot (Jean Boussier) proposed)
Introduce CGI.unescapeURIComponent(str) that is a reverse operation.
Introduce two aliases like CGI.escape_uri_component(str)
Do not introduce CGI.encode_www_form_component (but improvement of the rdoc of CGI.escape is welcome)
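
As a rough sketch of what this looks like to a caller: the wrapper names percent_encode and percent_decode below are hypothetical and not part of CGI. They call the proposed methods when the running Ruby provides them (to my knowledge Ruby 3.2 and later) and otherwise fall back to an approximation built on CGI.escape and CGI.unescape.

```ruby
require "cgi"

# Hypothetical wrapper: encode a URI component, writing a space as %20.
def percent_encode(str)
  if CGI.respond_to?(:escapeURIComponent)
    CGI.escapeURIComponent(str)
  else
    # Fallback assumption: CGI.escape already writes a literal "+" as %2B,
    # so every "+" left in its output stands for a space.
    CGI.escape(str).gsub("+", "%20")
  end
end

# Hypothetical wrapper: decode a URI component, keeping "+" as a literal "+".
def percent_decode(str)
  if CGI.respond_to?(:unescapeURIComponent)
    CGI.unescapeURIComponent(str)
  else
    # Fallback assumption: protect literal "+" before the form-style decoder runs.
    CGI.unescape(str.gsub("+", "%2B"))
  end
end

puts percent_encode("a b+c")
puts percent_decode("foo+bar")
```

The respond_to? check is only there so the sketch also runs on older Rubies; code that targets a Ruby with the new methods can call CGI.escapeURIComponent and CGI.unescapeURIComponent directly.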

(There was a very long discussion, but I didn't understand it due to my lack of knowledge. Please see the dev-meeting-log.)

#5 [ruby-core:109413]

Updated by byroot (Jean Boussier) about 3 years ago

How about the following?

Sounds good to me at first sight. I can work on a patch for it soon, if I notice any issue when implementing it I'll report it back here.

And thank you for the meeting notes, they're incredibly useful for me.

#6 [ruby-core:109422]

Updated by byroot (Jean Boussier) about 3 years ago

I forgot I already had a PR, I updated it to match the spec that was accepted during the meeting: https://github.com/ruby/cgi/pull/26

#7 [ruby-core:109445]

Updated by sam.saffron (Sam Saffron) almost 3 years ago

+1, for context a similar issue we recently hit:

https://github.com/sporkmonger/addressable/issues/472

Turns out tons of projects these days rely on addressable for a more complete API, it would be nice only to need to lean on it in extreme outlier cases.

I wonder if there are other surfaces of addressable which should be pulled into core, at Discourse we lean on this:

https://github.com/discourse/discourse/blob/0df1c4eab2e1a15cd2414e88265fb9be329ac00b/lib/url_helper.rb#L21-L66

I think it is important to have "enough" tools in core MRI to deal with URLs such as https://ko.wikipedia.org/wiki/위%20/?abc%3A and somehow get a canonical URL out of them, just like web browsers are able to "figure out the right thing to do".

#8 [ruby-core:109488]

Updated by sam.saffron (Sam Saffron) almost 3 years ago

Since we just finished working around a nightmare scenario here @byroot (Jean Boussier), I think it is rather instructive to see a real-world problem.

The problem:

You get something from somewhere that is probably a URL, and you need to be able to make requests to it.

It can have a unicode domain that needs to run through an IDN converter
It can have unicode chars that need percent encoding
It can be unescaped, or it can be escaped (and in weird cases part escaped)

Ideally you want to normalize as well, so caching is "stronger" and does not break for identical URLs. (following the general guidelines in the RFC)

So we ended up with this monster and travesty, partly powered by URI in MRI, partly powered by addressable, 100% hack.

https://github.com/discourse/discourse/blob/main/lib/url_helper.rb#L72-L105
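
To illustrate just the "escaped, unescaped, or part escaped" point for a single path segment, here is a minimal sketch that decodes first and then re-encodes. It is not the Discourse code linked above; normalize_segment is a hypothetical name, and the sketch assumes that any valid %XX sequence in the input is an escape rather than literal text, which is exactly the ambiguity that makes the real problem hard. IDN conversion and full URL parsing are out of scope here.

```ruby
require "cgi"
require "erb"

# Hypothetical helper: bring one path segment to a single escaped form,
# whether it arrives raw, fully escaped, or partly escaped.
def normalize_segment(segment)
  # Protect literal "+" so the form-style decoder does not turn it into a space.
  decoded = CGI.unescape(segment.gsub("+", "%2B"))
  # Re-encode everything outside the RFC 3986 unreserved set.
  ERB::Util.url_encode(decoded)
end

puts normalize_segment("a b")    # raw input
puts normalize_segment("a%20b")  # already escaped input
puts normalize_segment("위%20")   # part-escaped input with raw Unicode
```

Under that assumption the raw and the escaped spellings of the same segment map to the same string, which is what makes cache keys stable.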

#9 [ruby-core:109490]

Updated by byroot (Jean Boussier) almost 3 years ago

It's mostly just waiting for review. I'm not certain who's the maintainer though. @hsbt (Hiroshi SHIBATA) @nobu (Nobuyoshi Nakada) I added you both for review, but if someone is more suitable please let me know.

#10 [ruby-core:109491]

Updated by ioquatix (Samuel Williams) almost 3 years ago

I'm positive on this feature, but I'm negative on the naming convention being introduced. Are we going to add aliases?

#11 [ruby-core:109492]

Updated by byroot (Jean Boussier) almost 3 years ago

Ah right, I forgot to add the aliases that were agreed upon. I'll open a followup: CGI.escape_uri_component and CGI.unescape_uri_component.

#12 [ruby-core:109493]

Updated by byroot (Jean Boussier) almost 3 years ago

Here we go: https://github.com/ruby/cgi/pull/27

#13 [ruby-core:109494]

Updated by ioquatix (Samuel Williams) almost 3 years ago

Should we also add aliases for escape_html and so on?

#14 [ruby-core:109495]

Updated by byroot (Jean Boussier) almost 3 years ago

Maybe, but that would need approval. If you feel strongly about it please open a dedicated issue.

#15

Updated by byroot (Jean Boussier) almost 3 years ago

Status changed from Open to Closed

Applied in changeset git|3850113e20b8c031529fc79de7202f61604425dd.

[ruby/cgi] Implement CGI.url_encode and CGI.url_decode

[Feature #18822]

Ruby is somewhat missing an RFC 3986 compliant escape method.

https://github.com/ruby/cgi/commit/c2729c7f33

#16 [ruby-core:109511]

Updated by sam.saffron (Sam Saffron) almost 3 years ago

@byroot (Jean Boussier),

I am not sure the name is right here:

CGI.path_encode

with an alias of

CGI.params_encode

is far more correct.

Cause as it stands:

CGI.url_encode("https://i❤️.ws/❤️?test=❤️") will return an incorrect result.

CGI.url_encode("https://i❤️.ws") should return https://xn--i-7iq.ws/

Alternatively ... we "fix" url_encode?

Should I open a new ticket?

#17 [ruby-core:109512]

Updated by byroot (Jean Boussier) almost 3 years ago

@sam.saffron it's my fault for forgetting to update the commit message. CGI.url_encode was never implemented; what was implemented is CGI.escapeURIComponent.

#18 [ruby-core:115054]

Updated by noraj (Alexandre ZANNI) almost 2 years ago

I just want to complete what was said before.

URI.escape and URI.unescape were deprecated, but they were replaced by URI::Parser.new.escape and URI::Parser.new.unescape, which implement RFC 2396. In fact this is calling URI::RFC2396_Parser#escape and URI::RFC2396_Parser#unescape.

But the new methods are not useless, since RFC 2396 was a Draft Standard that was obsoleted and updated by RFC 3986, which is an Internet Standard, and CGI.escapeURIComponent and CGI.unescapeURIComponent implement RFC 3986.
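
A short sketch of the practical difference, assuming the parser's default set of "unsafe" characters: the RFC 2396 parser leaves reserved characters such as & untouched, so it is not a drop-in component encoder. ERB::Util.url_encode is used for the comparison here only because it is available on older Rubies as well.

```ruby
require "uri"
require "erb"

parser = URI::RFC2396_Parser.new

# RFC 2396 style escaping: the space is escaped, the reserved "&" is kept.
puts parser.escape("a b&c")
puts parser.unescape("a%20b")

# Component-style percent-encoding escapes the "&" as well.
puts ERB::Util.url_encode("a b&c")
```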
