Feature #18822
Closed
Ruby lacks a proper method to percent-encode strings for URIs (RFC 3986)
Description
Context
There are two fairly similar encoding methods that are often confused: application/x-www-form-urlencoded, which is how form data is encoded, and "percent-encoding" as defined by RFC 3986.
AFAIK, the only way they differ is that "form encoding" escapes space characters as +, and RFC 3986 escapes them as %20. Most of the time it doesn't matter, but sometimes it does.
Ruby form and URL escape methods
- URI.escape(" ") # => "%20" but it was deprecated and removed (in 3.0?).
- ERB::Util.url_encode(" ") # => "%20" but it's implemented with a gsub and isn't very performant. It's also awkward to have to reach for ERB.
- CGI.escape(" ") # => "+"
- URI.encode_www_form_component(" ") # => "+"
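The difference between the methods listed above can be checked side by side. This is only a small sketch: the method name escape_space_with_each_method is made up for illustration, and URI.escape is left out because it is no longer available.

```ruby
require "cgi"
require "erb"
require "uri"

# Hypothetical helper (illustration only): escape a single space with each
# of the methods that still exist and collect the results by method name.
def escape_space_with_each_method
  {
    "CGI.escape" => CGI.escape(" "),
    "URI.encode_www_form_component" => URI.encode_www_form_component(" "),
    "ERB::Util.url_encode" => ERB::Util.url_encode(" "),
  }
end

escape_space_with_each_method.each do |name, output|
  puts "#{name}: #{output.inspect}"
end
```

Only ERB::Util.url_encode produces %20 here; the other two use the form-encoding + convention.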
Unescape methods
For unescaping, it's even more clear-cut since URI.unescape was removed. So there's no available method that treats an unescaped + as simply +.
e.g. in Javascript: decodeURIComponent("foo+bar") #=> "foo+bar".
If one were to use CGI.unescape, the string might be improperly decoded: CGI.unescape("foo+bar") #=> "foo bar".
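To make the decoding problem concrete, here is a small sketch using a path segment that contains literal + characters (which are allowed unescaped in a URI path). The helper name form_decode_all and the sample value "C++" are assumptions chosen for illustration.

```ruby
require "cgi"
require "uri"

# Hypothetical helper (illustration only): decode the same input with both
# form-decoding methods to show that each one rewrites "+" as a space.
def form_decode_all(str)
  [
    CGI.unescape(str),
    URI.decode_www_form_component(str),
  ]
end

# "C++" is a plausible path segment, e.g. the last part of /wiki/C++
p form_decode_all("C++")
```

Both calls turn the two + characters into spaces, so the original segment can no longer be recovered from the decoded value.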
Other languages
- Javascript encodeURI and encodeURIComponent use %20.
- PHP has urlencode using + and rawurlencode using %20.
- Python has urllib.parse.quote using %20 and urllib.parse.quote_plus using +.
Proposal
Since CGI already has a very performant encoder for application/x-www-form-urlencoded, I think it would make sense for it to provide another method for RFC 3986.
I propose:
- CGI.url_encode(" ") # => "%20"
  - Or CGI.encode_url.
- Alias CGI.escape as CGI.encode_www_form_component
- Clarify the documentation of CGI.escape.
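For readers who want to see what the proposed behaviour amounts to, here is a minimal pure-Ruby sketch of RFC 3986 percent-encoding. It is not the proposed implementation: the name percent_encode is hypothetical, and it uses gsub for clarity rather than speed. It assumes the unreserved set from RFC 3986 (letters, digits, -, ., _ and ~) and escapes every other byte.

```ruby
# Hypothetical helper (illustration only, not part of CGI or URI).
# Escapes every byte outside the RFC 3986 "unreserved" set as %XX.
def percent_encode(str)
  # Work on raw bytes so a multi-byte character yields one %XX per byte.
  str.b.gsub(/[^A-Za-z0-9\-._~]/n) { |byte| format("%%%02X", byte.ord) }
end

puts percent_encode(" ")        # the space is written as %20, never as +
puts percent_encode("a+b")      # a literal + is escaped too
puts percent_encode("\u00e9")   # two UTF-8 bytes, so two escapes
```

The only difference from form encoding in this sketch is the treatment of the space character, which matches the description at the top of the issue.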
Updated by byroot (Jean Boussier) over 2 years ago
- Description updated (diff)
I forgot to mention that we'd need the decode method as well.
Updated by byroot (Jean Boussier) over 2 years ago
Proposed implementation: https://github.com/ruby/cgi/pull/26
Updated by ioquatix (Samuel Williams) over 2 years ago
This looks good to me and I think it's a good addition.
In the past, I've referred to this operation as "URI.encode_path" as it seems specifically related to how paths are created and escaped. However, I'd be interested to know if there is a better interpretation of that RFC w.r.t. the naming of the operation.
The counterpoint is, "url_encode" seems a little confusing, because it sounds like it's encoding a "URL". One could imagine an interface like url_encode(scheme, user, password, host, port, path, query, fragment).
Updated by mame (Yusuke Endoh) over 2 years ago
We discussed this issue at the dev meeting. How about the following?
- Introduce CGI.escapeURIComponent(str) that behaves like CGI.escape, except that a space is encoded as %20 instead of + (as @byroot (Jean Boussier) proposed)
- Introduce CGI.unescapeURIComponent(str) that is a reverse operation.
- Introduce two aliases like CGI.escape_uri_component(str)
- Do not introduce CGI.encode_www_form_component (but improvement of the rdoc of CGI.escape is welcome)
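The behaviour described in this list can be sketched as follows. The two wrapper methods escape_component and unescape_component are hypothetical names, not part of CGI. They call CGI.escapeURIComponent and CGI.unescapeURIComponent when the installed cgi provides them, and otherwise fall back to a rewrite of CGI.escape and CGI.unescape that should give the same result under the description above.

```ruby
require "cgi"

# Hypothetical wrapper (illustration only): escape one URI component.
def escape_component(str)
  if CGI.respond_to?(:escapeURIComponent)
    CGI.escapeURIComponent(str)
  else
    # CGI.escape writes a literal "+" as "%2B", so any "+" left in its
    # output stands for a space and can be rewritten as "%20".
    CGI.escape(str).gsub("+", "%20")
  end
end

# Hypothetical wrapper (illustration only): the reverse operation.
def unescape_component(str)
  if CGI.respond_to?(:unescapeURIComponent)
    CGI.unescapeURIComponent(str)
  else
    # Protect literal "+" so CGI.unescape does not turn it into a space.
    CGI.unescape(str.gsub("+", "%2B"))
  end
end

puts escape_component("a b+c")      # space becomes %20, "+" becomes %2B
puts unescape_component("a+b%20c")  # "+" is kept, %20 becomes a space
```

The fallback branches exist only so the sketch also runs where the new methods are missing; with a cgi version that includes this change they are never taken.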
(There was a very long discussion, but I didn't understand it due to my lack of knowledge. Please see the dev-meeting-log.)
Updated by byroot (Jean Boussier) over 2 years ago
How about the following?
Sounds good to me at first sight. I can work on a patch for it soon, if I notice any issue when implementing it I'll report it back here.
And thank you for the meeting notes, they are incredibly useful for me.
Updated by byroot (Jean Boussier) over 2 years ago
I forgot I already had a PR, I updated it to match the spec that was accepted during the meeting: https://github.com/ruby/cgi/pull/26
Updated by sam.saffron (Sam Saffron) over 2 years ago
+1, for context a similar issue we recently hit:
https://github.com/sporkmonger/addressable/issues/472
Turns out tons of projects these days rely on addressable for a more complete API, it would be nice only to need to lean on it in extreme outlier cases.
I wonder if there are other surfaces of addressable which should be pulled into core, at Discourse we lean on this:
I think it is important to have "enough" tools in core MRI to deal with URLs such as https://ko.wikipedia.org/wiki/위%20/?abc%3A and somehow get a canonical URL out of it, just like web browsers are able to "figure out the right thing to do".
Updated by sam.saffron (Sam Saffron) over 2 years ago
Since we just finished working around a nightmare scenario here @byroot (Jean Boussier), I think it is rather instructive to see a real world problem
The problem:
You get something that is probably a URL from somewhere, and you need to be able to make requests to it.
- It can have a unicode domain that needs to run through an IDN converter
- It can have unicode chars that need percent encoding
- It can be unescaped, or it can be escaped (and in weird cases part escaped)
Ideally you want to normalize as well, so caching is "stronger" and does not break for identical URLs. (following the general guidelines in the RFC)
So we ended up with this monster and travesty, partly powered by URI in MRI, partly powered by addressable, 100% hack.
https://github.com/discourse/discourse/blob/main/lib/url_helper.rb#L72-L105
Updated by byroot (Jean Boussier) over 2 years ago
It's mostly just waiting for review. I'm not certain who's the maintainer though. @hsbt (Hiroshi SHIBATA) @nobu (Nobuyoshi Nakada) I added you both for review, but if someone is more suitable please let me know.
Updated by ioquatix (Samuel Williams) over 2 years ago
I'm positive on this feature, but I'm negative on the naming convention being introduced. Are we going to add aliases?
Updated by byroot (Jean Boussier) over 2 years ago
Ah right, I forgot to add the aliases that were agreed upon. I'll open a followup: CGI.escape_uri_component and CGI.unescape_uri_component.
Updated by byroot (Jean Boussier) over 2 years ago
Here we go: https://github.com/ruby/cgi/pull/27
Updated by ioquatix (Samuel Williams) over 2 years ago
Should we also add aliases for escape_html and so on?
Updated by byroot (Jean Boussier) over 2 years ago
Maybe, but that would need approval. If you feel strongly about it please open a dedicated issue.
Updated by byroot (Jean Boussier) over 2 years ago
- Status changed from Open to Closed
Applied in changeset git|3850113e20b8c031529fc79de7202f61604425dd.
[ruby/cgi] Implement CGI.url_encode and CGI.url_decode
[Feature #18822]
Ruby is somewhat missing an RFC 3986 compliant escape method.
Updated by sam.saffron (Sam Saffron) over 2 years ago
I am not sure the name is right here: CGI.path_encode with an alias of CGI.params_encode is far more correct.
Because as it stands:
CGI.url_encode("https://i❤️.ws/❤️?test=❤️")
will return an incorrect result.
CGI.url_encode("https://i❤️.ws")
should return https://xn--i-7iq.ws/
Alternatively ... we "fix" url_encode?
Should I open a new ticket?
Updated by byroot (Jean Boussier) over 2 years ago
@sam.saffron it's my fault for forgetting to update the commit message. CGI.url_encode was never implemented; what was implemented is CGI.escapeURIComponent.
Updated by noraj (Alexandre ZANNI) over 1 year ago
I just want to complete what was said before.
URI.escape and URI.unescape were deprecated, but they were replaced by URI::Parser.new.escape and URI::Parser.new.unescape, which implement RFC 2396. In fact this is calling URI::RFC2396_Parser.escape and URI::RFC2396_Parser.unescape.
But the new methods are not useless: RFC 2396 was a Draft Standard that has been obsoleted and updated by RFC 3986, which is an Internet Standard, and CGI.escapeURIComponent and CGI.unescapeURIComponent implement RFC 3986.
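As a rough illustration of the RFC 2396 parser mentioned above, the sketch below calls escape and unescape on a URI::RFC2396_Parser instance. The constant name LEGACY_PARSER and the sample strings are assumptions made for this example.

```ruby
require "uri"

# Illustration only: an RFC 2396 parser instance, as referred to above.
LEGACY_PARSER = URI::RFC2396_Parser.new

# The space is percent-encoded, but characters such as "/", "?" and "="
# are left alone, so this is not a drop-in escape for a single component.
puts LEGACY_PARSER.escape("a b/c?d=e")

# A "+" is not treated as a space when unescaping.
puts LEGACY_PARSER.unescape("foo+bar")
puts LEGACY_PARSER.unescape("foo%20bar")
```

Because reserved characters pass through escape untouched, this parser suits whole URI strings better than individual components, which is where the CGI methods from this issue apply.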