Feature #10391

Provide %eISO-8859-1'string \xAA literal' string literals with explicit encoding

Added by duerst (Martin Dürst) almost 5 years ago. Updated almost 5 years ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:65743]

Description

There is occasionally a need to use a string literal with an Encoding different from the source encoding.
This proposes to use %e (e for encoding) to introduce such string literals.

The syntax used in the subject relies on the fact that the set of characters used in Encoding names and the set of characters used to surround the actual string in a %-literal are completely disjoint (or if they currently aren't, can be made completely disjoint). Alternatives would be to use % as a separator before and/or after the encoding, e.g. like this:

  • %eISO-8859-1'string \xAA literal' # original proposal
  • %e%ISO-8859-1%'string \xAA literal' # before and after
  • %e%ISO-8859-1'string \xAA literal' # before only
  • %eISO-8859-1%'string \xAA literal' # after only
  • %e(ISO-8859-1)(string \xAA literal) # surrounding the encoding name
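For comparison, the effect of the proposed literal can be approximated in current Ruby by retagging an ordinary literal. This is only a sketch: it assumes the byte-preserving reading of the proposal, the variable name is illustrative, and the %e syntax itself does not exist.

```ruby
# Approximates the proposed %eISO-8859-1'string \xAA literal' with existing
# methods. dup avoids mutating a literal that may be frozen.
latin1 = "string \xAA literal".dup.force_encoding(Encoding::ISO_8859_1)

puts latin1.encoding                 # the encoding tag after retagging
puts latin1.valid_encoding?          # every byte value is valid in ISO-8859-1
puts latin1.encode(Encoding::UTF_8)  # byte 0xAA is U+00AA in ISO-8859-1
```

The proposed literal would make the method call unnecessary, which matters most when string literals are frozen.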

The most frequent use of this would be with binary, so we probably want to allow a shortcut for binary, e.g.

  • %eB'binary \x80 string' or even just
  • %b'binary \x08 string'

We could then in the long term deprecate String#b, and go back to checking string validity at creation.
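As a point of reference, this is how a binary string is spelled today, using String#b or force_encoding. The %b and %eB forms are only proposed; the variable names below are illustrative.

```ruby
# Two existing ways to obtain a binary (ASCII-8BIT) string from a literal.
via_b     = "binary \x80 string".b
via_force = "binary \x80 string".dup.force_encoding(Encoding::BINARY)

puts via_b.encoding == Encoding::BINARY  # both carry the binary encoding tag
puts via_b == via_force                  # same bytes, same encoding
puts via_b.valid_encoding?               # any byte sequence is valid as binary
```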

The upper/lowercase distinction can be used to distinguish single-quoted strings (%e) and double-quoted strings (%E). We probably also want something for regular expressions, but I'm not sure which letter is best.
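The existing %q and %Q literals already follow this lowercase/uppercase convention, so a %e/%E pair would presumably behave analogously. A short sketch of the existing behaviour:

```ruby
# %q behaves like a single-quoted string: no escapes, no interpolation.
single = %q(a\tb #{1 + 1})
# %Q behaves like a double-quoted string: \t is a tab and #{} is evaluated.
double = %Q(a\tb #{1 + 1})

puts single
puts double
```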

There is one question about semantics: What's the meaning of e.g. %eGB2312'松本' in a program with a source encoding of UTF-8 or Shift_JIS? In some cases, it might be convenient to have the result contain the same characters. But that would mean that the data needs to be transcoded, and that could fail. The easier way to define this is that the result is the same as '松本'.force_encoding('GB2312'), i.e. just using the byte values.
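The two candidate semantics can be compared with existing methods. In this sketch the source string is written with \u escapes so that it is UTF-8 regardless of the file's source encoding; the variable names are illustrative.

```ruby
source = "\u677E\u672C"  # 松本 as a UTF-8 string

# Byte semantics: keep the bytes, change only the encoding tag.
retagged = source.dup.force_encoding("GB2312")
# Character semantics: keep the characters, convert the bytes.
transcoded = source.encode("GB2312")

puts retagged.bytes == source.bytes       # bytes are untouched by retagging
puts retagged.valid_encoding?             # the UTF-8 bytes need not be valid GB2312
puts transcoded.bytes == source.bytes     # transcoding produces different bytes
puts transcoded.encode("UTF-8") == source # the characters survive a round trip
```

With byte semantics the result depends on the source encoding of the file, so the same literal can produce different strings, possibly invalid ones, in differently encoded files.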


Related issues

Related to CommonRuby - Feature #8848: Syntax for binary strings (Open, 08/31/2013)

History

Updated by duerst (Martin Dürst) almost 5 years ago

Updated by akr (Akira Tanaka) almost 5 years ago

It is useful when string literals are frozen.
So I think this feature is good to have.
The syntax is a problem, though.

However, I feel that %eGB2312'文字列' should preserve characters, not bytes.
I.e. I think %eGB2312'文字列' should be interpreted as '文字列'.encode('GB2312').
I expect SyntaxError when encode() fails.
(Similar situation: /*/ is an invalid regexp which is also SyntaxError.)
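A sketch of the failure mode under character semantics, using encode at run time since the literal syntax does not exist. The helper error_from is hypothetical and only collects the exception; eval stands in for the parser to show the regexp analogy.

```ruby
# Returns the exception raised by the block, or nil if none was raised.
def error_from
  yield
  nil
rescue ScriptError, StandardError => e
  e
end

# 文字列 can be represented in GB2312, so encode succeeds.
encodable = error_from { "\u6587\u5B57\u5217".encode("GB2312") }
# An emoji has no GB2312 representation, so encode raises.
unencodable = error_from { "\u{1F600}".encode("GB2312") }
# The invalid regexp literal is rejected when the code is parsed.
bad_regexp = error_from { eval("/*/") }

p encodable
p unencodable.class
p bad_regexp.class
```

Under this reading, the second case would be rejected when the file is parsed rather than when the code runs.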
