Bug #680

csv.rb: CSV.parse is too late when encoding is mismatch

Added by Takeyuki FUJIOKA over 6 years ago. Updated almost 4 years ago.

[ruby-core:19465]
Status:Closed
Priority:Normal
Assignee:James Gray
ruby -v: Backport:

Description

=begin
I think this result is correct, but the encoding mismatch error is raised too late.

see:
% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000).force_encoding("shift_jis"))'
ruby19 -rcsv -e 0.30s user 0.02s system 96% cpu 0.330 total

% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken UTF-8 string (ArgumentError)
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'
        from -e:1:in `<main>'
ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))' 1.55s user 2.57s system 90% cpu 4.530 total
=end

sample.csv (97.7 KB) Takeyuki FUJIOKA, 10/24/2008 07:25 PM

History

#1 Updated by Martin Dürst over 6 years ago

=begin
A default for the source encoding has been discussed quite a long
time ago (in some Japanese meetings or on ruby-dev, I don't remember),
and the conclusion was that the source encoding has to be given
(with a magic comment) in the file itself (unless the file is all ASCII).

The reason for this is that the source encoding is a property of the
source, and nothing else. On very simple scripts, it might occasionally
be slightly easier if it were the same as default_external or
default_internal, but this is only the case as long as you stay
in exactly the same environment, and don't move the script.
But scripts grow and move, so it's better to get the settings
right at the start.

However, as far as I remember, the idea was that for -e,
default_external should be used, because that's what one
is using in a shell. I'm not sure why this doesn't work below.
(assuming Takeyuki is working in a Shift_JIS environment,
which isn't completely sure).

Regards, Martin.

At 12:12 08/10/24, Michael Selig wrote:

> Hi,
>
> This bug actually brings up an interesting issue - should the source
> encoding default to something other than UTF-8 (ie: if it is not specified
> in the "magic comment")?
>
> Perhaps it should default to the encoding specified by the user's locale?
> Or perhaps it should default to the value of "default_internal" if it is
> set? Or even default_external?
>
> I suggest that it should default to "default_internal" if that is set, and
> then to the locale encoding if not.
>
> What do others think?
> Having it default to the locale in this case would probably avoid the
> encoding mismatch entirely (and the resulting confusion).
>
> Cheers
> Mike

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#2 Updated by Yukihiro Matsumoto over 6 years ago

=begin
Hi,

In message "Re: Re: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch)"
on Fri, 24 Oct 2008 16:48:04 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:

|The problem I am trying to solve is the compatibility of string literals
|in your source and strings from other sources.
|
|"default_internal" was introduced to try to make all strings the same
|encoding to avoid incompatibilities. But at the moment string literals
|seem to default to the source encoding or to UTF-8 if it is not set
|(please correct me if I am wrong). What I was suggesting was a way to make
|string literals be compatible.

You are correct here.

|This normally isn't a problem if:
|a) All string literals are 7 bit ASCII, or
|b) The source encoding matches "default_internal"
|
|If the source encoding of a program containing non-ascii string literals
|is set different from default_internal, you are asking for trouble, and
|would defeat the purpose of default_internal. Therefore to prevent the
|programmer from having to remember to specify both, it makes sense to me
|that the source encoding should default to default_internal. I think this
|is important.

The point is that when we have source code written in a source
encoding, the literals are naturally encoded in that encoding. So do we
need to convert string literals into the default encoding? But
conversion can bring us more trouble, since it tends to change the
meaning. For example, what does /[X-Y]/ mean, where X and Y are
multibyte characters whose corresponding codepoints (and sorting
order) differ in the converted encoding?

|(By the way, I am not talking about libraries here. As I have stressed
|previously, libraries should be carefully written to either use ASCII
|string literals only, or to make sure that it transcodes them properly.)

That makes me feel much better, so we can limit the issue about the
scripts only.

|Finally, are you suggesting that "-e" should perform differently to a
|single-line ruby script? That seems non-intuitive to me.

-e takes programs from command line shell, which probably yields
strings in locale encoding anyway. But we cannot assume that for
scripts contained in files.

                        matz.

=end

#3 Updated by Takeyuki FUJIOKA over 6 years ago

=begin
Please save the attached file as 'sample.csv'.
The first line of this file contains a Japanese UTF-8 string.
The other lines are US-ASCII. The file has 5001 lines.

% time ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)'
ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 0.23s user 0.01s system 96% cpu 0.254 total

This is OK: very fast.
But:

% time ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)'
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'
        from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'
        from -e:1:in `<main>'
ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 3.93s user 6.38s system 98% cpu 10.457 total

This result is very slow.
I hope the error can be raised as soon as the encoding mismatch is found.

# Sorry, I don't understand M17N's default_external and default_internal behavior.
# I can't reply about M17N's problem.
=end
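
A caller can get the fail-fast behaviour requested above by validating the string before handing it to CSV. The following is only a minimal sketch of that idea, not the fix that was later applied to csv.rb. `parse_csv_checked` is a hypothetical helper name, and the sketch assumes a Ruby that provides `String#valid_encoding?` and `String#b`.

```ruby
require "csv"

# Hypothetical helper (not part of csv.rb): reject strings whose bytes are
# not valid in their tagged encoding before any parsing work starts.
def parse_csv_checked(data)
  unless data.valid_encoding?
    raise ArgumentError, "broken #{data.encoding} string"
  end
  CSV.parse(data)
end

# The bytes from the report: two Shift_JIS characters per row.
raw  = "\x82\xA0,\x82\xA2\n".b * 3
good = raw.dup.force_encoding("Shift_JIS")  # bytes match the tag
bad  = raw.dup.force_encoding("UTF-8")      # bytes do not match the tag

rows = parse_csv_checked(good)              # parses normally
begin
  parse_csv_checked(bad)
rescue ArgumentError => e
  warn e.message                            # raised before CSV sees the data
end
```

The check scans the whole string once, so it costs one pass over the data but avoids starting the parser on input that cannot be valid.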

#4 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Fri, 24 Oct 2008 23:00:27 +0900,
James Gray wrote in :

> I work on TextMate and we use Ruby all over the place inside that
> application. I'm sure we have hundreds of scripts in there. We try
> hard to make sure everything in TextMate is UTF-8, so now we get
> errors out of Ruby 1.9. To fix, we need to add hundreds of magic
> comments and worse, train our users who often write their own
> automations in Ruby why they have to do the same to make their code
> work.
>
> The real issue here is that you can argue the user doesn't even know
> the proper encoding these scripts should be using. Only TextMate
> really knows the encoding it's going to hand-off the data in.

Though I don't know about TextMate at all, ruby-mode.el in 1.9
deals with magic comments automatically.

--
Nobu Nakada

=end

#5 Updated by James Gray over 6 years ago

  • % Done changed from 0 to 100
  • Status changed from Open to Closed

=begin
Applied in changeset r19931.
=end

#6 Updated by James Gray over 6 years ago

  • Assignee set to James Gray

=begin
Thanks for finding the bug in my logic. It should be much faster now:

$ time ruby_dev -Eeuc-jp -rlib/csv -e 'CSV.parse(open("/Users/james/Desktop/sample.csv","r").read)'
/Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)
        from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `init_separators'
        from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1563:in `initialize'
        from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `new'
        from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `parse'
        from -e:1:in `<main>'

real 0m0.053s
user 0m0.039s
sys 0m0.011s

=end

#7 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Sun, 26 Oct 2008 11:25:58 +0900,
Michael Selig wrote in :

> 1)
> My preference would be to always encode string literals constructed with
> "\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you
> really want to use such a literal as an encoded string, you must use
> "force_encoding". I think this would be much clearer and get rid of the
> "ambiguity".
>
> 2)
> My suggestion for "defaulting" the source encoding was an attempt to avoid
> having to do this (but probably not a good way!). It isn't a big deal, and
> I understand the argument that the source encoding is a property of the
> script. My original suggestion (last month) of a special magic comment was
> to have a way of specifying BOTH the default_internal and source encoding
> once, but this idea was rejected.

I'd prefer to default the internal encoding to the source
encoding of the main script.

> 3)
> Perhaps this check could be based on the library's source encoding? If
> this were done, most libraries would have to use a source encoding of
> US-ASCII (or just have no encoding magic comment) not UTF-8, so that
> non-Unicode default_internal's will work. Perhaps Ruby could be smarter,
> and only flag an error if there actually is an incompatible string literal
> in the library?

What about comments? I suspect it might not be a good idea.

> 4)
> Also it means that:
> ruby test.rb
> may perform differently than:
> ruby -e "`cat test.rb`"

Magic comments are effective with -e too.

$ ruby19 -e 'p __ENCODING__'
#<Encoding:EUC-JP>

$ ruby19 -e '#-*- encoding:utf-8 -*-' -e 'p __ENCODING__'
#<Encoding:UTF-8>

Therefore no differences if the file has the magic comment.

--
Nobu Nakada

=end
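
The effect of a magic comment on `__ENCODING__` can also be checked from inside one Ruby process. This is a small sketch, not something from the thread. `source_encoding_with` is a hypothetical helper that writes a throwaway script, loads it, and reports the source encoding the parser chose. The global variable name is likewise made up for the illustration.

```ruby
require "tempfile"

# Hypothetical helper: write a one-off script whose first line is the given
# magic comment, load it, and return the __ENCODING__ it observed.
def source_encoding_with(magic_comment)
  Tempfile.create(["magic_comment_demo", ".rb"]) do |f|
    f.write("#{magic_comment}\n$observed_source_encoding = __ENCODING__\n")
    f.flush
    load f.path
  end
  $observed_source_encoding
end

p source_encoding_with("# -*- encoding: euc-jp -*-")
p source_encoding_with("# -*- encoding: utf-8 -*-")
```

Because the magic comment travels with the script text, the result does not depend on the locale of the process that loads it.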

#8 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Sun, 26 Oct 2008 17:20:17 +0900,
Michael Selig wrote in :

> > I'd prefer to default the internal encoding to the source
> > encoding of the main script.
>
> But then how do you tell Ruby NOT to set "default_internal"?

I think defaulting the internal encoding to something other is
bad.

> It also means that comments must be in the default_internal encoding (see
> your comment below).

I don't follow you here, all comments should be written in the
source encoding. Why would default_internal affect them?

> > Therefore no differences if the file has the magic comment.
>
> That's true, but my point was "why should a simple non-m17n non-ascii ruby
> program have to contain the magic comment"?

Because it is non-ascii. That is definitely reason enough.

--
Nobu Nakada

=end

#9 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 07:28:42 +0900,
Michael Selig wrote in :

> Yes you are right, and I was not suggesting doing that.
> But Matz wants to default default_internal to nil. With your proposal, how
> do you do that and still set the source encoding?

I don't like the idea of setting default_internal from the source
encoding; by "prefer" I meant that it "feels less bad".

> My original suggestion was to use an extended "magic comment" to set both.

But it can't keep the source encoding unset, and
"internal_encoding" has no effect for Emacs.

> Isn't backward compatibility with 1.8 scripts more important?
> You are now forcing anyone with 1.8 scripts containing non-ascii string
> literals to put in a magic comment, otherwise you get "invalid multibyte
> char (US-ASCII)" error in 1.9.

In other words, what you want is the -K option?

--
Nobu Nakada

=end

#10 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 14:48:41 +0900,
Michael Selig wrote in :

> OK, I don't use Emacs, and no one told me that before, thanks! I assumed
> it would work, but I admit I didn't test it.
> Then is there another form of magic comment that can be used - eg:
> "internal encoding: XXXX" or "encoding: XXXX internal" that does work with
> Emacs?

No. Magic comments without -*- markers are for VIM, like
# vim: set encoding=UTF-8
and neither VIM nor Emacs would work with your examples.

> What I am saying is that we need to consider backward compatibility of
> Ruby scripts. James Gray brought up an example with his "Textmate scripts"
> which contain UTF-8 multibyte string literals, which used to work with
> 1.8, but do not in 1.9, because they need either a "magic comment" or, as
> you say "-KU". Either way, 1.9 is not truly backward compatible when it
> comes to simple, non-m17n, non-ascii scripts, because you have to either
> modify the script or add a flag to the ruby options. There must be lots of
> Japanese ruby scripts which will have a similar issue.

Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
sources, so they have had -K in the shebang lines already.

> Defaulting source encoding to locale encoding (like -e does) should fix
> this (as long as the end-user's locale is correct), right?

Yes if they match.

> I guess if necessary James can put "-KU" in the RUBYOPT environment
> variable to save having to add multiple magic comments, but I feel this
> shouldn't be necessary.

-U option would be better.

--
Nobu Nakada

=end

#11 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 15:57:03 +0900,
Michael Selig wrote in :

> > Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
> > sources, so they have had -K in the shebang lines already.
>
> Why then can I write a ruby 1.8 script which does a "puts" of a Shift_JIS
> string (no shebang or magic comment), and have it run fine without -Ks?

Because you are avoiding troublesome chars. Without such
chars, we can't write the words "display", "table", "software"
and "ruby".

> > > I guess if necessary James can put "-KU" in the RUBYOPT environment
> > > variable to save having to add multiple magic comments, but I feel this
> > > shouldn't be necessary.
> >
> > -U option would be better.
>
> I don't think that will work:
>
> t2.rb is a single line script which does a puts of a short UTF-8 multibyte
> string.

Indeed. -U sets only the internal encoding, whereas -Ku also sets the
external and source encodings. Therefore -U isn't a direct
replacement for -Ku.

But it's very ambiguous and dangerous to imply encodings. We
can't trust locale for this purpose, at least.

You can use BOM to mean that the source is written in UTF-8.

--
Nobu Nakada

=end

#12 Updated by Martin Dürst over 6 years ago

=begin
At 07:28 08/10/27, Michael Selig wrote:

> I thought one of your points was that you would like to be able to write
> Japanese (or other non-ascii) comments in a script which is otherwise only ascii
> (which may use "\u" in literals, and want default_internal to be UTF-8).
> This means that the source encoding should be Japanese. Your suggestion of
> defaulting default_internal to the source encoding means that it will be
> set to Japanese. I am not sure that this is always desirable. (This is
> very minor - you can always override it)

I'm not sure what you mean by "Japanese". It's no problem at all
to use UTF-8 to write Japanese. And I guess if somebody uses
\u literals and wants default_internal to be UTF-8, they'll
in most cases use UTF-8 for the source encoding (comments or
whatever else).

If you mean Japanese legacy encodings (such as Shift_JIS and
EUC-JP), then you are correct, but it would be very rare
for somebody to use Shift_JIS or EUC-JP for comments when
the program is otherwise supposed to run all-UTF-8.

> Isn't backward compatibility with 1.8 scripts more important?
> You are now forcing anyone with 1.8 scripts containing non-ascii string
> literals to put in a magic comment, otherwise you get "invalid multibyte
> char (US-ASCII)" error in 1.9.

Well, yes, that's actually the point of it. Wherever necessary,
get everybody to declare their encoding. It may be somewhat suboptimal
in the transition phase, but after that, we know what we're dealing
with.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#13 Updated by Martin Dürst over 6 years ago

=begin
At 14:48 08/10/27, Michael Selig wrote:

> I am not sure why you would want to keep the source encoding unset when
> setting default_internal at the top of a script. Perhaps you could explain.

The simplest case is a script in US-ASCII only, but where you want
the data to be handled e.g. in UTF-8.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#14 Updated by Martin Dürst over 6 years ago

=begin
At 12:24 08/10/27, James Gray wrote:

> They sure could, yeah. Our policy for TextMate development has always
> been that UTF-8 is king. We use it heavily and I'm sure some scripts
> do contain multibyte characters in UTF-8.

Wouldn't it be only these scripts (including those that contain
\x escapes for UTF-8) that need the encoding indication at the top?
(please note that literals with \u escapes are automatically UTF-8).

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
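
The difference between \x and \u escapes mentioned above can be observed directly. The sketch below is an illustration, not code from the thread. `literal_encodings_in_euc_jp_source` is a hypothetical helper that loads a throwaway script declaring EUC-JP as its source encoding and records the encoding of one literal of each kind.

```ruby
require "tempfile"

# Hypothetical helper: load a throwaway script that declares EUC-JP as its
# source encoding and records the encodings of two string literals.
def literal_encodings_in_euc_jp_source
  script = <<~'RUBY'
    # -*- encoding: euc-jp -*-
    $literal_encodings = {
      x_escape: "\xA4\xA2".encoding, # \x bytes are tagged with the source encoding
      u_escape: "\u3042".encoding    # \u escapes yield a UTF-8 string
    }
  RUBY
  Tempfile.create(["literal_demo", ".rb"]) do |f|
    f.write(script)
    f.flush
    load f.path
  end
  $literal_encodings
end

literal_encodings_in_euc_jp_source.each do |kind, enc|
  puts "#{kind}: #{enc}"
end
```

This is why a script that only uses \u escapes can stay free of a magic comment, while one with \x escapes for UTF-8 bytes depends on the declared source encoding.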

#15 Updated by Martin Dürst over 6 years ago

=begin
At 19:17 08/10/27, Michael Selig wrote:

> On Mon, 27 Oct 2008 20:55:32 +1100, Nobuyoshi Nakada nobu@ruby-lang.org
> wrote:
>
> > Hi,
> >
> > At Mon, 27 Oct 2008 15:57:03 +0900,
> > Michael Selig wrote in :
> >
> > > > Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
> > > > sources, so they have had -K in the shebang lines already.
> > >
> > > Why then can I write a ruby 1.8 script which does a "puts" of a
> > > Shift_JIS
> > > string (no shebang or magic comment), and have it run fine without -Ks?
> >
> > Because you are avoiding troublesome chars. Without such
> > chars, we can't write the words "display", "table", "software"
> > and "ruby".
>
> OK, I'm sure you know more about Japanese encodings than I do.

To give you the details, these characters, in Shift_JIS, are
encoded with two bytes, the second of which is the same byte
as e.g. a backslash.

> But my original point is that 1.8 scripts exist which contain multibyte
> characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
> under 1.9 unless a magic comment or -K is provided.

Yes, that's because 1.8 is essentially garbage-in-garbage-out.
If you are careful about certain bytes, you can essentially have
arbitrary byte sequences in your script, and Ruby 1.8 won't complain.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end
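
The byte-level detail described above can be checked with a few lines of Ruby. This is an illustrative sketch. The sample characters (U+8868, the kanji used in the Japanese words for "table" and "display", and U+30BD, the first katakana of the loanword for "software") are chosen here as examples, and `ends_in_backslash_byte?` is a hypothetical helper name.

```ruby
BACKSLASH_BYTE = 0x5C # the ASCII backslash

# True when the character's last Shift_JIS byte is the same byte as "\".
def ends_in_backslash_byte?(char)
  char.encode("Shift_JIS").bytes.last == BACKSLASH_BYTE
end

hyou = "\u8868" # kanji used in the words for "table" and "display"
so   = "\u30BD" # katakana SO, first character of the loanword "software"
a    = "\u3042" # hiragana A, for contrast

[hyou, so, a].each do |ch|
  bytes = ch.encode("Shift_JIS").bytes.map { |b| format("%02X", b) }.join(" ")
  puts "U+#{format('%04X', ch.ord)}: #{bytes}"
end
# prints:
#   U+8868: 95 5C
#   U+30BD: 83 5C
#   U+3042: 82 A0
```

A byte-oriented parser that does not know the text is Shift_JIS sees the trailing 0x5C as an escape character, which is why such sources needed -Ks even in 1.8.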

#16 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 19:17:45 +0900,
Michael Selig wrote in :

> But my original point is that 1.8 scripts exist which contain multibyte
> characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
> under 1.9 unless a magic comment or -K is provided.

It just seemed to be working by chance.

> > But it's very ambiguous and dangerous to imply encodings. We
> > can't trust locale for this purpose, at least.
>
> It's a trade-off between that and backward compatibility. I think the
> "danger" is not high and it gives backward compatibility, so my vote would
> be to use it.

And it will suddenly crash or behave weirdly when moved to other
locales.

Anyway, I think I understand the need to specify the source
encoding without magic comments. Is an option for that
purpose an acceptable solution?

--
Nobu Nakada

=end

#17 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 19:37:58 +0900,
Martin Duerst wrote in :

> If you mean Japanese legacy encodings (such as Shift_JIS and
> EUC-JP), then you are correct, but it would be very rare
> for somebody to use Shift_JIS or EUC-JP for comments when
> the program is otherwise supposed to run all-UTF-8.

I don't do it of course, but know that some people love to do
it.

--
Nobu Nakada

=end

#18 Updated by Yukihiro Matsumoto over 6 years ago

=begin
Hi,

In message "Re: Re: String literal encoding (Was: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch))"
on Tue, 28 Oct 2008 00:12:46 +0900, James Gray <james@grayproductions.net> writes:

|I wasn't aware -KU still worked though, as Michael pointed out. I
|thought for sure I had tried that and got a warning about it being
|ignored now.
|
|It may be that the TextMate team could use that. What all does it set
|in 1.9? Source encoding obviously. It seems to affect
|default_external as well, but not touch default_internal. Do I have
|that right? Does it have any other special effects?

-Ku (or -KU) specifies to

  • default script encoding to be UTF-8
  • default_external encoding to be UTF-8 unless it's specified previously by -E or -U
  • do not touch default_internal

|Will -KU stay supported for the foreseeable future?

Yes.

                        matz.

=end
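
To see what a given set of command-line flags does to these three settings, a script can simply print them. The sketch below is an illustration only. `encoding_report` is a hypothetical helper, and how -K, -E and -U behave in any particular Ruby release should be checked against that release rather than assumed from this 2008 discussion.

```ruby
# Report the encodings discussed in this thread for the current process.
def encoding_report
  {
    source:   __ENCODING__,               # encoding of this source file
    external: Encoding.default_external,  # default for data read from IO
    internal: Encoding.default_internal   # nil unless it has been set
  }
end

encoding_report.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```

Saving this as a file and running it with and without a flag such as -Ku shows which of the three values the flag changes.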

#19 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Mon, 27 Oct 2008 21:07:16 +0900,
Nobuyoshi Nakada wrote in :

> Anyway, I think I understand the needs to specify source
> encoding without magic comments. Is the option for that
> purpose an acceptable solution?

Here is the patch to add options:

--encoding=external:internal:source
--external-encoding=enc
--internal-encoding=enc
--source-encoding=enc

Index: ruby.c
===================================================================
--- ruby.c (revision 20075)
+++ ruby.c (working copy)
@@ -623,5 +623,5 @@ dump_option(const char *str, int len, vo

 static void
-set_internal_encoding_once(struct cmdline_options *opt, const char *e, int elen)
+set_option_encoding_once(const char *type, VALUE *name, const char *e, int elen)
 {
     VALUE ename;
@@ -630,27 +630,16 @@ set_internal_encoding_once(struct cmdlin
     ename = rb_str_new(e, elen);

-    if (opt->intern.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->intern.enc.name) != INT2FIX(0)) {
+    if (*name &&
+        rb_funcall(ename, rb_intern("casecmp"), 1, *name) != INT2FIX(0)) {
         rb_raise(rb_eRuntimeError,
-                 "default_intenal already set to %s", RSTRING_PTR(opt->intern.enc.name));
+                 "%s already set to %s", type, RSTRING_PTR(*name));
     }
-    opt->intern.enc.name = ename;
+    *name = ename;
 }

-static void
-set_external_encoding_once(struct cmdline_options *opt, const char *e, int elen)
-{
-    VALUE ename;
-
-    if (!elen) elen = strlen(e);
-    ename = rb_str_new(e, elen);
-
-    if (opt->ext.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->ext.enc.name) != INT2FIX(0)) {
-        rb_raise(rb_eRuntimeError,
-                 "default_external already set to %s", RSTRING_PTR(opt->ext.enc.name));
-    }
-    opt->ext.enc.name = ename;
-}
+#define set_internal_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)

 static int
@@ -956,13 +945,29 @@ proc_options(int argc, char *argv, stru
                 char *p;
               encoding:
-                p = strchr(s, ':');
-                if (p) {
-                    if (p > s)
-                        set_external_encoding_once(opt, s, p-s);
-                    if (*++p)
-                        set_internal_encoding_once(opt, p, 0);
-                }
-                else
-                    set_external_encoding_once(opt, s, 0);
+                do {
+#                   define set_encoding_part(type) \
+                    if (!(p = strchr(s, ':'))) { \
+                        set_##type##_encoding_once(opt, s, 0); \
+                        break; \
+                    } \
+                    else if (p > s) { \
+                        set_##type##_encoding_once(opt, s, p-s); \
+                    }
+                    set_encoding_part(external);
+                    if (!*(s = ++p)) break;
+                    set_encoding_part(internal);
+                    if (!*(s = ++p)) break;
+                    set_encoding_part(source);
+#                   undef set_encoding_part
+                } while (0);
+            }
+            else if (is_option_with_arg("internal-encoding", Qfalse, Qtrue)) {
+                set_internal_encoding_once(opt, s, 0);
+            }
+            else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
+                set_external_encoding_once(opt, s, 0);
+            }
+            else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
+                set_source_encoding_once(opt, s, 0);
             }
             else if (strcmp("version", s) == 0) {

--
Nobu Nakada

=end
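
The logic of the patch above (split `external:internal:source` on colons, skip empty parts, and refuse to set the same slot twice to a different value) can be sketched in Ruby for readers who do not want to trace the C macros. This is only an illustration of the idea under those assumptions. `EncodingOptions` and its methods are hypothetical names, and nothing here is Ruby's actual option parser.

```ruby
# Hypothetical Ruby rendering of the patch's idea; not Ruby's option parser.
class EncodingOptions
  PARTS = [:external, :internal, :source].freeze

  def initialize
    @names = {}
  end

  # Counterpart of set_option_encoding_once: setting the same slot again is
  # fine if the name matches case-insensitively, otherwise it is an error.
  def set_once(type, name)
    current = @names[type]
    if current && current.casecmp(name) != 0
      raise RuntimeError, "#{type} already set to #{current}"
    end
    @names[type] = name
  end

  # Parse "external:internal:source"; empty parts leave their slot unset.
  def parse(spec)
    spec.split(":", -1).first(PARTS.size).each_with_index do |name, i|
      set_once(PARTS[i], name) unless name.empty?
    end
    self
  end

  def [](type)
    @names[type]
  end
end

opts = EncodingOptions.new.parse("utf-8::euc-jp")
puts opts[:external].inspect # the first part
puts opts[:internal].inspect # nil, because the middle part was empty
puts opts[:source].inspect   # the third part
```

Keeping one generic `set_once` with the slot as a parameter mirrors the patch's move from two near-identical C functions to a single function plus thin macros.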

#20 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Fri, 31 Oct 2008 18:38:24 +0900,
Nobuyoshi Nakada wrote in :

> +#define set_internal_encoding_once(opt, e, elen) \
> +    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
> +#define set_external_encoding_once(opt, e, elen) \
> +    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)

Sorry, missed these 2 lines.

#define set_source_encoding_once(opt, e, elen) \
    set_option_encoding_once("source", &opt->src.enc.name, e, elen)

--
Nobu Nakada

=end

#21 Updated by Martin Dürst over 6 years ago

=begin
At 18:38 08/10/31, Nobuyoshi Nakada wrote:

> Hi,
>
> At Mon, 27 Oct 2008 21:07:16 +0900,
> Nobuyoshi Nakada wrote in :
>
> > Anyway, I think I understand the needs to specify source
> > encoding without magic comments. Is the option for that
> > purpose an acceptable solution?
>
> Here is the patch to add options:

Great work!

> --encoding=external:internal:source
> --external-encoding=enc
> --internal-encoding=enc
> --source-encoding=enc

I personally don't like the last one, and the :source in the first
one, but I guess there are situations where they can be very helpful
(e.g. testing with different encodings).

I also think that it would be good to have the values of --encoding
and -E look/work the same, so unless :source already works on -E,
I think having just --source-encoding for the case that the
source encoding must be set by an option should be okay.
This will also make it easier to distinguish in documentation
that --source-encoding is really only for very special occasions,
and declaring the source encoding in the script itself is strongly
preferred.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

=end

#22 Updated by Nobuyoshi Nakada over 6 years ago

=begin
Hi,

At Fri, 31 Oct 2008 19:05:25 +0900,
Martin Duerst wrote in :

> > --encoding=external:internal:source
> > --external-encoding=enc
> > --internal-encoding=enc
> > --source-encoding=enc
>
> I personally don't like the last one, and the :source in the first
> one, but I guess there are situations where they can be very helpful
> (e.g. testing with different encodings).
>
> I also think that it would be good to have the values of --encoding
> and -E look/work the same, so unless :source already works on -E,
> I think having just --source-encoding for the case that the
> source encoding must be set by an option should be okay.

-E is equivalent to --encoding.

> This will also make it easier to distinguish in documentation
> that --source-encoding is really only for very special occasions,
> and declaring the source encoding in the script itself is strongly
> preferred.

Since these four options are separate, it's easy to remove
some of them.

--
Nobu Nakada

=end
