https://bugs.ruby-lang.org/https://bugs.ruby-lang.org/favicon.ico?17113305112009-02-03T10:42:49ZRuby Issue Tracking SystemRuby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=29502009-02-03T10:42:49Zshyouhei (Shyouhei Urabe)shyouhei@ruby-lang.org
<ul><li><strong>Assignee</strong> set to <i>akr (Akira Tanaka)</i></li></ul><p>=begin</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=29522009-02-03T10:44:15Zakr (Akira Tanaka)akr@fsij.org
<ul></ul><p>=begin<br>
What are the use cases?<br>
=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30452009-02-03T22:21:16Zradarek (Radosław Bułat)
<ul></ul><p>=begin<br>
Reading N characters from a stream is (at least for me) as natural as reading N bytes. The use cases are almost the same as for bytes, except that you want to operate on characters. For example, you have a file encoded in UTF-8 and want to read the first 10 characters.<br>
=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30492009-02-03T23:06:41ZJEG2 (James Gray)jeg2@ruby-lang.org
<ul></ul><p>=begin<br>
I needed this in the standard CSV library.</p>
<p>My use case was that I peek ahead in the stream to determine what kind of line endings it has. I just grab a block of characters and see if I find any standard line endings in there. This was pretty challenging in Ruby 1.9, because reading a fixed number of bytes had a good chance of stopping in the middle of a multibyte character, leaving me with invalid data. Then, when I hit it with a Regexp to find the line endings, an Exception is raised.</p>
<p>I'm using this code to get around that problem in CSV:</p>
<pre><code>#
# Reads at least +bytes+ from @io, but will read up to 10 bytes ahead if
# needed to ensure the data read is valid in the encoding of that data. This
# should ensure that it is safe to use regular expressions on the read data,
# unless it is actually a broken encoding. The read data will be returned in
# @encoding.
#
def read_to_char(bytes)
  return "" if @io.eof?
  data = @io.read(bytes)
  begin
    encoded = encode_str(data)
    raise unless encoded.valid_encoding?
    return encoded
  rescue  # encoding error or my invalid data raise
    if @io.eof? or data.size &gt;= bytes + 10
      return data
    else
      data += @io.read(1) until data.valid_encoding? or
                                @io.eof?             or
                                data.size &gt;= bytes + 10
      retry
    end
  end
end
</code></pre>
<p>That worked for CSV, where I just need some characters and don't have to have an exact count. If you do need an exact count though, the code gets more complicated.</p>
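<p>For the exact-count case, a minimal sketch follows. It assumes a UTF-8 stream and uses a hypothetical helper named <code>read_chars</code>, which is not part of Ruby's IO API. It simply calls <code>getc</code> in a loop, so it trades speed for simplicity.</p>

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby's IO API): read up to +n+ whole
# characters from +io+, stopping early at end of file.
def read_chars(io, n)
  result = +""
  n.times do
    char = io.getc        # getc returns one character, or nil at EOF
    break if char.nil?
    result << char
  end
  result
end

# Sample data invented for illustration.
io = StringIO.new("naïve café, 日本語のテキスト")
first = read_chars(io, 10)   # exactly 10 characters
rest  = io.read              # whatever follows them
```

<p>Here <code>first</code> holds 10 characters (12 bytes in UTF-8), and the stream is left positioned at the 11th character.</p>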
<p>I agree that this is something Ruby should do for us.</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30522009-02-04T00:47:15Zradarek (Radosław Bułat)
<ul></ul><p>=begin<br>
I also wonder about byte-oriented IO#seek, in case someone wants<br>
a character-oriented seek. It looks like a byte-oriented seek is useless<br>
in a multibyte character stream, because it could jump to a bad<br>
position (into the middle of a character's bytes).</p>
<p>--<br>
Regards</p>
<p>Radosław Bułat<br>
<a href="http://radarek.jogger.pl" class="external">http://radarek.jogger.pl</a> - my blog</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30592009-02-04T08:27:54Zmike (Michael Selig)michael_selig@fs.com.au
<ul></ul><p>=begin<br>
I have a simple use-case:</p>
<p>Existing datafile has fixed length records, currently single-byte chars. I want to convert the application (which is quite old) to support multi-byte characters, but I don't want to have to go to the trouble of changing to variable-length or delimited records/fields. I would like to be able to read each record (whose length in chars I know) with one operation, instead of looping through each character.</p>
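<p>A rough sketch of this record-by-record reading, assuming sequential access, UTF-8 data, and an illustrative record length of 3 characters (<code>RECORD_CHARS</code> is a made-up name). It still walks the stream character by character internally, via <code>each_char</code>, but the caller sees whole records.</p>

```ruby
require "stringio"

RECORD_CHARS = 3  # assumed record length in characters, for illustration only

# Three records of 3 characters each: 1-, 2- and 3-byte characters in UTF-8.
data = StringIO.new("abcäöü日本語")

# Group the character stream into fixed-size slices and rebuild each record.
records = data.each_char.each_slice(RECORD_CHARS).map(&:join)
sizes   = records.map(&:bytesize)
```

<p>Every record has the same length in characters, while the byte sizes differ from record to record.</p>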
<p>Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!<br>
=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30602009-02-04T09:34:20Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
On 04/02/2009, Michael Selig <a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a> wrote:</p>
<blockquote>
<p>Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!</p>
</blockquote>
<p>For UTF-8/UTF-16/SJIS/EUC-JP/BIG5 .. no. UTF-32 is dword aligned but<br>
you cannot tell what byte ordering it uses reliably. Bad thing. Seeks<br>
are probably not for text files or only for text files you have parsed<br>
already so you know where you are seeking.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=30612009-02-04T09:38:54Zmatz (Yukihiro Matsumoto)matz@ruby.or.jp
<ul></ul><p>=begin<br>
Hi,</p>
<p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/21817">[ruby-core:21817]</a> Re: [Feature <a class="issue tracker-2 status-6 priority-4 priority-default closed" title="Feature: Should be an easy way of reading N characters from am I/O stream (Rejected)" href="https://bugs.ruby-lang.org/issues/908">#908</a>] Should be an easy way of reading N characters from am I/O stream"<br>
on Wed, 4 Feb 2009 09:33:44 +0900, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a> writes:</p>
<p>|> Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!<br>
|<br>
|For UTF-8/UTF-16/SJIS/EUC-JP/BIG5 .. no. UTF-32 is dword aligned but<br>
|you cannot tell what byte ordering it uses reliably. Bad thing. Seeks<br>
|are probably not for text files or only for text files you have parsed<br>
|already so you know where you are seeking.</p>
<p>Right. Hence I reject the character based seek. Thank you.</p>
<p>Regarding the original N character read, I am positive, but I<br>
haven't decided on the API yet.</p>
<pre><code> matz.
</code></pre>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32242009-02-15T10:11:28Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/14 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p>
<blockquote>
<p>In article <a href="mailto:op.uotab6oa9245dp@kool" class="email">op.uotab6oa9245dp@kool</a>,<br>
"Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p>
<blockquote>
<p>That's right - These files are quite small, and I only need to do<br>
sequential I/O. I want to keep the format backward compatible when using a<br>
single-byte encoding.</p>
</blockquote>
<p>Would you show an example of such a format?</p>
<p>I can't imagine a fixed-length field for which a single-byte<br>
encoding (US-ASCII) is usable and a multibyte encoding is also<br>
useful.</p>
<p>For example, zip code or some fixed numbering system is<br>
fixed length but multibyte encoding is not useful.</p>
</blockquote>
<p>Let's make it more general - what about the first N characters or first N lines?</p>
<p>I'm sure you can understand this is useful.</p>
<p>How does the lines() Enumerator interact with the IO?</p>
<p>If a method like head(N) was implemented on it would it leave the IO<br>
pointing to the text after the first N records, be it chars, lines, or<br>
anything else?</p>
<p>Can that Enumerator be created so that it starts enumerating at the<br>
current file position?</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32372009-02-16T19:53:03Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/15 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p>
<blockquote>
<p>In article <a href="mailto:a5d587fb0902141711q780f0d24jef9be9b8bbe69b2a@mail.gmail.com" class="email">a5d587fb0902141711q780f0d24jef9be9b8bbe69b2a@mail.gmail.com</a>,<br>
Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a> writes:</p>
<blockquote>
<blockquote>
<p>For example, zip code or some fixed numbering system is<br>
fixed length but multibyte encoding is not useful.</p>
</blockquote>
<p>Let's make it more general - what about the first N characters or first N lines?</p>
<p>I'm sure you can understand this is useful.</p>
</blockquote>
<p>I think I don't understand the usefulness until an actual<br>
example is shown.</p>
<blockquote>
<p>If a method like head(N) was implemented on it would it leave the IO<br>
pointing to the text after the first N records, be it chars, lines, or<br>
anything else?</p>
</blockquote>
<p>What is represented by the N chars?</p>
</blockquote>
<p>I don't understand the question. N chars are N chars, they do not<br>
represent anything else.</p>
<p>It's actually not that hard except the synchronization is not perfect.<br>
By using chars and then lines I lost "F"</p>
<pre><code>irb(main):001:0&gt; f=File.open "rom.asm"
=&gt; #&lt;File:rom.asm&gt;
irb(main):002:0&gt; f.chars.take(10)
=&gt; ["0", "0", "0", "0", "0", "0", "0", "0", " ", " "]
irb(main):003:0&gt; f.lines.take(3)
=&gt; ["A cli\n", "00000001 FC cld\n",
"00000002 66670F0115000000 o32 lgdt [dword 0x0]\n"]
irb(main):004:0&gt; f.seek(0)
=&gt; 0
irb(main):005:0&gt; f.lines.take(1)
=&gt; ["00000000 FA cli\n"]
</code></pre>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32562009-02-19T00:21:43Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/18 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p>
<blockquote>
<p>In article <a href="mailto:a5d587fb0902160252u56b50cfdv8e0fd36bb4f0b1b3@mail.gmail.com" class="email">a5d587fb0902160252u56b50cfdv8e0fd36bb4f0b1b3@mail.gmail.com</a>,<br>
Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a> writes:</p>
<blockquote>
<blockquote>
<p>What is represented by the N chars?</p>
</blockquote>
<p>I don't understand the question. N chars are N chars, they do not<br>
represent anything else.</p>
</blockquote>
<p>I expect something like person's name, zip code, etc.</p>
<p>However, person's name is variable length.</p>
<p>The zip code (in Japan) is fixed length but multibyte<br>
encoding is not useful because it uses only digits.</p>
</blockquote>
<p>As was explained by the original poster there are file formats similar<br>
to CSV that use fixed field length instead of separators. I have<br>
myself used such files, and they were in 8-bit fixed width encoding.</p>
<p>However, if you want to "upgrade" your code that uses such files to<br>
multibyte for international support you need reading N characters.</p>
<p>Of course, the alternative is to change your code to use a different<br>
format. This might make exports to and imports from legacy applications<br>
hard, however.</p>
<p>Sure, the export can never be perfect if the files really contain<br>
internationalized data because recoding to the legacy format and<br>
encoding loses some information then.</p>
<blockquote>
<p>I'm not sure the usage of the method for "reading N<br>
characters".</p>
</blockquote>
<p>Yes, reading N characters does not seem very useful outside of very<br>
specialized scenarios. Most sane file formats use string length in<br>
bytes or separators.</p>
<p>However, reading N characters, lines, or any other units for which you<br>
have an IO enumerator seems useful to me.</p>
<p>Actually reading N lines using the correct line separator would fetch<br>
N records from the file without the need to construct a loop for that<br>
(or repeat the method for reading a line N times).</p>
<blockquote>
<blockquote>
<p>It's actually not that hard except the synchronization is not perfect.<br>
By using chars and then lines I lost "F"</p>
</blockquote>
<p>I guess enumerator uses lookahead.</p>
</blockquote>
<p>That's unfortunate for using the Enumerator with other methods for<br>
reading files.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32712009-02-19T19:55:38Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/19 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p>
<blockquote>
<p>In article <a href="mailto:op.upklh9q19245dp@kool" class="email">op.upklh9q19245dp@kool</a>,<br>
"Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p>
</blockquote>
<blockquote>
<blockquote>
<p>Also it seems to me that the current usage of the "limit" parameter of<br>
IO#gets is not intuitive in 1.9. It is "maximum number of bytes, but don't<br>
split a character", and I think it should be changed to mean "maximum<br>
number of chars". That would be much more obvious, more useful (IMHO), and<br>
still be backward compatible with 1.8.</p>
</blockquote>
<p>It is introduced for security reason. bytes are more stable<br>
than characters.</p>
</blockquote>
<p>However, the security would be served as well by a character limit.</p>
<p>As I understand it this limit is introduced so that a gets does not<br>
read several gigabytes of data at once in case there is no line<br>
separator.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32722009-02-19T20:01:49Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/19 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p>
<blockquote>
<p>In article <a href="mailto:op.upklh9q19245dp@kool" class="email">op.upklh9q19245dp@kool</a>,<br>
"Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p>
<blockquote>
<p>In more detail: I have a legacy system that uses fixed length fields. Yes,<br>
a name is variable length, but some old systems use a fixed length field,<br>
say 40 chars, which is space filled on the right (or truncated). In my<br>
case, the data input is by a form, and each field is fixed width. I am<br>
changing the system so that the SAME forms can be used, but extended to<br>
use UTF-8 not just ASCII. So this means that the number of characters is<br>
still fixed, but the number of bytes is no longer fixed. I do <em>not</em> want<br>
to change the format of the file (though it probably should be, but that<br>
would be a lot more work), because I want the application to be backward<br>
compatible (when using ASCII data).</p>
</blockquote>
<p>This is what I'd like to hear. Thank you for explanation.</p>
<p>It seems the number, 40, is a number for "big enough for<br>
names".</p>
<p>Why don't you use 40 bytes data format, both with Ruby 1.8<br>
and 1.9?</p>
<p>Do you think that 40 bytes is not big enough for names in<br>
some country?</p>
<p>If the data format uses 40 bytes, instead of 40 chars,<br>
it is easy to read it in Ruby 1.8, even if it contains UTF-8<br>
chars.</p>
</blockquote>
<p>While this might ease working with the file data it might make<br>
designing the form more challenging.</p>
<p>The things to consider:</p>
<ul>
<li>checking for byte length rather than character length in something<br>
like JavaScript (probably possible)</li>
<li>explaining the length limit to the user of the application (I would<br>
not want to do that)</li>
<li>making sure that 40 bytes is long enough for names in languages<br>
that use exotic characters</li>
</ul>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32822009-02-20T13:51:25Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 19:59 09/02/19, Michael Selig wrote:</p>
<blockquote>
<p>Also there are reports reading the data which expect the data to be 40<br>
characters wide. If it wasn't 40 chars, the formatting of the report may<br>
screw up.</p>
</blockquote>
<p>Hello Michael,</p>
<p>In general, I agree that being able to work with character numbers<br>
is desirable. The implementation isn't exactly easy, but I hope<br>
eventually we will get there. My current guess is that this might<br>
mean that we have to move IO and related stuff a bit more towards<br>
a model with classes stacked on top of each other. But that's just<br>
a guess.</p>
<p>But regarding your point of format screwup, measuring things in<br>
characters won't help. Assuming that each character has the same<br>
width just doesn't carry very far if you look at all the scripts<br>
around the world.</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=32832009-02-20T13:51:37Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 19:00 09/02/19, Tanaka Akira wrote:</p>
<blockquote>
<p>It seems the number, 40, is a number for "big enough for<br>
names".</p>
<p>Why don't you use 40 bytes data format, both with Ruby 1.8<br>
and 1.9?</p>
<p>Do you think that 40 bytes is not big enough for names in<br>
some country?</p>
</blockquote>
<p>Very much so. A typical example would be Georgia, where<br>
many names are as long as some of the longer ones in<br>
Europe, but they require 3 bytes per character.</p>
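<p>The arithmetic can be checked quickly. The sketch below assumes UTF-8 and uses the Georgian word for Georgia as a stand-in for a name; the 40-byte field width is the figure from the discussion.</p>

```ruby
# Stand-in for a Georgian name, chosen for illustration.
name = "საქართველო"

chars = name.length      # number of characters
bytes = name.bytesize    # number of bytes in UTF-8

bytes_per_char = bytes / chars

# How many such characters fit into a 40-byte field?
max_chars_in_40_bytes = 40 / bytes_per_char
```

<p>With three bytes per letter, a 40-byte field holds only 13 Georgian letters, far fewer than a 40-character field.</p>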
<blockquote>
<blockquote>
<p>Also it seems to me that the current usage of the "limit" parameter of<br>
IO#gets is not intuitive in 1.9. It is "maximum number of bytes, but don't<br>
split a character", and I think it should be changed to mean "maximum<br>
number of chars". That would be much more obvious, more useful (IMHO), and<br>
still be backward compatible with 1.8.</p>
</blockquote>
<p>It is introduced for security reason. bytes are more stable<br>
than characters.</p>
</blockquote>
<p>Can you give more specific explanations of why reading a number<br>
of characters might not be secure?</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33152009-02-23T17:34:39Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 01:00 09/02/23, Tanaka Akira wrote:</p>
<blockquote>
<p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br>
Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p>
</blockquote>
<blockquote>
<blockquote>
<p>Can you give more specific explanations of why reading a number<br>
of characters might not be secure?</p>
</blockquote>
<p>I considered ISO-2022-JP, Unicode combining characters and<br>
Punycode.</p>
<p>In these encodings, fixed number of characters doesn't limit<br>
the number of bytes.</p>
</blockquote>
<p>Why do you think there is a need to limit the number of bytes?<br>
In general, that's not how Ruby works, at least not as far as<br>
I understand.</p>
<p>Regards, Martin.</p>
<blockquote>
<p>However they may not cause problem now because Ruby doesn't<br>
support combining characters, etc. But Ruby's encoding<br>
system is extensible. It is possible to define an encoding<br>
which makes the character-wise limit insecure.</p>
<p>Tanaka Akira</p>
</blockquote>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33162009-02-23T19:02:59Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/22 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p>
<blockquote>
<p>On Mon, 23 Feb 2009 03:00:41 +1100, Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a> wrote:</p>
<blockquote>
<p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br>
Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p>
<blockquote>
<p>Can you give more specific explanations of why reading a number<br>
of characters might not be secure?</p>
</blockquote>
<p>I considered ISO-2022-JP, Unicode combining characters and<br>
Punycode.</p>
<p>In these encodings, fixed number of characters doesn't limit<br>
the number of bytes.</p>
</blockquote>
<p>Sure, but how does that make it "insecure"?</p>
<blockquote>
<p>However they may not cause problem now because Ruby doesn't<br>
support combining characters, etc. But Ruby's encoding<br>
system is extensible. It is possible to define an encoding<br>
which makes the character-wise limit insecure.</p>
</blockquote>
<p>Sorry, I do not really understand what you mean by the word "insecure".<br>
Perhaps you could explain what you mean in more detail.<br>
Also I still do not understand why you say the character limit might be<br>
"insecure". Can you give an example, please?</p>
</blockquote>
<p>Theoretically if separate (combining) character accents are considered<br>
part of the character then a character might be quite long - I guess<br>
about ten codepoints which can be themselves up to six bytes. However,<br>
the number of accents one can put together should be limited to<br>
meaningful combinations so this should still be secure - as long as<br>
the code which determines what is a valid character does not have<br>
bugs. This might be tricky in some cases, though.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33172009-02-23T19:36:17Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/23 Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a>:</p>
<blockquote>
<p>2009/2/22 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p>
<blockquote>
<p>On Mon, 23 Feb 2009 03:00:41 +1100, Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a> wrote:</p>
<blockquote>
<p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br>
Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p>
<blockquote>
<p>Can you give more specific explanations of why reading a number<br>
of characters might not be secure?</p>
</blockquote>
<p>I considered ISO-2022-JP, Unicode combining characters and<br>
Punycode.</p>
<p>In these encodings, fixed number of characters doesn't limit<br>
the number of bytes.</p>
</blockquote>
<p>Sure, but how does that make it "insecure"?</p>
<blockquote>
<p>However they may not cause problem now because Ruby doesn't<br>
support combining characters, etc. But Ruby's encoding<br>
system is extensible. It is possible to define an encoding<br>
which makes the character-wise limit insecure.</p>
</blockquote>
<p>Sorry, I do not really understand what you mean by the word "insecure".<br>
Perhaps you could explain what you mean in more detail.<br>
Also I still do not understand why you say the character limit might be<br>
"insecure". Can you give an example, please?</p>
</blockquote>
<p>Theoretically if separate (combining) character accents are considered<br>
part of the character then a character might be quite long - I guess<br>
about ten codepoints which can be themselves up to six bytes. However,<br>
the number of accents one can put together should be limited to<br>
meaningful combinations so this should still be secure - as long as<br>
the code which determines what is a valid character does not have<br>
bugs. This might be tricky in some cases, though.</p>
</blockquote>
<p>BTW the same goes for "reading N bytes up to a character boundary"<br>
unless you are willing to accept that you might read nothing even if<br>
data is available because the character did not fit into N bytes, even<br>
for a large N.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33232009-02-24T10:07:56Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 19:02 09/02/23, Michal Suchanek wrote:</p>
<blockquote>
<p>Theoretically if separate (combining) character accents are considered<br>
part of the character then a character might be quite long - I guess<br>
about ten codepoints</p>
</blockquote>
<p>For actual real-life examples, much less than that.</p>
<p>For Indic grapheme clusters (which use mostly base characters,<br>
not combining characters), the number can indeed get to around 10.</p>
<p>In theory, there is no limitation as to how many combining characters<br>
can follow a base character.</p>
<blockquote>
<p>which can be themselves up to six bytes.</p>
</blockquote>
<p>No, just up to four.</p>
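<p>Both points can be illustrated with a short sketch, assuming UTF-8 strings and Ruby 2.5 or later for <code>String#grapheme_clusters</code>. The count of 100 accents is arbitrary.</p>

```ruby
# One base letter followed by many combining acute accents (U+0301).
# The count of 100 is arbitrary, chosen for illustration.
cluster = "a" + "\u0301" * 100

codepoints = cluster.length                  # number of code points
graphemes  = cluster.grapheme_clusters.size  # user-perceived characters
bytes      = cluster.bytesize                # bytes in UTF-8

# The highest Unicode code point, U+10FFFF, encoded in UTF-8.
largest_bytes = "\u{10FFFF}".bytesize
```

<p>A single user-perceived character can therefore span an unbounded number of bytes, even though each individual code point needs at most four bytes in UTF-8.</p>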
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33352009-02-24T22:21:58Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/24 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p>
<blockquote>
<p>On Mon, 23 Feb 2009 21:35:30 +1100, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a><br>
wrote:</p>
<blockquote>
<blockquote>
<p>Theoretically if separate (combining) character accents are considered<br>
part of the character then a character might be quite long - I guess<br>
about ten codepoints which can be themselves up to six bytes. However,<br>
the number of accents one can put together should be limited to<br>
meaningful combinations so this should still be secure - as long as<br>
the code which determines what is a valid character does not have<br>
bugs. This might be tricky in some cases, though.</p>
</blockquote>
<p>BTW the same goes for "reading N bytes up to a character boundary"<br>
unless you are willing to accept that you might read nothing even if<br>
data is available because the character did not fit into N bytes, even<br>
for a large N.</p>
</blockquote>
<p>The current behaviour of IO#gets "limit" parameter is "read N bytes but<br>
round <em>UP</em> to the next character boundary". Therefore you may get more bytes<br>
returned than requested. As long as "limit" is 1 or more, you should always<br>
read something unless the file is at EOF, right?</p>
<p>However I still do not understand why reading N characters (instead of the<br>
current "limit" implementation) might be described as being "insecure". Can<br>
someone please explain it? It is clear that the string returned may be<br>
larger than N bytes long, but why is that "insecure"?</p>
</blockquote>
<p>The current gets is not any more secure since it rounds up. If you<br>
found a bug in an encoding that could give you an infinite character<br>
either version would break.</p>
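<p>The rounding-up behaviour is easy to observe. This is a small sketch, assuming a UTF-8 file of 2-byte Cyrillic letters; the file name is made up for illustration.</p>

```ruby
require "tmpdir"

piece = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "limit-demo.txt")   # hypothetical file name
  File.binwrite(path, "тест\n")             # each Cyrillic letter is 2 bytes in UTF-8

  # Ask for at most 1 byte; gets does not split the first character.
  piece = File.open(path, "r:UTF-8") { |io| io.gets(1) }
end

piece_chars = piece.length
piece_bytes = piece.bytesize
```

<p>The call asks for one byte but returns one whole character, which occupies two bytes.</p>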
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=33482009-02-25T22:54:03Zhramrach (Michal Suchanek)
<ul></ul><p>=begin<br>
2009/2/24 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p>
<blockquote>
<p>Hi Michal,</p>
<p>On Wed, 25 Feb 2009 00:20:52 +1100, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a><br>
wrote:</p>
<blockquote>
<blockquote>
<blockquote>
<p>BTW the same goes for "reading N bytes up to a character boundary"<br>
unless you are willing to accept that you might read nothing even if<br>
data is available because the character did not fit into N bytes, even<br>
for a large N.</p>
</blockquote>
<p>The current behaviour of IO#gets "limit" parameter is "read N bytes but<br>
round <em>UP</em> to the next character boundary". Therefore you may get more<br>
bytes<br>
returned than requested. As long as "limit" is 1 or more, you should<br>
always<br>
read something unless the file is at EOF, right?</p>
<p>However I still do not understand why reading N characters (instead of<br>
the<br>
current "limit" implementation) might be described as being "insecure".<br>
Can<br>
someone please explain it? It is clear that the string returned may be<br>
larger than N bytes long, but why is that "insecure"?</p>
</blockquote>
<p>The current gets is not any more secure since it rounds up. If you<br>
found a bug in an encoding that could give you an infinite character<br>
either version would break.</p>
</blockquote>
<p>Actually rounding <em>up</em> to the character boundary <em>is</em> more secure. You<br>
pointed it out yourself! If it rounded down, gets could return an empty<br>
string, and code like:</p>
<pre><code>while s = f.gets(1) .....
</code></pre>
<p>would then go into an infinite loop.</p>
</blockquote>
<p>Then either way is insecure because you can get an infinite loop with<br>
zero read (unless zero read returned nil or threw an exception) and<br>
potentially infinite memory requirement with a broken encoding and<br>
rounding up.</p>
<p>Thanks</p>
<p>Michal</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=98362010-04-04T01:08:28Zznz (Kazuhiro NISHIYAMA)
<ul><li><strong>Category</strong> set to <i>core</i></li><li><strong>Target version</strong> set to <i>2.0.0</i></li></ul><p>=begin</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=133922010-09-14T16:47:02Zshyouhei (Shyouhei Urabe)shyouhei@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li></ul><p>=begin</p>
<p>=end</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=237622012-02-13T21:05:48Zmame (Yusuke Endoh)mame@ruby-lang.org
<ul></ul><p>Are there any volunteers to summarize the discussion?<br>
I cannot understand the whole discussion about this ticket.<br>
Redmine failed to capture some mails.</p>
<p>--<br>
Yusuke Endoh <a href="mailto:mame@tsg.ne.jp" class="email">mame@tsg.ne.jp</a></p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=316252012-10-27T04:41:45Zko1 (Koichi Sasada)
<ul><li><strong>Target version</strong> changed from <i>2.0.0</i> to <i>2.6</i></li></ul><p>I changed the target of this ticket to "next minor" because there has been no response here.</p> Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O streamhttps://bugs.ruby-lang.org/issues/908?journal_id=673612017-10-19T13:03:17Zmame (Yusuke Endoh)mame@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Rejected</i></li></ul><p>I'm rejecting this issue since it has been stalled for five years. If anyone really needs it, it would be good to re-organize the discussion all over again.</p>