Ruby master - Feature #908: Should be an easy way of reading N characters from am I/O stream</h1> <article> <p>2009-02-03T10:42:49Z</p> <ul><li><strong>Assignee</strong> set to <i>akr (Akira Tanaka)</i></li></ul> </article> <article> <p>2009-02-03T10:44:15Z</p> <p>What are the use cases?</p> </article> <article> <p>2009-02-03T22:21:16Z</p> <p>Reading N characters from a stream is (at least for me) as natural as reading N bytes. The use cases are almost the same as for bytes, except that you want to operate on characters. For example, you have a file encoded in UTF-8 and want to read the first 10 characters.</p> </article> <article> <p>2009-02-03T23:06:41Z</p> <p>I needed this in the standard CSV library.</p> <p>My use case was that I peek ahead in the stream to determine what kind of line endings it has. I just grab a block of characters and see if I find any standard line endings in there. This was pretty challenging in Ruby 1.9, because just reading some bytes meant there was a good chance of picking up invalid data. 
Then, when I hit it with a Regexp to find the line endings, an Exception is raised.</p> <p>I'm using this code to get around that problem in CSV:</p> <pre><code>#
# Reads at least +bytes+ from <tt>@io</tt>, but will read up to 10 bytes ahead if
# needed to ensure the data read is valid in the encoding of that data. This
# should ensure that it is safe to use regular expressions on the read data,
# unless it is actually a broken encoding. The read data will be returned in
# <tt>@encoding</tt>.
#
def read_to_char(bytes)
  return "" if @io.eof?
  data = @io.read(bytes)
  begin
    encoded = encode_str(data)
    raise unless encoded.valid_encoding?
    return encoded
  rescue # encoding error or my invalid data raise
    if @io.eof? or data.size >= bytes + 10
      return data
    else
      data += @io.read(1) until data.valid_encoding? or
                                @io.eof? or
                                data.size >= bytes + 10
      retry
    end
  end
end
</code></pre> <p>That worked for CSV, where I just need some characters and don't have to have an exact count. 
If you do need an exact count though, the code gets more complicated.</p> <p>I agree that this is something Ruby should do for us.</p> </article> <article> <p>2009-02-04T00:47:15Z</p> <p>I also wonder about byte-oriented IO#seek, in case someone wants to have<br> character-oriented seek. It looks like byte-oriented seek is useless<br> in a multibyte character-oriented stream because it could jump to a bad<br> position (in the middle of a character's bytes).</p> <p>--<br> Regards</p> <p>Radosław Bułat<br> <a href="http://radarek.jogger.pl" class="external">http://radarek.jogger.pl</a> - my blog</p> </article> <article> <p>2009-02-04T08:27:54Z</p> <p>I have a simple use case:</p> <p>An existing data file has fixed-length records, currently with single-byte chars. I want to convert the application (which is quite old) to support multibyte characters, but I don't want to have to go to the trouble of changing to variable-length or delimited records/fields. I would like to be able to read each record (whose length in chars I know) with one operation, instead of looping through each character.</p> <p>Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!</p> </article> <article> <p>2009-02-04T09:34:20Z</p> <p>On 04/02/2009, Michael Selig <a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a> wrote:</p> <blockquote> <p>Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!</p> </blockquote> <p>For UTF-8/UTF-16/SJIS/EUC-JP/BIG5 .. no. 
UTF-32 is dword aligned but<br> you cannot tell what byte ordering it uses reliably. Bad thing. Seeks<br> are probably not for text files or only for text files you have parsed<br> already so you know where you are seeking.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-04T09:38:54Z</p> <p>Hi,</p> <p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/21817">[ruby-core:21817]</a> Re: [Feature <a class="issue tracker-2 status-6 priority-4 priority-default closed" title="Feature: Should be an easy way of reading N characters from am I/O stream (Rejected)" href="https://bugs.ruby-lang.org/issues/908">#908</a>] Should be an easy way of reading N characters from am I/O stream"<br> on Wed, 4 Feb 2009 09:33:44 +0900, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a> writes:</p> <p>|> Re: seeking by character count - it would be nice, but I have no idea how it could be implemented efficiently!<br> |<br> |For UTF-8/UTF-16/SJIS/EUC-JP/BIG5 .. no. UTF-32 is dword aligned but<br> |you cannot tell what byte ordering it uses reliably. Bad thing. Seeks<br> |are probably not for text files or only for text files you have parsed<br> |already so you know where you are seeking.</p> <p>Right. Hence I reject the character-based seek. Thank you.</p> <p>Regarding the original N character read, I am positive, but still<br> haven't decided on the API yet.</p> <pre><code> matz. 
</code></pre> </article> <article> <p>2009-02-15T10:11:28Z</p> <p>2009/2/14 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p> <blockquote> <p>In article <a href="mailto:op.uotab6oa9245dp@kool" class="email">op.uotab6oa9245dp@kool</a>,<br> "Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p> <blockquote> <p>That's right - These files are quite small, and I only need to do<br> sequential I/O. I want to keep the format backward compatible when using a<br> single-byte encoding.</p> </blockquote> <p>Would you show an example of such a format?</p> <p>I can't imagine a fixed-length field for which a single-byte<br> encoding (US-ASCII) is usable and a multibyte encoding is<br> useful.</p> <p>For example, zip code or some fixed numbering system is<br> fixed length but multibyte encoding is not useful.</p> </blockquote> <p>Let's make it more general - what about the first N characters or first N lines?</p> <p>I'm sure you can understand this is useful.</p> <p>How does the lines() Enumerator interact with the IO?</p> <p>If a method like head(N) was implemented on it would it leave the IO<br> pointing to the text after the first N records, be it chars, lines, or<br> anything else?</p> <p>Can that Enumerator be created so that it starts enumerating at the<br> current file position?</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-16T19:53:03Z</p> <p>2009/2/15 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p> <blockquote> <p>In article <a href="mailto:a5d587fb0902141711q780f0d24jef9be9b8bbe69b2a@mail.gmail.com" class="email">a5d587fb0902141711q780f0d24jef9be9b8bbe69b2a@mail.gmail.com</a>,<br> 
Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a> writes:</p> <blockquote> <blockquote> <p>For example, zip code or some fixed numbering system is<br> fixed length but multibyte encoding is not useful.</p> </blockquote> <p>Let's make it more general - what about the first N characters or first N lines?</p> <p>I'm sure you can understand this is useful.</p> </blockquote> <p>I don't think I will understand the usefulness until an actual<br> example is shown.</p> <blockquote> <p>If a method like head(N) was implemented on it would it leave the IO<br> pointing to the text after the first N records, be it chars, lines, or<br> anything else?</p> </blockquote> <p>What is represented by the N chars?</p> </blockquote> <p>I don't understand the question. N chars are N chars, they do not<br> represent anything else.</p> <p>It's actually not that hard except the synchronization is not perfect.<br> By using chars and then lines I lost "F"</p> <pre><code>irb(main):001:0> f=File.open "rom.asm"
=> #&lt;File:rom.asm&gt;
irb(main):002:0> f.chars.take(10)
=> ["0", "0", "0", "0", "0", "0", "0", "0", " ", " "]
irb(main):003:0> f.lines.take(3)
=> ["A cli\n", "00000001 FC cld\n",
"00000002 66670F0115000000 o32 lgdt [dword 0x0]\n"]
irb(main):004:0> f.seek(0)
=> 0
irb(main):005:0> f.lines.take(1)
=> ["00000000 FA cli\n"]
</code></pre> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-19T00:21:43Z</p> <p>2009/2/18 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p> <blockquote> <p>In article <a href="mailto:a5d587fb0902160252u56b50cfdv8e0fd36bb4f0b1b3@mail.gmail.com" class="email">a5d587fb0902160252u56b50cfdv8e0fd36bb4f0b1b3@mail.gmail.com</a>,<br> Michal Suchanek <a href="mailto:hramrach@centrum.cz" 
class="email">hramrach@centrum.cz</a> writes:</p> <blockquote> <blockquote> <p>What is represented by the N chars?</p> </blockquote> <p>I don't understand the question. N chars are N chars, they do not<br> represent anything else.</p> </blockquote> <p>I expect something like a person's name, zip code, etc.</p> <p>However, a person's name is variable length.</p> <p>The zip code (in Japan) is fixed length but multibyte<br> encoding is not useful because it uses only digits.</p> </blockquote> <p>As was explained by the original poster there are file formats similar<br> to CSV that use fixed field length instead of separators. I have<br> myself used such files, and they were in 8-bit fixed width encoding.</p> <p>However, if you want to "upgrade" your code that uses such files to<br> multibyte for international support you need to read N characters.</p> <p>Of course, the alternative is to change your code to use a different<br> format. This might make exports to and imports from legacy applications<br> hard, however.</p> <p>Sure, the export can never be perfect if the files really contain<br> internationalized data because recoding to the legacy format and<br> encoding loses some information then.</p> <blockquote> <p>I'm not sure about the usage of the method for "reading N<br> characters".</p> </blockquote> <p>Yes, reading N characters does not seem very useful outside of very<br> specialized scenarios. 
Most sane file formats use string length in<br> bytes or separators.</p> <p>However, reading N characters, lines, or any other units for which you<br> have an IO enumerator seems useful to me.</p> <p>Actually reading N lines using the correct line separator would fetch<br> N records from the file without the need to construct a loop for that<br> (or repeat the method for reading a line N times).</p> <blockquote> <blockquote> <p>It's actually not that hard except the synchronization is not perfect.<br> By using chars and then lines I lost "F"</p> </blockquote> <p>I guess the enumerator uses lookahead.</p> </blockquote> <p>That's unfortunate for using the Enumerator with other methods for<br> reading files.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-19T19:55:38Z</p> <p>2009/2/19 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p> <blockquote> <p>In article <a href="mailto:op.upklh9q19245dp@kool" class="email">op.upklh9q19245dp@kool</a>,<br> "Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p> </blockquote> <blockquote> <blockquote> <p>Also it seems to me that the current usage of the "limit" parameter of<br> IO#gets is not intuitive in 1.9. It is "maximum number of bytes, but don't<br> split a character", and I think it should be changed to mean "maximum<br> number of chars". That would be much more obvious, more useful (IMHO), and<br> still be backward compatible with 1.8.</p> </blockquote> <p>It was introduced for security reasons. 
Bytes are more stable<br> than characters.</p> </blockquote> <p>However, the security would be served as well by a character limit.</p> <p>As I understand it this limit was introduced so that a gets does not<br> read several gigabytes of data at once in case there is no line<br> separator.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-19T20:01:49Z</p> <p>2009/2/19 Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a>:</p> <blockquote> <p>In article <a href="mailto:op.upklh9q19245dp@kool" class="email">op.upklh9q19245dp@kool</a>,<br> "Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p> <blockquote> <p>In more detail: I have a legacy system that uses fixed length fields. Yes,<br> a name is variable length, but some old systems use a fixed length field,<br> say 40 chars, which is space filled on the right (or truncated). In my<br> case, the data input is by a form, and each field is fixed width. I am<br> changing the system so that the SAME forms can be used, but extended to<br> use UTF-8 not just ASCII. So this means that the number of characters is<br> still fixed, but the number of bytes is no longer fixed. I do <em>not</em> want<br> to change the format of the file (though it probably should be, but that<br> would be a lot more work), because I want the application to be backward<br> compatible (when using ASCII data).</p> </blockquote> <p>This is what I wanted to hear. 
Thank you for the explanation.</p> <p>It seems the number, 40, is a number for "big enough for<br> names".</p> <p>Why don't you use 40 bytes data format, both with Ruby 1.8<br> and 1.9?</p> <p>Do you think that 40 bytes is not big enough for names in<br> some country?</p> <p>If the data format uses 40 bytes, instead of 40 chars,<br> it is easy to read it in Ruby 1.8, even if it contains UTF-8<br> chars.</p> </blockquote> <p>While this might ease working with the file data it might make<br> designing the form more challenging.</p> <p>The things to consider:</p> <ul> <li>checking for byte length rather than character length in something<br> like JavaScript (probably possible)</li> <li>explaining the length limit to the user of the application (I would<br> not want to do that)</li> <li>making sure that 40 bytes is long enough for names in languages<br> that use exotic characters</li> </ul> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-20T13:51:25Z</p> <p>At 19:59 09/02/19, Michael Selig wrote:</p> <blockquote> <p>Also there are reports reading the data which expect the data to be 40<br> characters wide. If it wasn't 40 chars, the formatting of the report may<br> screw up.</p> </blockquote> <p>Hello Michael,</p> <p>In general, I agree that being able to work with character numbers<br> is desirable. The implementation isn't exactly easy, but I hope<br> eventually we will get there. My current guess is that this might<br> mean that we have to move IO and related stuff a bit more towards<br> a model with classes stacked on top of each other. But that's just<br> a guess.</p> <p>But regarding your point of format screwup, measuring things in<br> characters won't help. 
Assuming that each character has the same<br> width just doesn't carry very far if you look at all the scripts<br> around the world.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> <article> <p>2009-02-20T13:51:37Z</p> <p>At 19:00 09/02/19, Tanaka Akira wrote:</p> <blockquote> <p>It seems the number, 40, is a number for "big enough for<br> names".</p> <p>Why don't you use 40 bytes data format, both with Ruby 1.8<br> and 1.9?</p> <p>Do you think that 40 bytes is not big enough for names in<br> some country?</p> </blockquote> <p>Very much so. A typical example would be Georgia, where<br> many names are as long as some of the longer ones in<br> Europe, but they require 3 bytes per character.</p> <blockquote> <blockquote> <p>Also it seems to me that the current usage of the "limit" parameter of<br> IO#gets is not intuitive in 1.9. It is "maximum number of bytes, but don't<br> split a character", and I think it should be changed to mean "maximum<br> number of chars". That would be much more obvious, more useful (IMHO), and<br> still be backward compatible with 1.8.</p> </blockquote> <p>It was introduced for security reasons. Bytes are more stable<br> than characters.</p> </blockquote> <p>Can you give more specific explanations of why reading a number<br> of characters might not be secure?</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. 
Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> <article> <p>2009-02-23T17:34:39Z</p> <p>At 01:00 09/02/23, Tanaka Akira wrote:</p> <blockquote> <p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br> Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p> </blockquote> <blockquote> <blockquote> <p>Can you give more specific explanations of why reading a number<br> of characters might not be secure?</p> </blockquote> <p>I considered ISO-2022-JP, Unicode combining characters and<br> Punycode.</p> <p>In these encodings, fixed number of characters doesn't limit<br> the number of bytes.</p> </blockquote> <p>Why do you think there is a need to limit the number of bytes?<br> In general, that's not how Ruby works, at least not as far as<br> I understand.</p> <p>Regards, Martin.</p> <blockquote> <p>However they may not cause problem now because Ruby doesn't<br> support combining characters, etc. But Ruby's encoding<br> system is extensible. It is possible to define an encoding<br> which makes the character-wise limit insecure.</p> <p>Tanaka Akira</p> </blockquote> <p>#-#-# Martin J. Du"rst, Assoc. 
Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> <article> <p>2009-02-23T19:02:59Z</p> <p>2009/2/22 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p> <blockquote> <p>On Mon, 23 Feb 2009 03:00:41 +1100, Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a> wrote:</p> <blockquote> <p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br> Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p> <blockquote> <p>Can you give more specific explanations of why reading a number<br> of characters might not be secure?</p> </blockquote> <p>I considered ISO-2022-JP, Unicode combining characters and<br> Punycode.</p> <p>In these encodings, fixed number of characters doesn't limit<br> the number of bytes.</p> </blockquote> <p>Sure, but how does that make it "insecure"?</p> <blockquote> <p>However they may not cause problem now because Ruby doesn't<br> support combining characters, etc. But Ruby's encoding<br> system is extensible. It is possible to define an encoding<br> which makes the character-wise limit insecure.</p> </blockquote> <p>Sorry, I do not really understand what you mean by the word "insecure".<br> Perhaps you could explain what you mean in more detail.<br> Also I still do not understand why you say the character limit might be<br> "insecure". 
Can you give an example, please?</p> </blockquote> <p>Theoretically if separate (combining) character accents are considered<br> part of the character then a character might be quite long - I guess<br> about ten codepoints which can be themselves up to six bytes. However,<br> the number of accents one can put together should be limited to<br> meaningful combinations so this should still be secure - as long as<br> the code which determines what is a valid character does not have<br> bugs. This might be tricky in some cases, though.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-23T19:36:17Z</p> <p>2009/2/23 Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a>:</p> <blockquote> <p>2009/2/22 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p> <blockquote> <p>On Mon, 23 Feb 2009 03:00:41 +1100, Tanaka Akira <a href="mailto:akr@fsij.org" class="email">akr@fsij.org</a> wrote:</p> <blockquote> <p>In article <a href="mailto:6.0.0.20.2.20090220134502.0823ee98@localhost" class="email">6.0.0.20.2.20090220134502.0823ee98@localhost</a>,<br> Martin Duerst <a href="mailto:duerst@it.aoyama.ac.jp" class="email">duerst@it.aoyama.ac.jp</a> writes:</p> <blockquote> <p>Can you give more specific explanations of why reading a number<br> of characters might not be secure?</p> </blockquote> <p>I considered ISO-2022-JP, Unicode combining characters and<br> Punycode.</p> <p>In these encodings, fixed number of characters doesn't limit<br> the number of bytes.</p> </blockquote> <p>Sure, but how does that make it "insecure"?</p> <blockquote> <p>However they may not cause problem now because Ruby doesn't<br> support combining characters, etc. But Ruby's encoding<br> system is extensible. 
It is possible to define an encoding<br> which makes the character-wise limit insecure.</p> </blockquote> <p>Sorry, I do not really understand what you mean by the word "insecure".<br> Perhaps you could explain what you mean in more detail.<br> Also I still do not understand why you say the character limit might be<br> "insecure". Can you give an example, please?</p> </blockquote> <p>Theoretically if separate (combining) character accents are considered<br> part of the character then a character might be quite long - I guess<br> about ten codepoints which can be themselves up to six bytes. However,<br> the number of accents one can put together should be limited to<br> meaningful combinations so this should still be secure - as long as<br> the code which determines what is a valid character does not have<br> bugs. This might be tricky in some cases, though.</p> </blockquote> <p>BTW the same goes for "reading N bytes up to a character boundary"<br> unless you are willing to accept that you might read nothing even if<br> data is available because the character did not fit into N bytes, even<br> for a large N.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-24T10:07:56Z</p> <p>At 19:02 09/02/23, Michal Suchanek wrote:</p> <blockquote> <p>Theoretically if separate (combining) character accents are considered<br> part of the character then a character might be quite long - I guess<br> about ten codepoints</p> </blockquote> <p>For actual real-life examples, much less than that.</p> <p>For Indic grapheme clusters (which use mostly base characters,<br> not combining characters), the number can indeed get to around 10.</p> <p>In theory, there is no limitation as to how many combining characters<br> can follow a base character.</p> <blockquote> <p>which can be themselves up to six bytes.</p> </blockquote> <p>No, just up to 
four.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> <article> <p>2009-02-24T22:21:58Z</p> <p>2009/2/24 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p> <blockquote> <p>On Mon, 23 Feb 2009 21:35:30 +1100, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a><br> wrote:</p> <blockquote> <blockquote> <p>Theoretically if separate (combining) character accents are considered<br> part of the character then a character might be quite long - I guess<br> about ten codepoints which can be themselves up to six bytes. However,<br> the number of accents one can put together should be limited to<br> meaningful combinations so this should still be secure - as long as<br> the code which determines what is a valid character does not have<br> bugs. This might be tricky in some cases, though.</p> </blockquote> <p>BTW the same goes for "reading N bytes up to a character boundary"<br> unless you are willing to accept that you might read nothing even if<br> data is available because the character did not fit into N bytes, even<br> for a large N.</p> </blockquote> <p>The current behaviour of IO#gets "limit" parameter is "read N bytes but<br> round <em>UP</em> to the next character boundary". Therefore you may get more bytes<br> returned than requested. As long as "limit" is 1 or more, you should always<br> read something unless the file is at EOF, right?</p> <p>However I still do not understand why reading N characters (instead of the<br> current "limit" implementation) might be described as being "insecure". 
Can<br> someone please explain it? It is clear that the string returned may be<br> larger than N bytes long, but why is that "insecure"?</p> </blockquote> <p>The current gets is not any more secure since it rounds up. If you<br> found a bug in an encoding that could give you an infinite character<br> either version would break.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2009-02-25T22:54:03Z</p> <p>2009/2/24 Michael Selig <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a>:</p> <blockquote> <p>Hi Michal,</p> <p>On Wed, 25 Feb 2009 00:20:52 +1100, Michal Suchanek <a href="mailto:hramrach@centrum.cz" class="email">hramrach@centrum.cz</a><br> wrote:</p> <blockquote> <blockquote> <blockquote> <p>BTW the same goes for "reading N bytes up to a character boundary"<br> unless you are willing to accept that you might read nothing even if<br> data is available because the character did not fit into N bytes, even<br> for a large N.</p> </blockquote> <p>The current behaviour of IO#gets "limit" parameter is "read N bytes but<br> round <em>UP</em> to the next character boundary". Therefore you may get more bytes<br> returned than requested. As long as "limit" is 1 or more, you should always<br> read something unless the file is at EOF, right?</p> <p>However I still do not understand why reading N characters (instead of the<br> current "limit" implementation) might be described as being "insecure". Can<br> someone please explain it? It is clear that the string returned may be<br> larger than N bytes long, but why is that "insecure"?</p> </blockquote> <p>The current gets is not any more secure since it rounds up. 
If you<br> found a bug in an encoding that could give you an infinite character<br> either version would break.</p> </blockquote> <p>Actually rounding <em>up</em> to the character boundary <em>is</em> more secure. You<br> pointed it out yourself! If it rounded down, gets could return an empty<br> string, and code like:</p> <pre><code>while s = f.gets(1) .....</code></pre> <p>would then go into an infinite loop.</p> </blockquote> <p>Then either way is insecure because you can get an infinite loop with<br> zero read (unless zero read returned nil or threw an exception) and<br> potentially infinite memory requirement with a broken encoding and<br> rounding up.</p> <p>Thanks</p> <p>Michal</p> </article> <article> <p>2010-04-04T01:08:28Z</p> <ul><li><strong>Category</strong> set to <i>core</i></li><li><strong>Target version</strong> set to <i>2.0.0</i></li></ul> </article> <article> <p>2010-09-14T16:47:02Z</p> <ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li></ul> </article> <article> <p>2012-02-13T21:05:48Z</p> <p>Are there any volunteers to summarize the discussion?<br> I cannot understand the whole discussion about this ticket.<br> Redmine failed to capture some mails.</p> <p>--<br> Yusuke Endoh <a href="mailto:mame@tsg.ne.jp" class="email">mame@tsg.ne.jp</a></p> </article> <article> <p>2012-10-27T04:41:45Z</p> <ul><li><strong>Target version</strong> changed from <i>2.0.0</i> to <i>2.6</i></li></ul><p>I changed the target of this ticket to "next minor" because there has been no response here.</p> 
</article> <article> <p>2017-10-19T13:03:17Z</p> <ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Rejected</i></li></ul><p>I'm rejecting this issue since it has been stalled for five years. If anyone really needs it, it would be good to re-organize the discussion all over again.</p> </article> </main></body></html>