stdio performance in tango, stdlib, and perl

Tue Mar 27 10:34:12 PDT 2007

On Tue, 27 Mar 2007 16:27:57 +0200, Roberto Mariottini wrote:

> Hi,
> I have got no reply to my questions.
> Can somebody answer them?
> 
> Ciao
> 
> -------- Original Message --------
> Subject: Re: stdio performance in tango, stdlib, and perl
> Date: Fri, 23 Mar 2007 10:08:24 +0100
> From: Roberto Mariottini <rmariottini at mail.com>
> Organization: Digital Mars
> Newsgroups: digitalmars.D
> References: <4601A54A.8050307 at erdani.org> 
> <etsbup$2c5t$1 at digitalmars.com> <4601B819.6080001 at erdani.org> 
> <etse2m$2fa2$1 at digitalmars.com> <4601C25F.9050107 at erdani.org> 
> <ettem8$qgl$1 at digitalmars.com> <4602C66E.4020100 at erdani.org>
> 
> Andrei Alexandrescu (See Website For Email) wrote:
>  > Roberto Mariottini wrote:
> [...]
>  >>> Essentially it's about information. The naive loop:
>  >>>
>  >>> while (readln(line)) {
>  >>>   write(line);
>  >>> }
>  >>
>  >> I'm completely against that awful mess of code.
>  >
>  > What exactly would be bad about it?
> 
> It's not clearly evident for a non-expert programmer that a new-line is
> appended at each line.
> Take any programmer from any language of your choice and ask what this
> snippet is supposed to do.
> This is against immediate comprehension of code.

One of the small issues I have with 'readln' appending newline
character(s) to the end of a line is that such characters are not actually
a part of the text line; they are delimiters that separate one line from
another. In essence they are the same type of thing as the null byte that
marks the end of a C-style string.

If the purpose of returning the newline character(s) from readln() is to
inform the caller that a complete line was actually read in, then I would
have thought that this is 'optional' data that the caller could choose to
know about or not. If I call readln() and a complete line was not read in, I
would consider this an exception. And by the way, a text file that does not
terminate with a newline is not an exception from my point of view, as this
could be just a situation in which a delimiting newline is not required
(there is nothing to delimit the last line from).
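
To make the 'optional data' idea concrete, here is a minimal sketch in
Python rather than D. The read_lines() helper and its keep_terminator flag
are hypothetical names of my own, not part of any library discussed in this
thread. By default the caller gets the line text only and asks for the
terminator explicitly when it matters.

```python
import io

def read_lines(stream, keep_terminator=False):
    """Yield each line of `stream`; the terminator is returned only on request."""
    for line in stream:
        # Iterating a text stream keeps the trailing "\n" when one is present.
        if not keep_terminator and line.endswith("\n"):
            line = line[:-1]
        yield line

text = "alpha\nbeta\ngamma"   # the last line has no trailing newline
stripped = list(read_lines(io.StringIO(text)))
kept = list(read_lines(io.StringIO(text), keep_terminator=True))
```

With the flag off, all three lines come back looking the same; with it on,
only the last one lacks a "\n", which is how a caller who cares can tell
that the final line was unterminated.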

>  >>> is guaranteed 100% to produce an accurate copy of its input. The
>  >>> version that chops lines looks like:
>  >>>
>  >>> while (readln(line)) {
>  >>>   writeln(line);
>  >>> }
>  >>>
>  >>> This may or may not add a newline to the output, possibly creating a
>  >>> file larger by one byte.
>  >>
>  >> Are you sure? Can you elaborate more on this?
>  >
>  > Very simple. If the file ends with a newline, the code reproduces it. If
>  > not, the code gratuitously appends a newline.
> 
> A newline is two bytes here.

Some readln() implementations disregard the actual newline as supplied by
the operating system and just append a single 0x0A byte on all operating
systems. When it comes to outputting this, it is transformed back into
the appropriate newline sequence for the running operating system.
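
As an illustration of that normalise-then-translate behaviour, here is a
small Python sketch using io.TextIOWrapper. It is only an analogy for what a
readln() implementation might do, not a description of Phobos or Tango. I
pin the output convention to "\r\n" explicitly so the result does not depend
on the machine running it; that choice is an assumption for the example.

```python
import io

raw = b"one\r\ntwo\r\n"          # bytes as a Windows-style file would store them

# newline=None enables universal newlines: "\r\n" is read back as "\n".
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
lines = reader.readlines()

# newline="\r\n" on the output side turns each "\n" written into "\r\n".
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="ascii", newline="\r\n")
writer.writelines(lines)
writer.flush()
round_trip = sink.getvalue()
```

The program only ever sees a single 0x0A per line, yet the bytes written out
match the original two-byte terminators.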

>  >>> Moreover, with the automated chopping it is basically impossible to
>  >>> write a program that exactly reproduces its input because readln
>  >>> essentially loses information.
> 
> A text file is not a binary file.
> A newline at end of file is completely irrelevant.

Exactly. It is merely a delimiter *between* lines.

> On the other side, no code should break if the last newline is there or
> not. The problem with your code is that the last line comes different
> from the others.

The last line does not need a delimiter - so some systems make it optional.

>  >>> Also, stdio also offers a readln() that creates a new line on every
>  >>> call. That is useful if you want fresh lines every read:
>  >>>
>  >>> char[] line;
>  >>> while ((line = readln()).length > 0) {
>  >>>   ++dictionary[line];
>  >>> }
>  >>
>  >> This way you'll get two different dictionaries on Windows and on Unix.
>  >> Wrong, very wrong.
>  >
>  > Yes, wrong, very wrong. Except it's not me who's wrong :o).
> 
> Ehm, can you elaborate how good it is to put a '\n' at the end of any
> string when working with:
> 
>   - databases
>   - communication programs
>   - interprocess communication
>   - distributed computing

Does not make a lot of sense to me either. Like I said earlier, the first
thing I usually do when reading a line is to remove the damned newline
character(s).
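
A quick way to see why the terminator gets in the way for the dictionary
case is this Python sketch (again a stand-in for the D code, with sample
data I made up). When the last line of the input has no newline, keeping
terminators produces two different keys for the same word.

```python
import io
from collections import Counter

text = "apple\npear\napple"      # final line lacks a terminator

# Keys keep their "\n", so "apple\n" and "apple" are counted separately.
with_newlines = Counter(io.StringIO(text))

# Stripping the delimiter first gives one key per distinct word.
without_newlines = Counter(line.rstrip("\n") for line in io.StringIO(text))
```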

>  >>> The code _just works_ because an empty line means _precisely_ and
>  >>> without the shadow of a doubt that the file has ended. (An I/O error
>  >>> throws an exception, and does NOT return an empty line; that is
>  >>> another important point.) An API that uses automated chopping should
>  >>> not offer such a function because an empty line may mean that an
>  >>> empty line was read, or that it's eof time. So the API would force
>  >>> people to write convoluted code.
>  >>
>  >> What is your definition of "convolute"?
>  >> I find your code 'convolute', 'unclear', 'buggy' and 'unportable'.
>  >
>  > You are objectively wrong.
> 
> Say 'subjectively'.
> Assignments in boolean expressions should be avoided. The average
> programmer knows something about this magic, but fears to touch it, and
> never completely understands it.
> 
> Still, any programmer from any language would think that this code ends
> at the first empty line.
> 
> Here is one of the many possible non-convoluted versions:
> 
> char[] line = readln();
> while (line.length > 0) {
>    ++dictionary[chomp(line)];
>    line = readln();
> }
> 
> And this is how it should be:
> 
> char[] line = readln();
> while (line != null) {
>    ++dictionary[line];
>    line = readln();
> }

This depends on distinguishing between an empty line and a null line.
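
Here is a minimal Python sketch of that distinction. The read_line() wrapper
is a hypothetical helper of mine: it returns None at end of input and ""
for a genuinely empty line, so the loop does not stop early on a blank line.

```python
import io

def read_line(stream):
    """Return the next line without its terminator, or None at end of input."""
    line = stream.readline()
    if line == "":               # readline() returns "" only at end of input
        return None
    return line[:-1] if line.endswith("\n") else line

stream = io.StringIO("first\n\nlast\n")
seen = []
line = read_line(stream)
while line is not None:          # an empty line is "", which is not None
    seen.append(line)
    line = read_line(stream)
```

Note that the underlying readline() here uses the other convention (an empty
string means end of input, because real lines carry their "\n"), so the
wrapper is where the two conventions get translated.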

>  > The code is portable. Newline translation
>  > takes care of it. Just try it.
> 
> Newline translation is an old problem with C, C++ and now with D.
> Nothing can be resolved with newline translation.
> 
> Opening a file in binary mode on Unix and treating it like a text file
> works only as long as the program is run on Unix.
> Newline translation is prone to portability errors, thus non-portable.
> 
> In my experience, newline translation poses more portability problems
> than it solves.

Unless it is done right by the compiler/language, so that the code writer
does not have to do it each time. Much like a GC system.

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell