stdio performance in tango, stdlib, and perl

Wed Mar 21 16:40:15 PDT 2007

kris wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> kris wrote:
>>
>>> Andrei Alexandrescu (See Website For Email) wrote:
>>>
>>>> 13.9s        Tango
>>>> 6.6s        Perl
>>>> 5.0s        std.stdio
>>>
>>>
>>>
>>> There's a couple of things to look at here:
>>>
>>> 1) if there's an idiom in tango.io, it would be rewriting the example 
>>> like this:  Cout.conduit.copy (Cin.conduit)
>>
>> The test code assumed taking a look at each line before printing it, 
>> so speed of line reading and writing was deemed as important, not 
>> speed of raw I/O, which we all know how to get.
> 
> Yep, just trying to isolate things
> 
>>> 3) the test would appear to be stressing the parsing of lines just as 
>>> much (if not more) than the io system itself. All part-and-parcel to 
>>> a degree, but it may be worth investigating
>>
>>
>> I don't understand this.
> 
> Just suggesting that the scanning for [\r]\n patterns is likely a good 
> chunk of the CPU time
> 
>>> b) foregoing the output .newline, purely as an experiment
>>
>>
>> 4.7s    tcat
> 
> Thanks. If tango.io were to retain CR on readln, then it would come out 
> ahead of everything else in this particular test

Probably, but that has to be tested. Newlines make up about 3% of the 
file size.

> Can you distill the benefits of retaining CR on a readline, please?

I am pasting fragments from an email to Walter. He suggested chopping the 
newline at one point, and I managed to persuade him to keep it in there.

Essentially it's about information. The naive loop:

while (readln(line)) {
   write(line);
}

is guaranteed 100% to produce an accurate copy of its input. The version 
that chops lines looks like:

while (readln(line)) {
   writeln(line);
}

If the last line of the input has no trailing newline, this adds one, 
creating a file larger by one byte. This is the kind of imprecision that 
makes the difference between a well-designed API and an almost-good one. 
Moreover, with automated chopping it is basically impossible to write a 
program that exactly reproduces its input, because readln discards the 
information about how (and whether) each line was terminated.
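
To make the one-byte difference concrete, here is a minimal sketch in D. 
It does not touch real stdio: nextLineKeep, copyKeeping and copyChopping 
are hypothetical helpers that replay the two loops above over an 
in-memory string, and only '\n' terminators are handled (no "\r\n").

```d
import std.stdio : writeln;

// Hypothetical stand-in for a readln that KEEPS the terminator: returns the
// next line of `rest` including its '\n' (if any) and advances `rest`.
// The result is empty only when the input is exhausted.
string nextLineKeep(ref string rest)
{
    size_t i = 0;
    while (i < rest.length && rest[i] != '\n')
        ++i;
    if (i < rest.length)
        ++i; // include the '\n' itself
    string line = rest[0 .. i];
    rest = rest[i .. $];
    return line;
}

// Loop 1: keep the terminator and write each line back unchanged.
string copyKeeping(string input)
{
    string output;
    for (string line = nextLineKeep(input); line.length > 0; line = nextLineKeep(input))
        output ~= line;
    return output;
}

// Loop 2: chop the terminator, then append one unconditionally (writeln-style).
string copyChopping(string input)
{
    string output;
    for (string line = nextLineKeep(input); line.length > 0; line = nextLineKeep(input))
    {
        string chopped = line[$ - 1] == '\n' ? line[0 .. $ - 1] : line;
        output ~= chopped ~ "\n";
    }
    return output;
}

void main()
{
    string noTrailingNewline = "alpha\nbeta";
    assert(copyKeeping(noTrailingNewline) == noTrailingNewline);  // exact copy
    assert(copyChopping(noTrailingNewline) == "alpha\nbeta\n");   // one byte longer
    writeln("ok");
}
```

The chopping loop cannot tell "alpha\nbeta" from "alpha\nbeta\n", so both 
inputs come out as the same output.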

stdio also offers a readln() that returns a newly allocated line on 
every call. That is useful if you want a fresh line on every read:

char[] line;
while ((line = readln()).length > 0) {
   ++dictionary[line];
}

The code _just works_ because an empty line means _precisely_ and 
without the shadow of a doubt that the file has ended. (An I/O error 
throws an exception and does NOT return an empty line; that is another 
important point.) An API that chops automatically should not offer 
such a function, because an empty result may mean either that an empty 
line was read or that the end of the file was reached. So the API would 
force people to write convoluted code.
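
The ambiguity is easy to reproduce. The sketch below is only an 
illustration: LineSource is a hypothetical in-memory reader (not part of 
stdio or tango.io) whose readln() either keeps or chops the '\n', and 
countLines runs the dictionary loop from above against it.

```d
import std.stdio : writeln;

// Hypothetical in-memory reader. With keepTerminator == true it mimics a
// readln() that keeps '\n'; with false it mimics one that chops it.
struct LineSource
{
    string rest;
    bool keepTerminator;

    string readln()
    {
        size_t i = 0;
        while (i < rest.length && rest[i] != '\n')
            ++i;
        immutable bool sawNewline = i < rest.length;
        string line = rest[0 .. i + (sawNewline && keepTerminator ? 1 : 0)];
        rest = rest[i + (sawNewline ? 1 : 0) .. $];
        return line;
    }
}

// The dictionary loop from the text: stop at the first empty result.
uint[string] countLines(LineSource src)
{
    uint[string] dictionary;
    string line;
    while ((line = src.readln()).length > 0)
        ++dictionary[line];
    return dictionary;
}

void main()
{
    string text = "a\n\nb\n"; // the second line is blank
    auto kept = countLines(LineSource(text, true));
    auto chopped = countLines(LineSource(text, false));
    assert(kept.length == 3);    // "a\n", "\n" and "b\n" are all seen
    assert(chopped.length == 1); // the blank line looks like eof; "b" is never read
    writeln("ok");
}
```

With the terminator kept, a blank line arrives as "\n" (length 1) and the 
loop carries on; with chopping, the same loop silently stops at the first 
blank line.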

In the couple of years I've used Perl I've thanked the Perl folks for 
their readline decision numerous times.

Ever tried to read input with cin or fscanf? You can't do any intelligent 
input with them because they skip whitespace and newlines like it's out 
of style. All of my C++ applications use getline() or fgets() (both of 
which thankfully do include the newline) and then process the line in 
place.

Andrei