Phobos usability with text files

Tue Dec 28 04:39:24 PST 2010

On Sun, 26 Dec 2010 12:53:29 -0500
Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2010-12-26 12:13:41 -0500, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> said:
> 
> > On 12/26/10 10:12 AM, bearophile wrote:
> >> This is related to this (closed) issue, but this time I prefer to 
> >> discuss the topic on the newsgroup:
> >> http://d.puremagic.com/issues/show_bug.cgi?id=4474
> >> 
> >> To show why this enhancement request is useful I use a little scripting 
> >> task. A D2 program has to read a file that contains one word in each 
> >> line, and print all the words with the highest length.
> >> 
> >> A first naive functional style implementation (if you keep in mind that 
> >> File opens files in binary mode on default, this default is different 
> >> from the Python one, I am not sure what's the best default. I think 
> >> most files I open are text ones, so I think a bit better default is to 
> >> open in text mode on default), it doesn't work, it produces an access 
> >> violation:
> >> 
> >> // #1
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >>      auto lazyWords = File("words.txt", "r").byLine();
> >>      auto maxLen = reduce!max(map!"a.length"(lazyWords));
> >>      writeln(filter!((w){ return w.length == maxLen; })(lazyWords));
> >> }
> >> 
> >> 
> >> Finding the max length consumes the lazy iterable, so if the text file 
> >> is small a possible functional style solution is to convert the lazy 
> >> iterable into an array (other functional style solutions are possible, 
> >> like reading the file twice, etc):
> >> 
> >> // #2
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >>      auto lazyWords = File("words.txt", "r").byLine();
> >>      auto words = array(lazyWords);
> >>      auto maxLen = reduce!max(map!"a.length"(words));
> >>      writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >> 
> >> But #2 doesn't work, because while "words" is an array of the words, 
> >> byLine() uses the same buffer, so words is a sequence of useless stuff.
> >> 
> >> To debug that code you need to dup each array item:
> >> 
> >> // #3
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >>      auto lazyWords = File("words.txt", "r").byLine();
> >>      auto words = array(map!"a.dup"(lazyWords));
> >>      auto maxLen = reduce!max(map!"a.length"(words));
> >>      writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >> 
> >> 
> >> #3 works, but for scripting-line programs it's not the first version I 
> >> write, I have had to debug the code. I think other programmers will 
> >> have the same troubles. So I think byLine() default behavour is a bit 
> >> bug-prone. Re-using the same buffer gives a good performance boost, but 
> >> it's often a premature optimization in scripts. I prefer a language and 
> >> library to use unsafe optimizations only on request.
> >> 
> >> My suggested solution is to split File.byLine() in two member 
> >> functions, one returns a lazy iterable that reuses the line buffer and 
> >> one that doesn't reuse them. And my preferred solution is to give the 
> >> shorter name (so it becomes the "default") to the method that dups the 
> >> line, because this is the safer (less bug prone) behaviour. This is 
> >> also in accord with D Zen. They may be named byLine() and byFastLine() 
> >> or something similar.
> >> 
> >> If you prefer the "default" one to be the less safe version, then the 
> >> safe version may be named byDupLine():
> >> 
> >> // #4
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >>      auto lazyWords = File("words.txt", "r").byDupLine();
> >>      auto lines = array(lazyWords);
> >>      auto maxLen = reduce!max(map!"a.length"(words));
> >>      writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >> 
> >> 
> >> Using the "maxs" (from 
> >> http://d.puremagic.com/issues/show_bug.cgi?id=4705 ), the code becomes:
> >> 
> >> // #5
> >> import std.stdio, std.algorithm;
> >> void main() {
> >>      auto lazyWords = File("words.txt", "r").byDupLine();
> >>      writeln(maxs!"a.length"(lazyWords));
> >> }
> > 
> > But you don't need a new string for each line to evaluate max over line 
> > lengths; the current byLine works.
> 
> That's true.	
> 
> 
> > Generally I think buffer reuse in byLine() is too valuable to let go.
> 
> I also agree it's wasteful.

That's true, indeed. But here the point, I guess, is that you need the reuse the line sequence (whatever it is), as argument to filter.

> But I think bearophile's experiment has illustrated two noteworthy 
> problems. The first issue is that calling filter! on the 
> already-consumed result of byLine() gives you a seg fault. I reproduced 
> this locally, but haven't pinpointed the problem.
> 
> The second one is this:
> 
> 	array(file.byline())
> 	
> which gives a wrong result because of the buffer reuse. Either it 
> should not compile or it should idup every line (both of which are not 
> entirely satisfactory, but they're better than getting wrong results).

I tried it myself, and the result is very strange:
unittest {
    // text.txt holds: "abc\ndefgh\nijk\nlmn\nopqrs\ntuvwx\nyz\n"
    auto lazyWords = File("test.txt").byLine();
//~     auto words = Array(map!((ll){return ll.dup;})(lazyWords));
    auto words = Array(lazyWords);
    writeln(words);
    auto lengths = map!((l){return l.length;})(words);
    writeln(lengths);
    auto maxLength = reduce!max(lengths);
    writeln(maxLength);
    auto longWords = filter!((l){return l.length==maxLength;})(words);
    writeln(longWords);
}
==>
[yz
, yz
wx, yz
, yz
, yz
wx, yz
wx, yz]
[3, 5, 3, 3, 5, 5, 2]
5
[yz
wx, yz
wx, yz
wx]

I don't understand why/how, asp that lengths is correct ;-)
When replacing the "auto words = ..." statements, we get as expected:
[abc, defgh, ijk, lmn, opqrs, tuvwx, yz]
[3, 5, 3, 3, 5, 5, 2]
5
[defgh, opqrs, tuvwx]

> I think a range should be able to express if the value can be reused or 
> not. If the value can't be reused, then the algorithm should either not 
> instantiate, or in some cases it might create a copy.

Makes sense. I also like Bearophile's proposal. But would let 'byLines' for the lazy/single-iteration-only version and call the dupping method simply 'lines'. For me (is it just me?), the latter correctly expresses an array-like collection of elements one can safely (re)use.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com