Phobos usability with text files
spir
denis.spir at gmail.com
Tue Dec 28 04:39:24 PST 2010
On Sun, 26 Dec 2010 12:53:29 -0500
Michel Fortin <michel.fortin at michelf.com> wrote:
> On 2010-12-26 12:13:41 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
> > On 12/26/10 10:12 AM, bearophile wrote:
> >> This is related to this (closed) issue, but this time I prefer to
> >> discuss the topic on the newsgroup:
> >> http://d.puremagic.com/issues/show_bug.cgi?id=4474
> >>
> >> To show why this enhancement request is useful I use a little scripting
> >> task. A D2 program has to read a file that contains one word in each
> >> line, and print all the words with the highest length.
> >>
> >> A first naive functional style implementation (if you keep in mind that
> >> File opens files in binary mode on default, this default is different
> >> from the Python one, I am not sure what's the best default. I think
> >> most files I open are text ones, so I think a bit better default is to
> >> open in text mode on default), it doesn't work, it produces an access
> >> violation:
> >>
> >> // #1
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >> auto lazyWords = File("words.txt", "r").byLine();
> >> auto maxLen = reduce!max(map!"a.length"(lazyWords));
> >> writeln(filter!((w){ return w.length == maxLen; })(lazyWords));
> >> }
> >>
> >>
> >> Finding the max length consumes the lazy iterable, so if the text file
> >> is small a possible functional style solution is to convert the lazy
> >> iterable into an array (other functional style solutions are possible,
> >> like reading the file twice, etc):
> >>
> >> // #2
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >> auto lazyWords = File("words.txt", "r").byLine();
> >> auto words = array(lazyWords);
> >> auto maxLen = reduce!max(map!"a.length"(words));
> >> writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >>
> >> But #2 doesn't work, because while "words" is an array of the words,
> >> byLine() uses the same buffer, so words is a sequence of useless stuff.
> >>
> >> To debug that code you need to dup each array item:
> >>
> >> // #3
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >> auto lazyWords = File("words.txt", "r").byLine();
> >> auto words = array(map!"a.dup"(lazyWords));
> >> auto maxLen = reduce!max(map!"a.length"(words));
> >> writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >>
> >>
> >> #3 works, but for scripting-line programs it's not the first version I
> >> write, I have had to debug the code. I think other programmers will
> >> have the same troubles. So I think byLine() default behavour is a bit
> >> bug-prone. Re-using the same buffer gives a good performance boost, but
> >> it's often a premature optimization in scripts. I prefer a language and
> >> library to use unsafe optimizations only on request.
> >>
> >> My suggested solution is to split File.byLine() in two member
> >> functions, one returns a lazy iterable that reuses the line buffer and
> >> one that doesn't reuse them. And my preferred solution is to give the
> >> shorter name (so it becomes the "default") to the method that dups the
> >> line, because this is the safer (less bug prone) behaviour. This is
> >> also in accord with D Zen. They may be named byLine() and byFastLine()
> >> or something similar.
> >>
> >> If you prefer the "default" one to be the less safe version, then the
> >> safe version may be named byDupLine():
> >>
> >> // #4
> >> import std.stdio, std.array, std.algorithm;
> >> void main() {
> >> auto lazyWords = File("words.txt", "r").byDupLine();
> >> auto lines = array(lazyWords);
> >> auto maxLen = reduce!max(map!"a.length"(words));
> >> writeln(filter!((w){ return w.length == maxLen; })(words));
> >> }
> >>
> >>
> >> Using the "maxs" (from
> >> http://d.puremagic.com/issues/show_bug.cgi?id=4705 ), the code becomes:
> >>
> >> // #5
> >> import std.stdio, std.algorithm;
> >> void main() {
> >> auto lazyWords = File("words.txt", "r").byDupLine();
> >> writeln(maxs!"a.length"(lazyWords));
> >> }
> >
> > But you don't need a new string for each line to evaluate max over line
> > lengths; the current byLine works.
>
> That's true.
>
>
> > Generally I think buffer reuse in byLine() is too valuable to let go.
>
> I also agree it's wasteful.
That's true, indeed. But here the point, I guess, is that you need the reuse the line sequence (whatever it is), as argument to filter.
> But I think bearophile's experiment has illustrated two noteworthy
> problems. The first issue is that calling filter! on the
> already-consumed result of byLine() gives you a seg fault. I reproduced
> this locally, but haven't pinpointed the problem.
>
> The second one is this:
>
> array(file.byline())
>
> which gives a wrong result because of the buffer reuse. Either it
> should not compile or it should idup every line (both of which are not
> entirely satisfactory, but they're better than getting wrong results).
I tried it myself, and the result is very strange:
unittest {
// text.txt holds: "abc\ndefgh\nijk\nlmn\nopqrs\ntuvwx\nyz\n"
auto lazyWords = File("test.txt").byLine();
//~ auto words = Array(map!((ll){return ll.dup;})(lazyWords));
auto words = Array(lazyWords);
writeln(words);
auto lengths = map!((l){return l.length;})(words);
writeln(lengths);
auto maxLength = reduce!max(lengths);
writeln(maxLength);
auto longWords = filter!((l){return l.length==maxLength;})(words);
writeln(longWords);
}
==>
[yz
, yz
wx, yz
, yz
, yz
wx, yz
wx, yz]
[3, 5, 3, 3, 5, 5, 2]
5
[yz
wx, yz
wx, yz
wx]
I don't understand why/how, asp that lengths is correct ;-)
When replacing the "auto words = ..." statements, we get as expected:
[abc, defgh, ijk, lmn, opqrs, tuvwx, yz]
[3, 5, 3, 3, 5, 5, 2]
5
[defgh, opqrs, tuvwx]
> I think a range should be able to express if the value can be reused or
> not. If the value can't be reused, then the algorithm should either not
> instantiate, or in some cases it might create a copy.
Makes sense. I also like Bearophile's proposal. But would let 'byLines' for the lazy/single-iteration-only version and call the dupping method simply 'lines'. For me (is it just me?), the latter correctly expresses an array-like collection of elements one can safely (re)use.
Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
More information about the Digitalmars-d
mailing list