A safer File.readln

Markus Laker via Digitalmars-d digitalmars-d at puremagic.com
Sun Jan 22 13:29:39 PST 2017


It's pretty easy to DoS a D program that uses File.readln or 
File.byLine:

msl at james:~/d$ prlimit --as=4000000000 time ./tinycat.d tinycat.d
#!/usr/bin/rdmd

import std.stdio;

void main(in string[] argv) {
     foreach (const filename; argv[1..$])
         foreach (line; File(filename).byLine)
             writeln(line);
}

0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 
4280maxresident)k
0inputs+0outputs (0major+292minor)pagefaults 0swaps
msl at james:~/d$ prlimit --as=4000000000 time ./tinycat.d /dev/zero
0.87user 1.45system 0:02.51elapsed 92%CPU (0avgtext+0avgdata 
2100168maxresident)k
0inputs+0outputs (0major+524721minor)pagefaults 0swaps
msl at james:~/d$

This trivial program, which runs in about 4MiB when asked to print 
itself, chewed up 2GiB of memory in about three seconds when 
handed an infinitely long input line, and it would have kept going 
if prlimit hadn't killed it.

D is in good company: C++'s getline() and Perl's diamond operator 
have the same vulnerability.

msl at james:~/d$ prlimit --as=4000000000 time ./a.out tinycat.cpp
#include <fstream>
#include <iostream>
#include <string>

int main(int const argc, char const *argv[]) {
     for (auto i = 1;  i < argc;  ++i) {
         std::ifstream fh {argv[i]};
         for (std::string line;  getline(fh, line, '\n');  )
             std::cout << line << '\n';
     }

     return 0;
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 
2652maxresident)k
0inputs+0outputs (0major+113minor)pagefaults 0swaps
msl at james:~/d$ prlimit --as=4000000000 time ./a.out /dev/zero
1.12user 1.76system 0:02.92elapsed 98%CPU (0avgtext+0avgdata 
1575276maxresident)k
0inputs+0outputs (0major+786530minor)pagefaults 0swaps
msl at james:~/d$ prlimit --as=4000000000 time perl -wpe '' tinycat.d
#!/usr/bin/rdmd

import std.stdio;

void main(in string[] argv) {
     foreach (const filename; argv[1..$])
         foreach (line; File(filename).byLine)
             writeln(line);
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 
3908maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
msl at james:~/d$ prlimit --as=4000000000 time perl -wpe '' /dev/zero
Out of memory!
Command exited with non-zero status 1
4.82user 2.34system 0:07.43elapsed 96%CPU (0avgtext+0avgdata 
3681400maxresident)k
0inputs+0outputs (0major+919578minor)pagefaults 0swaps
msl at james:~/d$

But I digress.

What would a safer API look like?  Perhaps we'd slip in a maximum 
line length as an optional argument to readln, byLine and friends:

enum size_t MaxLength = 1 << 20;    // 1MiB
fh.readln(buf, MaxLength);
buf = fh.readln(MaxLength);
auto range = fh.byLine(MaxLength);

Obviously, we wouldn't want to break compatibility with existing 
code by demanding a maximum line length at every call site.  
Perhaps the default maximum length should change from its current 
value -- infinity -- to something like 4MiB: longer than lines in 
most text files, but still affordably small on most modern 
machines.

What should happen if readln encounters an excessively long 
line?  Throw an exception?
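One possible shape, purely as a sketch -- boundedReadln and 
LineTooLongException are invented names, not existing Phobos API. 
It reads bytewise through the underlying FILE*, so memory stays 
bounded by the limit:

```d
import std.stdio;

/// Hypothetical: thrown when a line exceeds the caller's limit.
class LineTooLongException : Exception
{
    this(string msg, string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
    }
}

/// Hypothetical length-capped readln.  Returns the line including its
/// terminator, like readln; throws once more than maxLength bytes
/// (counting the terminator) would have to be buffered.
string boundedReadln(File fh, size_t maxLength)
{
    import core.stdc.stdio : fgetc, EOF;

    auto fp = fh.getFP();
    char[] buf;
    for (;;)
    {
        immutable c = fgetc(fp);
        if (c == EOF)
            break;                  // partial last line, as with readln
        buf ~= cast(char) c;
        if (buf.length > maxLength)
            throw new LineTooLongException("line exceeds maximum length");
        if (c == '\n')
            break;
    }
    return buf.idup;
}
```

A real implementation would presumably live inside readln itself and 
reuse its buffering, rather than going through fgetc; this only 
illustrates the throw-on-overflow behaviour.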

Markus


