Text editing [Was: Re: #line decoder]

bearophile bearophileHUGS at lycos.com
Thu Sep 25 05:41:51 PDT 2008


BCS Wrote:
> so I just wiped out a little tool to do it for me.

I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

-----------------

But trying to write such little text oriented programs in D is positive anyway, because it helps spot that there are some rough edges in D/Phobos still that deserve some improvements.

For example, this is "data.txt":

jan        sun     12
feb        mon     14
mar        fri     23
aug        sat     3
jun        tue     15

The purpose is to read those 3 columns as separate arrays, so the output is transposed:

[["jan", "feb", "mar", "aug", "jun"],
 ["sun", "mon", "fri", "sat", "tue"],
 ["12", "14", "23", "3", "15"]]

With Python you can do it this way (this produces immutable sub arrays), loader1:

r1 = zip(*(l.split() for l in file("data2.txt")))

If you duplicate many times the data.txt to have a bigger file (for example a 44 MB text file) like this:

>>> t = file("data.txt").read()
>>> t2 = t * 400000
>>> len(t2)
43600000
>>> file("data2.txt", "w").write(t2)

you find that transposing is much too slow. This is my faster Python+Psyco version:

# loader2
from collections import deque
def main():
    fin = file("data2.txt")
    cols = [deque([el]) for el in fin.readline().split()]
    for line in fin:
        col = 0
        for el in line.split():
            cols[col].append(el)
            col += 1
import psyco; psyco.full()
main()

My first D version:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Disabling the GC doesn't help:

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Here using an ArrayBuilder of mine increases the running time about two-fold, because the GC wastes a large amount of time scanning pointers (I have not converted the ArrayBuilders into arrays because that operation is very fast, just a slicing, so hopefully it doesn't change the overall running time):

// loader5
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Using the ArrayBuilder and disabling the GC helps a little:

// loader6
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Using a lazy splitter generally helps a lot; this time it helps only a little, because the lines are made of few parts:

// loader7
import std.stream;
import std.string: split;
import d.string: xsplit;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

But I don't understand why disabling the GC with the xsplit worsens the situation:

// loader8
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

Using both an ArrayBuilder and lazy splitting improves the situation significantly:

// loader9
import std.stream;
import std.string: split;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

This time disabling the GC helps a lot (see loader8 for comparison), we are now closer to the timings of Python, so lot of people at this point can probably stop trying to optimize the code:

// loader10
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

But I have done a few other versions, trying to go faster than the Python version. In the following versions I have assumed the programmer knows the number of columns (3), in theory this helps the code. In theory this is very fast, because the xsplit() is very fast and there is no need to dup the elements, because you can just keep the txt string around (that read() loads quickly), and a matrix of string slices. In practice the GC kills the performance, I don't know why. Note that the running time of the foreach loop is very fast (0.3-0.4 s); most of the time is used in the final deallocation:

// loader11
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
}

Disabling the GC helps:

// loader12
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    disable();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    enable();
}

I have tried a few other mad things without good results.

Timings, data2.txt, warm timings, best of 3:
  loader1:  23.05 s
  loader2:   3.00 s
  loader3:   8.82 s
  loader4:  11.44 s
  loader5:  21.31 s
  loader6:   7.20 s
  loader7:   7.51 s
  loader8:   8.45 s
  loader9:   5.46 s
  loader10:  3.73 s
  loader11: 82.54 s
  loader12: 38.87 s

I could now probably write a pure C version, but I am too lazy for that :-)

If you know ways to speed up the loading of such file in D I'll gladly take a look at your code.
In bioinformatics I process txt files all the time. 

Bye,
bearophile


More information about the Digitalmars-d-announce mailing list