Can I speed up this log parsing script further?

uncorroded via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Jun 9 00:34:43 PDT 2017


Hi guys,

I am a beginner in D. As a project, I converted a log-parsing 
script that we use at work from Python to D. This blog post was 
helpful: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ 
I compiled it with both dmd and ldc. The log file is 52 MB. With dmd 
(not a release build) it takes 1.1 sec, and with ldc it takes 0.3 
sec.

The Python script (run with the system Python, not PyPy) takes 0.75 
sec. The D and Python functions are included below and are also on 
pastebin ( D - https://pastebin.com/SeUR3wFP , Python - 
https://pastebin.com/F5JbfBmE ).

Basically, I am reading the file line by line and checking each line 
for two constants. If either one is found, some processing is done on 
the line and the result is stored in an array for later analysis. I 
also tried reading the whole file in one go with std.file : readText 
and lazily splitting it on newlines with std.algorithm : splitter, 
but there was no difference in speed, so I used the byLine approach 
mentioned in the linked blog post. Is there a better way of doing 
this in D?

Note:
I ran GC profiling as described in the linked blog post. The results were:

Number of collections:  3
	Total GC prep time:  0 milliseconds
	Total mark time:  0 milliseconds
	Total sweep time:  0 milliseconds
	Total page recovery time:  0 milliseconds
	Max Pause Time:  0 milliseconds
	Grand total GC time:  2 milliseconds
GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms

So GC does not seem to be an issue.

Here's the D script:

import std.stdio;
import std.string;
import std.array;
import std.algorithm : splitter;
import std.typecons : tuple, Tuple;
import std.conv : to;

void read_log(string filename) {
     File file = File(filename, "r");
     Tuple!(char[], int, char[])[] npushed;
     Tuple!(int, char[], int, bool, bool)[] pushed;
     foreach (line; file.byLine) {
         if (line.indexOf("SOC_NOT_PUSHED") != -1) {
             auto tarr = line.split();
             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2]);
             continue;
         }
         if (line.indexOf("SYNC_PUSH:") != -1) {
             auto rel = line.split("SYNC_PUSH:")[1].strip();
             auto att = rel.split(" at ");
             auto ina = att[1].split(" in ");
             auto msa = ina[1].split(" ms ");
             pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
                     msa[1].indexOf("PA-SOC_POP") != -1,
                     msa[1].indexOf("CU-SOC_POP") != -1);
         }
     }
     // Using the arrays later on in production script
     writeln(npushed.length);
     writeln(pushed.length);
}


Here is the Python function:

def read_log(fname):
     try:
         with open(fname, 'r') as f:
             raw = f.read().splitlines()
             ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
             ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw
                   if 'SYNC_PUSH:' in w]
             not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
             ww = [(int(e.split(' at ')[0]),
                    e.split(' at ')[1].split(' in ')[0],
                    int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
                    set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split()))
                   for e in ss]
             pushed = [[w[0], w[1], w[2],
                        1 if 'PA-SOC_POP' in w[3] else 0,
                        1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
             return not_pushed, pushed
     except:
         return []
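
To make the per-line logic concrete, here is a minimal single-pass 
sketch in Python that mirrors the D loop. The helper name parse_line 
and the sample lines are made up for illustration; they are not taken 
from the real log, whose exact format is only assumed from the splits 
used in the functions above. Unlike the functions above, it splits 
each separator only once and tests the flags by substring.

```python
def parse_line(line):
    """Classify one log line; return a tagged tuple, or None if it is not of interest."""
    if "SOC_NOT_PUSHED" in line:
        t = line.split()
        # Fields 2 and 3 joined, last field as int, second-to-last field as-is.
        return ("npushed", t[2] + t[3], int(t[-1]), t[-2])
    if "SYNC_PUSH:" in line:
        rel = line.split("SYNC_PUSH:")[1].strip()
        # Assumed layout: "<count> at <when> in <ms> ms <flags>"
        count, rest = rel.split(" at ", 1)
        when, rest = rest.split(" in ", 1)
        ms, flags = rest.split(" ms ", 1)
        return ("pushed", int(count), when, int(ms),
                "PA-SOC_POP" in flags, "CU-SOC_POP" in flags)
    return None


# Hypothetical sample lines, not real log data.
sample = [
    "2017-06-09 00:34:43 node 7 SOC_NOT_PUSHED abc 12",
    "2017-06-09 00:34:44 SYNC_PUSH: 5 at 00:34:44 in 17 ms PA-SOC_POP",
    "unrelated line",
]
results = [parse_line(s) for s in sample]
```

Each line is scanned for the two markers and only matching lines are 
split further, so most of the 52 MB file is never tokenized.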


