Can I speed up this log parsing script further?

Daniel Kozak via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Jun 9 01:15:58 PDT 2017


I would consider using an appender (std.array.appender) for pushed and
npushed. Can you post the file you are running the benchmark on?
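
A minimal sketch of what I mean, assuming the same line format as in your
script below. The names NotPushed, Pushed, ParsedLog and parseLines are made
up for illustration, and the fields are stored as string rather than char[]:
byLine reuses its buffer between iterations, so slices of the line have to be
copied (idup) before they are kept. Whether this is measurably faster on your
52 MB file would need to be checked.

```d
import std.array : appender, split;
import std.conv : text, to;
import std.stdio : File, writeln;
import std.string : indexOf, strip;
import std.typecons : Tuple, tuple;

// Hypothetical aliases for the two record shapes used in the original script.
alias NotPushed = Tuple!(string, int, string);
alias Pushed = Tuple!(int, string, int, bool, bool);

// Hypothetical container so both arrays can be returned together.
struct ParsedLog
{
    NotPushed[] npushed;
    Pushed[] pushed;
}

// Accepts any range of lines (File.byLine, or a string[] for testing).
ParsedLog parseLines(Lines)(Lines lines)
{
    // Appender grows its storage geometrically instead of going through
    // the runtime's array-append path on every ~=.
    auto npushed = appender!(NotPushed[])();
    auto pushed = appender!(Pushed[])();

    foreach (line; lines)
    {
        if (line.indexOf("SOC_NOT_PUSHED") != -1)
        {
            auto tarr = line.split();
            // text() builds a fresh string; idup copies the slice out of
            // the buffer that byLine will overwrite on the next iteration.
            npushed.put(tuple(text(tarr[2], tarr[3]),
                              to!int(tarr[$ - 1]),
                              tarr[$ - 2].idup));
            continue;
        }
        if (line.indexOf("SYNC_PUSH:") != -1)
        {
            auto rel = line.split("SYNC_PUSH:")[1].strip();
            auto att = rel.split(" at ");
            auto ina = att[1].split(" in ");
            auto msa = ina[1].split(" ms ");
            pushed.put(tuple(to!int(att[0]), ina[0].idup, to!int(msa[0]),
                             msa[1].indexOf("PA-SOC_POP") != -1,
                             msa[1].indexOf("CU-SOC_POP") != -1));
        }
    }
    return ParsedLog(npushed.data, pushed.data);
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: ", args[0], " <logfile>");
        return;
    }
    auto result = parseLines(File(args[1], "r").byLine);
    writeln(result.npushed.length);
    writeln(result.pushed.length);
}
```

If the number of matching lines is roughly known in advance, calling
reserve on each appender before the loop may save a few reallocations.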

On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn <
digitalmars-d-learn at puremagic.com> wrote:

> On 09/06/2017 8:34 AM, uncorroded wrote:
>
>> Hi guys,
>>
>> I am a beginner in D. As a project, I converted a log-parsing script in
>> Python which we use at work, to D. This link was helpful - (
>> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I
>> compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release
>> build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
>>
>> The Python script (run with system python, not Pypy) takes 0.75 sec. The
>> D and Python functions are here and on pastebin ( D -
>> https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
>>
>> Basically, I am reading a line and checking for 2 constants. If either one
>> is found, some processing is done on the line and the result is stored in
>> an array for later analysis. I tried reading the file entirely in one go
>> using std.file : readText and using std.algorithm : splitter for lazily
>> splitting on newlines, but there is no difference in speed, so I used the
>> byLine approach mentioned in the linked blog. Is there a better way of
>> doing this in D?
>>
>> Note:
>> I ran GC profiling as mentioned in linked blog. The results were:
>>
>> Number of collections:  3
>>      Total GC prep time:  0 milliseconds
>>      Total mark time:  0 milliseconds
>>      Total sweep time:  0 milliseconds
>>      Total page recovery time:  0 milliseconds
>>      Max Pause Time:  0 milliseconds
>>      Grand total GC time:  2 milliseconds
>> GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms
>>
>> So GC does not seem to be an issue.
>>
>> Here's the D script:
>>
>> import std.stdio;
>> import std.string;
>> import std.array;
>> import std.algorithm : splitter;
>> import std.typecons : tuple, Tuple;
>> import std.conv : to;
>>
>> void read_log(string filename) {
>>      File file = File(filename, "r");
>>      Tuple!(char[], int, char[])[] npushed;
>>      Tuple!(int, char[], int, bool, bool)[] pushed;
>>      foreach (line; file.byLine) {
>>          if (line.indexOf("SOC_NOT_PUSHED") != -1) {
>>              auto tarr = line.split();
>>              npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]),
>>                      tarr[$ - 2]);
>>              continue;
>>          }
>>          if (line.indexOf("SYNC_PUSH:") != -1) {
>>              auto rel = line.split("SYNC_PUSH:")[1].strip();
>>              auto att = rel.split(" at ");
>>              auto ina = att[1].split(" in ");
>>              auto msa = ina[1].split(" ms ");
>>              pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
>>                      msa[1].indexOf("PA-SOC_POP") != -1,
>>                      msa[1].indexOf("CU-SOC_POP") != -1);
>>          }
>>      }
>>      // Using the arrays later on in production script
>>      writeln(npushed.length);
>>      writeln(pushed.length);
>> }
>>
>>
>> Here is the Python function:
>>
>> def read_log(fname):
>>      try:
>>          with open(fname, 'r') as f:
>>              raw = f.read().splitlines()
>>              ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
>>              ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw
>>                    if 'SYNC_PUSH:' in w]
>>              not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
>>              ww = [(int(e.split(' at ')[0]),
>>                     e.split(' at ')[1].split(' in ')[0],
>>                     int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
>>                     set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split()))
>>                    for e in ss]
>>              pushed = [[w[0], w[1], w[2],
>>                         1 if 'PA-SOC_POP' in w[3] else 0,
>>                         1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
>>              return not_pushed, pushed
>>      except:
>>          return []
>>
>>
> The code isn't entirely 1:1. Any use of IO (including stdout via writeln)
> is expensive. Your Python code doesn't write anything to stdout (or perform
> any calls). It would also be good to get the results of dmd -release.
>