Can I speed up this log parsing script further?
Daniel Kozak via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Fri Jun 9 01:15:58 PDT 2017
I would consider using an appender for pushed and npushed. Can you post the
file on which you are running the benchmark?
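
To illustrate the appender suggestion, here is a minimal sketch, assuming
std.array's Appender is used in place of the `~=` appends for npushed.
The helper name collectNotPushed, the NotPushed alias and the sample lines
are made up for illustration and are not part of the original script. The
fields are stored as string (via idup) because byLine reuses its buffer.
Whether this is measurably faster on the 52 MB log is untested.

```d
import std.array : appender, split;
import std.conv : to;
import std.range.primitives : isInputRange;
import std.stdio : writeln;
import std.string : indexOf;
import std.typecons : Tuple, tuple;

// Record layout mirroring the npushed tuple in the quoted script,
// with string fields so each record owns its text.
alias NotPushed = Tuple!(string, int, string);

// Hypothetical helper (not from the original script): collects the
// SOC_NOT_PUSHED records from any range of lines using an Appender.
NotPushed[] collectNotPushed(R)(R lines)
if (isInputRange!R)
{
    auto npushed = appender!(NotPushed[])();
    foreach (line; lines)
    {
        if (line.indexOf("SOC_NOT_PUSHED") == -1)
            continue;
        auto tarr = line.split();
        // idup copies the text; slices of a byLine buffer are
        // overwritten when the next line is read.
        npushed.put(tuple((tarr[2] ~ tarr[3]).idup,
                          to!int(tarr[$ - 1]),
                          tarr[$ - 2].idup));
    }
    return npushed.data;
}

void main()
{
    // Made-up sample lines; the real log format is not shown in the thread.
    auto lines = [
        "2017-06-09 01:15:58 host1 sock7 SOC_NOT_PUSHED queueA 42",
        "2017-06-09 01:15:59 unrelated entry",
    ];
    auto records = collectNotPushed(lines);
    writeln(records.length); // one matching record
}
```

With a real file, the same helper could be called as
collectNotPushed(File(filename).byLine), and the pushed array could be
handled the same way with a second appender.
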
On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn <
digitalmars-d-learn at puremagic.com> wrote:
> On 09/06/2017 8:34 AM, uncorroded wrote:
>
>> Hi guys,
>>
>> I am a beginner in D. As a project, I converted a log-parsing script in
>> Python which we use at work, to D. This link was helpful - (
>> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I
>> compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release
>> build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
>>
>> The Python script (run with system python, not Pypy) takes 0.75 sec. The
>> D and Python functions are here and on pastebin ( D -
>> https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
>>
>> Basically, I am reading each line and checking for 2 constants. If either
>> one is found, some processing is done on the line and the result is stored
>> in an array for later analysis. I tried reading the file entirely in one go
>> using std.file : readText and std.algorithm : splitter to lazily split on
>> newlines, but there was no difference in speed, so I used the byLine
>> approach mentioned in the linked blog. Is there a better way of doing this
>> in D?
>>
>> Note:
>> I ran GC profiling as mentioned in linked blog. The results were:
>>
>> Number of collections: 3
>> Total GC prep time: 0 milliseconds
>> Total mark time: 0 milliseconds
>> Total sweep time: 0 milliseconds
>> Total page recovery time: 0 milliseconds
>> Max Pause Time: 0 milliseconds
>> Grand total GC time: 2 milliseconds
>> GC summary: 12 MB, 3 GC 2 ms, Pauses 0 ms < 0 ms
>>
>> So GC does not seem to be an issue.
>>
>> Here's the D script:
>>
>> import std.stdio;
>> import std.string;
>> import std.array;
>> import std.algorithm : splitter;
>> import std.typecons : tuple, Tuple;
>> import std.conv : to;
>>
>> void read_log(string filename) {
>>     File file = File(filename, "r");
>>     Tuple!(char[], int, char[])[] npushed;
>>     Tuple!(int, char[], int, bool, bool)[] pushed;
>>     foreach (line; file.byLine) {
>>         if (line.indexOf("SOC_NOT_PUSHED") != -1) {
>>             auto tarr = line.split();
>>             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]),
>>                              tarr[$ - 2]);
>>             continue;
>>         }
>>         if (line.indexOf("SYNC_PUSH:") != -1) {
>>             auto rel = line.split("SYNC_PUSH:")[1].strip();
>>             auto att = rel.split(" at ");
>>             auto ina = att[1].split(" in ");
>>             auto msa = ina[1].split(" ms ");
>>             pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
>>                             msa[1].indexOf("PA-SOC_POP") != -1,
>>                             msa[1].indexOf("CU-SOC_POP") != -1);
>>         }
>>     }
>>     // Using the arrays later on in production script
>>     writeln(npushed.length);
>>     writeln(pushed.length);
>> }
>>
>>
>> Here is Python function:
>>
>> def read_log(fname):
>>     try:
>>         with open(fname, 'r') as f:
>>             raw = f.read().splitlines()
>>         ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
>>         ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw
>>               if 'SYNC_PUSH:' in w]
>>         not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
>>         ww = [(int(e.split(' at ')[0]),
>>                e.split(' at ')[1].split(' in ')[0],
>>                int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
>>                set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split()))
>>               for e in ss]
>>         pushed = [[w[0], w[1], w[2],
>>                    1 if 'PA-SOC_POP' in w[3] else 0,
>>                    1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
>>         return not_pushed, pushed
>>     except:
>>         return []
>>
>>
> The code isn't entirely 1:1. Any use of IO (including stdout via writeln)
> is expensive. Your Python code doesn't write anything to stdout (or perform
> any calls). It would also be good to get the results of dmd -release.
>