<div dir="ltr">I would consider using appender for pushed and npushed. Can you post the file on which you are running the benchmark?</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn <span dir="ltr"><<a href="mailto:digitalmars-d-learn@puremagic.com" target="_blank">digitalmars-d-learn@puremagic.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 09/06/2017 8:34 AM, uncorroded wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi guys,<br>

<br>

I am a beginner in D. As a project, I converted a Python log-parsing script that we use at work to D. This link was helpful: ( <a href="https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/" rel="noreferrer" target="_blank">https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/</a> ). I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not a release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.<br>

<br>

The Python script (run with system python, not Pypy) takes 0.75 sec. The D and Python functions are here and on pastebin ( D - <a href="https://pastebin.com/SeUR3wFP" rel="noreferrer" target="_blank">https://pastebin.com/SeUR3wFP</a> , Python - <a href="https://pastebin.com/F5JbfBmE" rel="noreferrer" target="_blank">https://pastebin.com/F5JbfBmE</a> ).<br>

<br>

Basically, I am reading a line and checking for 2 constants. If either one is found, the line is processed and the result is stored in an array for later analysis. I tried reading the entire file in one go using std.file : readText and splitting it lazily on newlines using std.algorithm : splitter, but there was no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?<br>

<br>

Note:<br>

I ran GC profiling as mentioned in linked blog. The results were:<br>

<br>

Number of collections:  3<br>

     Total GC prep time:  0 milliseconds<br>

     Total mark time:  0 milliseconds<br>

     Total sweep time:  0 milliseconds<br>

     Total page recovery time:  0 milliseconds<br>

     Max Pause Time:  0 milliseconds<br>

     Grand total GC time:  2 milliseconds<br>

GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms<br>

<br>

So GC does not seem to be an issue.<br>

<br>

Here's the D script:<br>

<br>

import std.stdio;<br>

import std.string;<br>

import std.array;<br>

import std.algorithm : splitter;<br>

import std.typecons : tuple, Tuple;<br>

import std.conv : to;<br>

<br>

void read_log(string filename) {<br>

     File file = File(filename, "r");<br>

     Tuple!(char[], int, char[])[] npushed;<br>

     Tuple!(int, char[], int, bool, bool)[] pushed;<br>

     foreach (line; file.byLine) {<br>

         if (line.indexOf("SOC_NOT_PUSHED") != -1) {<br>

             auto tarr = line.split();<br>

             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2]);<br>

             continue;<br>

         }<br>

         if (line.indexOf("SYNC_PUSH:") != -1) {<br>

             auto rel = line.split("SYNC_PUSH:")[1].strip();<br>

             auto att = rel.split(" at ");<br>

             auto ina = att[1].split(" in ");<br>

             auto msa = ina[1].split(" ms ");<br>

             pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),<br>

                     msa[1].indexOf("PA-SOC_POP") != -1, msa[1].indexOf("CU-SOC_POP") != -1);<br>

         }<br>

     }<br>

     // Using the arrays later on in production script<br>

     writeln(npushed.length);<br>

     writeln(pushed.length);<br>

}<br>

<br>

<br>

Here is Python function:<br>

<br>

def read_log(fname):<br>

     try:<br>

         with open(fname, 'r') as f:<br>

             raw = f.read().splitlines()<br>

             ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]<br>

             ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if 'SYNC_PUSH:' in w]<br>

             not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]<br>

             ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split(' in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]), set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e in ss]<br>

             pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]<br>

             return not_pushed, pushed<br>

     except:<br>

         return []<br>

<br>

</blockquote>

<br></div></div>

The code isn't entirely 1:1. Any usage of IO (including stdout via writeln) is expensive. Your Python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.<br>

</blockquote></div><br></div>
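<div dir="ltr">For reference, the appender suggestion at the top of this message could look roughly like the sketch below. It is a minimal, untimed sketch rather than a measured improvement: the function name readLogWithAppender is made up for illustration, and the tuple types and parsing steps are copied from the quoted read_log. It also copies the two stored slices with .dup, on the understanding that byLine reuses its line buffer between iterations, so slices of line that are kept past the current iteration would otherwise be overwritten.</div>

```d
import std.array : appender, split;
import std.conv : to;
import std.stdio : File, writeln;
import std.string : indexOf, strip;
import std.typecons : Tuple, tuple;

// Hypothetical variant of read_log from the quoted post; only the
// storage strategy changes, the parsing steps are the same.
void readLogWithAppender(string filename)
{
    auto npushed = appender!(Tuple!(char[], int, char[])[])();
    auto pushed = appender!(Tuple!(int, char[], int, bool, bool)[])();

    foreach (line; File(filename, "r").byLine)
    {
        if (line.indexOf("SOC_NOT_PUSHED") != -1)
        {
            auto tarr = line.split();
            // tarr[2] ~ tarr[3] allocates a fresh array; tarr[$ - 2] is a
            // slice of byLine's reused buffer, so it is copied with .dup.
            npushed.put(tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2].dup));
            continue;
        }
        if (line.indexOf("SYNC_PUSH:") != -1)
        {
            auto rel = line.split("SYNC_PUSH:")[1].strip();
            auto att = rel.split(" at ");
            auto ina = att[1].split(" in ");
            auto msa = ina[1].split(" ms ");
            // ina[0] is also a slice of the reused buffer, hence .dup.
            pushed.put(tuple(to!int(att[0]), ina[0].dup, to!int(msa[0]),
                    msa[1].indexOf("PA-SOC_POP") != -1,
                    msa[1].indexOf("CU-SOC_POP") != -1));
        }
    }

    // .data exposes the accumulated array for later analysis.
    writeln(npushed.data.length);
    writeln(pushed.data.length);
}

void main(string[] args)
{
    // Assumed usage: pass the log file path as the only argument.
    if (args.length == 2)
        readLogWithAppender(args[1]);
}
```

<div dir="ltr">Whether this is faster than ~= on the 52 MB log would need to be measured with the same dmd and ldc builds used above.</div>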