Regex performance

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Mar 26 12:19:29 PDT 2012


On 26.03.2012 20:00, Jay Norwood wrote:
> On Sunday, 25 March 2012 at 16:31:40 UTC, James Blewitt wrote:
>> I'm currently trying to figure out what I'm doing differently in my
>> original program. At this point I am assuming that I have an error in
>> my code which causes the D program to do much more work than its Ruby
>> counterpart (although I am currently unable to find it).
>>
>> When I know more I will let you know.
>>
>> James Blewitt
>
> That was the same type of thing I was seeing with very simple regex
> expressions. The regex was on the order of 30 times slower than hand
> code for finding words in strings.

This is a sad fact of life: a general tool can't beat highly
specialized code. Ideally it can be on par, though. Even in the best
case ctRegex has to do a lot of things that a simple == '\n' doesn't,
like storing the boundaries of each match. That's something to keep in
mind.
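
To make the difference concrete, here is a minimal sketch of the two
approaches. It is not the code from the wc_test repository; the function
names are made up for illustration, "word" is assumed to mean a run of
ASCII letters, and matchAll assumes a reasonably current Phobos (older
releases would use match with the "g" flag instead).

```d
import std.ascii : isAlpha;
import std.range : walkLength;
import std.regex : matchAll, regex;

// Hand-written scan (hypothetical helper): counts runs of ASCII letters.
// It keeps one flag and one counter and never records where a word is.
size_t countWordsByHand(string text)
{
    size_t count = 0;
    bool inWord = false;
    foreach (char c; text)
    {
        if (isAlpha(c))
        {
            if (!inWord)
                ++count;      // first letter of a new word
            inWord = true;
        }
        else
        {
            inWord = false;
        }
    }
    return count;
}

// Regex scan (hypothetical helper): every match produced by the engine
// also carries its start and end position, whether or not we use them.
size_t countWordsByRegex(string text)
{
    auto re = regex(`[a-zA-Z]+`);
    return matchAll(text, re).walkLength;
}
```

Both functions return the same count for ASCII input; the regex version
simply does extra bookkeeping per match that the hand-written loop skips.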

By the way, regex does a fine job on (semi-)fixed strings of length >=
3-4, often easily beating plain find/indexOf. I haven't tested the
Boyer-Moore version of find; that should be faster than regex for sure.
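
For anyone who wants to measure this, the three candidates could be
lined up roughly as follows. This is only a sketch: the function names
are invented here, the needle is assumed to contain no regex
metacharacters, and no timings are implied. Each function returns the
offset of the first occurrence, or -1 if there is none.

```d
import std.algorithm.searching : boyerMooreFinder;
import std.regex : matchFirst, regex;
import std.string : indexOf;

// Plain substring search.
ptrdiff_t viaIndexOf(string haystack, string needle)
{
    return haystack.indexOf(needle);
}

// Regex search. Assumes needle has no metacharacters, so it can be
// used as the pattern directly.
ptrdiff_t viaRegex(string haystack, string needle)
{
    auto m = matchFirst(haystack, regex(needle));
    // m.pre is the part of the input before the match.
    return m.empty ? -1 : cast(ptrdiff_t) m.pre.length;
}

// Boyer-Moore search via the finder object from std.algorithm.
ptrdiff_t viaBoyerMoore(string haystack, string needle)
{
    auto finder = boyerMooreFinder(needle);
    auto rest = finder.beFound(haystack);   // empty slice when not found
    return rest.length == 0
        ? -1
        : cast(ptrdiff_t)(haystack.length - rest.length);
}
```

In a real benchmark the regex and the finder would be built once outside
the timed loop, since constructing them is part of the cost.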

> The ctRegex is on the order of 13x
> slower than hand code. The times below are from parallel processing on
> 100MB of text files, just finding the word boundaries. I uploaded those
> tests to https://github.com/jnorwood/wc_test
> I believe in all these cases the files are being cached by the os, since
> I was able to see the same measurements from a ramdisk done with imdisk.
> So in these cases the file reads are about 30ms of the result. The rest
> is cpu time, finding the words.
>
> This is with default 7 threads
>
> finished wcp_wcPointer! time: 98 ms
> finished wcp_wcCtRegex! time: 1300 ms
> finished wcp_wcRegex! time: 2946 ms
> finished wcp_wcRegex2! time: 2687 ms
> finished wcp_wcSlices! time: 157 ms
> finished wcp_wcStdAscii! time: 225 ms
>
>
> This is processing the same data with 1 thread
>
> finished wcp_wcPointer! time: 188 ms
> finished wcp_wcCtRegex! time: 2219 ms
> finished wcp_wcRegex! time: 5951 ms
> finished wcp_wcRegex2! time: 5502 ms
> finished wcp_wcSlices! time: 318 ms
> finished wcp_wcStdAscii! time: 446 ms
>
> And this is processing the same data with 13 threads
>
> finished wcp_wcPointer! time: 93 ms
> finished wcp_wcCtRegex! time: 1110 ms
> finished wcp_wcRegex! time: 2531 ms
> finished wcp_wcRegex2! time: 2321 ms
> finished wcp_wcSlices! time: 136 ms
> finished wcp_wcStdAscii! time: 200 ms
>
> The only change in the program that is uploaded is to add the suggested
> defaultPoolThreads(13);
> at the start of main to change the ThreadPool default thread count.
>
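
For readers who want to try the thread-count change quoted above, it
could look something like the sketch below. The helper name and the
whitespace-based word count are assumptions made for illustration, not
the code from the repository.

```d
import std.array : split;
import std.parallelism : defaultPoolThreads, taskPool;

// Hypothetical helper: sets the default pool size, then counts
// whitespace-separated words in each text on the default task pool.
size_t countWithPoolSize(uint workers, string[] texts)
{
    // Only takes effect if called before taskPool is first used.
    defaultPoolThreads(workers);

    auto counts = new size_t[texts.length];
    foreach (i, text; taskPool.parallel(texts))
        counts[i] = text.split.length;   // split() with no separator splits on whitespace

    size_t total = 0;
    foreach (c; counts)
        total += c;
    return total;
}
```

Because the setting is read when the default pool is first created, the
call belongs at the very start of main, as in the quoted message.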


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list