Bad performance of simple regular expression - why??

Mon Feb 5 15:47:53 PST 2007

MarcL wrote:
> hi everyone, 
> 
> first of all i want to say that i'm not a professional programmer  - so the problem i have might be 
> caused by my own  lack of experience. Nevertheless i want to describe it, hoping that someone in this 
> group might be able to tell me what's wrong. I am a molecular biologist and i often have to deal with 
> larger amounts of DNA and protein sequence data (which is in principle text). I am mainly using Perl to 
> process these DNA files, and Perl generally performs very well (regular expressions are actually the killer 
> tool for working with DNA sequences). Unfortunately not everything in Perl is as fast as the regular 
> expressions and so i started trying to learn a language that can be compiled to produce fast 
> executables : C++ - and went crazy because everything is so complicated. All the ugly details that Perl 
> takes care of for the user have to be organized manually and that really gave me the creeps. Then i 
> learned about D and it sounded like the solution to my problem: A compilable language that supports 
> associative arrays, garbage collection and (most importantly for me) regular expressions! Great! I 
> experimented a bit and actually managed to write a small working programlet directly. I was delighted! 
> But now comes the reason why i write all this: Being enthusiastic about this new nice language i started 
> to write a module that should implement basic functions for working with the most common DNA 
> sequence file formats. To parse these files i planned to use regular expressions. So far so good. When 
> testing my module with a small DNA file everything seemed OK - then i tried to use it to parse a more 
> real world-sized DNA file (~155000 characters of DNA sequence plus about the same amount of textual 
> information) and had to find out that a simple std.regexp.split call took about 59 seconds!!! I could not 
> believe it and wrote a little Perl script doing the same thing and it took less than 1s!! What's wrong 
> here??? This can't really be true, can it? Is the implementation of regular expressions in the phobos 
> library so bad or preliminary that it is so much less performant than the Perl regex engine? It's actually 
> not usable for me like this (which is a sad thing because i really like the other features of D and would 
> like to use it). Am i making mistakes or do i simply have to wait for a better version of phobos? 
> 
> Any comments or suggestions would be great. 
> 
> cheers 
> 

Like Bill said, a sample would be helpful.

The most common mistake is to not precompile the regexp.
For instance, calling std.regexp.search repeatedly with the same pattern 
string is not optimal.

If you're going to be using the same pattern multiple times, then you 
should create a RegExp object once, and then apply it multiple times 
using its search, match, etc. methods.
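
To make the compile-once idea concrete, here is a minimal sketch. It is 
written in Python rather than D because no sample was posted, so it only 
shows the shape of the approach, not the std.regexp API. The pattern (an 
EcoRI site) and the function names are made up for illustration. The 
compiled SITE object stands in for the RegExp object you would create 
once in D.

```python
import re

# Compile once, outside any loop. The pattern is a hypothetical example
# (an EcoRI restriction site), not taken from the original post.
SITE = re.compile(r"GAATTC")


def count_sites(sequences):
    """Count non-overlapping matches in each sequence, reusing SITE."""
    return [len(SITE.findall(seq)) for seq in sequences]


def split_on_sites(sequence):
    """Split one sequence at every match, again reusing SITE."""
    return SITE.split(sequence)


if __name__ == "__main__":
    seqs = ["AAGAATTCGG", "GAATTCGAATTC", "ACGT"]
    print(count_sites(seqs))
    print(split_on_sites("AAGAATTCGG"))
```

Python's re module also caches recently used patterns internally, so the 
gain there is smaller than it may be in other libraries. The point is 
only that the pattern is built once and then applied many times.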

Regardless, Perl was basically created as a convenient way to use 
regular expressions, so its implementation could very likely be more 
efficient than D's.

Do Perl regexps handle Unicode/UTF-8 properly these days?
(Actually, do D's for that matter?)

--bb