Bad performance of simple regular expression - why??

Mon Feb 5 22:21:08 PST 2007

Bill Lear Wrote:

> MarcL <lohse at mpimp-golm.mpg.de> writes:
> 
> > .... Am i making mistakes or do i simply have to wait for a better
> > version of phobos?
> > 
> > Any comments or suggestions would be great. 
> 
> Please post a representative sample so we can help you, and do please
> try to post lines < 80 characters long, if possible.
> 
> 
> Bill
> --
> Bill Lear
> r * e * @ * o * y * a * c * m
> * a * l * z * p * r * . * o *

First of all, thanks a lot to everyone who answered my post.
I did not expect to receive feedback so quickly.

This is the piece of code that runs so slowly:

<code>

	char[] raw_sequence, stripped_sequence, header_segment, feature_segment;
	char[][] gb_segments, seq_segments;

	// read the whole GenBank file into memory
	raw_sequence = cast(char[])std.file.read("/Users/marc/Desktop/sequences.gb");
	// split the file at each "FEATURES" keyword
	gb_segments = std.regexp.split(raw_sequence, "FEATURES", "");
	writefln("split into ", gb_segments.length, " segments");
	// split the part after "FEATURES" at the "ORIGIN" keyword
	seq_segments = std.regexp.split(gb_segments[1], "ORIGIN", "");
	header_segment = gb_segments[0];
	// drop the rest of the FEATURES header line
	feature_segment = std.regexp.sub(seq_segments[0], "^.*location/qualifiers\n", "", "i");
	// remove digits, whitespace and slashes, then convert to upper case
	stripped_sequence = std.string.toupper(std.regexp.sub(seq_segments[1], "[0-9\t \n/]", "", "g"));

</code>

I did not precompile the regular expressions, so maybe this
is one of the reasons why it's slow, but the code doesn't contain any
loops, so each expression is only used once. If anyone wants to try it
on their own machine, I have attached the "sequences.gb" file.
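
For comparison, a precompiled variant of the last substitution could be
sketched roughly as follows. It assumes the D1 Phobos class
std.regexp.RegExp and its replace() method; the helper name
stripSequence and the sample string are made up for illustration and
are not part of the program above.

```d
import std.regexp;
import std.string;

// Hypothetical helper, not part of the original program: removes digits,
// whitespace and slashes from an ORIGIN block and upper-cases the rest.
// The RegExp object is passed in, so the caller compiles it only once.
char[] stripSequence(RegExp cleaner, char[] origin_block)
{
	return std.string.toupper(cleaner.replace(origin_block, ""));
}

void main()
{
	// compiled once, reusable for any number of ORIGIN blocks
	auto cleaner = new RegExp("[0-9\t \n/]", "g");

	// made-up sample in the style of a GenBank ORIGIN block
	char[] sample = "        1 gatcctccat atacaacggt\n//\n".dup;
	char[] cleaned = stripSequence(cleaner, sample);
	assert(cleaned == "GATCCTCCATATACAACGGT");
}
```

Since each expression is only applied once here, this mainly helps to
check whether compiling the pattern or running the match costs the time.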

I have tested the code on a PowerBook (G4, 1.5 GHz) using
gdc and also on a Linux machine (1.8 GHz Pentium M) using
the original dmd compiler, but the results were the same