std.regexp.split very slow - a bug? - more absurdities

Wed Feb 7 04:51:30 PST 2007

std.regexp.split very slow -> bug?

> hi again,
> two days ago i had posted that regular expressions
> were running very slow - here comes a more detailed
> description of that problem.
> 
> It seems that the std.regexp.split function has a 
> problem (or i am a moron and use it in a wrong way,
> but, as i mentioned before i am a biologist and not
> a professional programmer and i'd be happy about 
> help if case 2 is true).
> 
> The thing i want to do is read in a DNA sequence file
> that also contains information about the genes found
> in the raw DNA sequence. The file is in a commonly
> used format called GenBank and the different data segments
> are separated by keywords which should make it easy 
> to use std.regexp.split to dissect it. The following code 
> is just an example that tries to split a GenBank file 
> at the "ORIGIN" keyword. The file has a size of 323 KB
> and if you want to reproduce my "experiment" you 
> can obtain it here:
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=76559634
> 
> 
> // regex.d
> import std.stdio;
> import std.regexp;
> import std.file;
> 
> void main()
> {
> 	char [] gb_data;
> 	char [][] segments;
> 	gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
> 	
> 	segments = split(gb_data, "ORIGIN", "");
> 	
> 	writefln("seq segments: ", segments.length);
> 	
> }
> 
> The following happens when i run it:
> marc@marclinux:~/Desktop> time ./regex
> seq segments: 197549
> 
> real    12m52.420s
> user    12m48.132s
> sys     0m2.812s
> 
> The execution takes ALMOST THIRTEEN MINUTES!! which
> made me fall from my chair. After having climbed back
> on it i tried the same in perl:
> 
> #regex.pl
> #!/usr/bin/perl -w
> 
> use strict;
> 
> my $gb_data = "";
> my @segments;
> 
> open FILE, "/home/marc/Desktop/tobacco.gb";
> while (<FILE>)
> {
> 	$gb_data .= $_;
> }
> close FILE;
> 
> @segments = split /ORIGIN/, $gb_data;
> 
> print "seq segment: ".length($segments[1])."\n"
> 
> output:
> marc@marclinux:~/Desktop> time ./regex.pl
> seq segment: 197549
> 
> real    0m0.034s
> user    0m0.024s
> sys     0m0.012s
> 
> ....well it took 34ms to do the same thing. I could
> not believe it and rewrote it using std.regexp.search 
> instead of std.regexp.split:
> 
> //regex2.d
> import std.stdio;
> import std.regexp;
> import std.file;
> 
> void main()
> {
> 	char [] gb_data;
> 	
> 	gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
> 	
> 	auto m = search(gb_data, "ORIGIN", "");
> 	
> 	writefln("seq segment: ", m.post.length);
> 	
> }
> 
> output:
> marc@marclinux:~/Desktop> time ./regex2
> seq segment: 197549
> 
> real    0m0.025s
> user    0m0.024s
> sys     0m0.000s
> 
> AHA. So D is faster than Perl - it took 25ms, but the
> split function is obviously *not suitable* for splitting
> a long text at a simple, single word (actually this does not
> even make use of complicated regular expression syntax).
> 
> Becoming curious i rewrote the thing again, now using
> std.string.find:
> 
> //find.d
> import std.stdio;
> import std.string;
> import std.file;
> 
> void main()
> {
> 	char [] gb_data, seq_segment, pattern;
> 	long pos;
> 	
> 	pattern = "ORIGIN";
> 	
> 	gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
> 	
> 	pos = find(gb_data, pattern);
> 	seq_segment = gb_data[(pos+pattern.length)..gb_data.length];
> 	
> 	writefln("seq segment: ",seq_segment.length);
> 	//writefln("SEQ segment", m.post);
> 	
> }
> 
> output:
> marc@marclinux:~/Desktop> time ./find
> seq segment: 197549
> 
> real    0m0.005s
> user    0m0.000s
> sys     0m0.004s
> marc@marclinux:~/Desktop> 
> 
> whoa! Now it only takes 5ms. So my problem seems to be
> solved - i will use either the search or the find variant.
> 
> Interestingly, when splitting the same text at newlines
> the execution just takes about 13ms. I have no idea why
> the split function behaves so differently and this is also
> my question for the experts.
> 
> cheers
> ml
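
For the split case quoted above, one workaround i can imagine is to
skip regular expressions entirely, since ORIGIN is a literal keyword.
A minimal sketch, assuming the plain-delimiter split in std.string
(not the one in std.regexp) does what its documentation says. The
helper name splitAtKeyword and the sample text are made up for
illustration, and i have not timed this on the tobacco file:

```d
import std.stdio;
import std.string;

// Hypothetical helper: split text at every occurrence of a literal
// keyword, without going through the regular expression engine.
char[][] splitAtKeyword(char[] text, char[] keyword)
{
    // std.string.split with a plain delimiter string
    return split(text, keyword);
}

void main()
{
    // Stand-in for the GenBank file contents; the real program would
    // read the file with std.file.read as in the examples above.
    char[] gb_data = "LOCUS test\nORIGIN\n        1 acgtacgt\n//\n".dup;

    char[][] segments = splitAtKeyword(gb_data, "ORIGIN".dup);

    writefln("segments: %s", segments.length);
}
```

If the keyword occurs only once in the file, segments[1] should hold
everything after it, which is what m.post gives in regex2.d.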

The same funny thing happens when using
std.regexp.sub. In the following line 
i want to remove all non-DNA characters 
from the sequence segment i read in, using sub:

stripped_sequence = sub(seq_segment, "[0-9\n\t/ ]", "", "g");

output:
time ./bio_test
real    0m17.154s
user    0m16.737s
sys     0m0.032s

Again this expression takes unexpectedly
long to execute: about 17s on my Pentium M 1.8 GHz.

When i reformulate the task avoiding regular
expressions:

char[] clean_seq = "";
foreach (char N; stripped_sequence)
{
	if ((N == '0') || (N == '1') || (N == '2') || (N == '3') ||
	    (N == '4') || (N == '5') || (N == '6') || (N == '7') ||
	    (N == '8') || (N == '9') || (N == ' ') || (N == '\n') ||
	    (N == '\t') || (N == '/')) continue;
	clean_seq ~= N;
}

it looks (and is) very ugly but it runs 
faster:

output:
time ./bio_test
real    0m0.413s
user    0m0.040s
sys     0m0.004s

Note that the actual computation time is only
about 40ms (the real time is longer because
the sequence and other info is printed to STDOUT).
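
The loop above could probably be made less ugly. A minimal sketch of
the same filter, assuming the only characters to drop are digits,
blanks, tabs, newlines and '/'. The function name stripNonDna and the
sample line are my own inventions, and allocating the result buffer
once instead of appending with ~= is just a guess at what might help,
not something i have measured:

```d
import std.stdio;

// Hypothetical helper: return a copy of seq without digits,
// blanks, tabs, newlines and '/'.
char[] stripNonDna(char[] seq)
{
    // Allocate once; the result can never be longer than the input.
    char[] result = new char[seq.length];
    size_t n = 0;

    foreach (char c; seq)
    {
        if (c >= '0' && c <= '9')
            continue;   // position numbers
        if (c == ' ' || c == '\n' || c == '\t' || c == '/')
            continue;   // whitespace and the "//" end marker
        result[n++] = c;
    }

    // Keep only the part that was actually filled.
    return result[0 .. n];
}

void main()
{
    // Made-up sample in the style of a GenBank sequence line.
    char[] sample = "        1 acgtac gtacgt\n//\n".dup;

    writefln("%s", stripNonDna(sample));
}
```

The range check on the digits replaces the ten separate comparisons
and should behave the same for ASCII input.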

Again, my question: what's wrong here? I used
regexp.sub exactly the way it's used in
public example code snippets. Have other people
also had these problems, or am i the first to
use the regular expressions of D on longer 
text strings? (Although i wouldn't think that
323KB of text is really long.) Any help and/or
suggestions|comments would be extremely welcome!