std.regexp.split very slow - a bug?
Marc Lohse
lohse at mpimp-golm.mpg.de
Wed Feb 7 03:08:05 PST 2007
hi again,
two days ago i posted that regular expressions
were running very slowly - here comes a more detailed
description of that problem.
It seems that the std.regexp.split function has a
problem (or i am a moron and use it in a wrong way,
but, as i mentioned before i am a biologist and not
a professional programmer and i'd be happy about
help if the latter is the case).
The thing i want to do is read in a DNA sequence file
that also contains information about the genes found
in the raw DNA sequence. The file is in a commonly
used format called GenBank and the different data segments
are separated by keywords which should make it easy
to use std.regexp.split to dissect it. The following code
is just an example that tries to split a GenBank file
at the "ORIGIN" keyword. The file has a size of 323 KB
and if you want to reproduce my "experiment" you
can obtain it here:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=76559634
// regex.d
import std.stdio;
import std.regexp;
import std.file;

void main()
{
    char [] gb_data;
    char [][] segments;
    gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
    segments = split(gb_data, "ORIGIN", "");
    writefln("seq segments: ", segments.length);
}
The following happens when i run it:
marc@marclinux:~/Desktop> time ./regex
seq segments: 197549
real 12m52.420s
user 12m48.132s
sys 0m2.812s
The execution takes ALMOST THIRTEEN MINUTES!! which
made me fall from my chair. After having climbed back
on it i tried the same in perl:
#regex.pl
#!/usr/bin/perl -w
use strict;

my $gb_data = "";
my @segments;
open FILE, "/home/marc/Desktop/tobacco.gb";
while (<FILE>)
{
    $gb_data .= $_;
}
close FILE;
@segments = split /ORIGIN/, $gb_data;
print "seq segment: ".length($segments[1])."\n"
output:
marc@marclinux:~/Desktop> time ./regex.pl
seq segment: 197549
real 0m0.034s
user 0m0.024s
sys 0m0.012s
....well it took 34ms to do the same thing. I could
not believe it and rewrote it using std.regexp.search
instead of std.regexp.split:
//regex2.d
import std.stdio;
import std.regexp;
import std.file;

void main()
{
    char [] gb_data;
    gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
    auto m = search(gb_data, "ORIGIN", "");
    writefln("seq segment: ", m.post.length);
}
output:
marc@marclinux:~/Desktop> time ./regex2
seq segment: 197549
real 0m0.025s
user 0m0.024s
sys 0m0.000s
AHA. So D is faster than Perl - it took 25ms, but the
split function is obviously *not suitable* for splitting
a long text at a simple, single word (actually this does not
even make use of complicated regular expression syntax).
Becoming curious i rewrote the thing again, now using
std.string.find:
//find.d
import std.stdio;
import std.string;
import std.file;

void main()
{
    char [] gb_data, seq_segment, pattern;
    long pos;
    pattern = "ORIGIN";
    gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
    pos = find(gb_data, pattern);
    seq_segment = gb_data[(pos+pattern.length)..gb_data.length];
    writefln("seq segment: ",seq_segment.length);
    //writefln("SEQ segment", m.post);
}
output:
marc@marclinux:~/Desktop> time ./find
seq segment: 197549
real 0m0.005s
user 0m0.000s
sys 0m0.004s
marc@marclinux:~/Desktop>
whoa! Now it only takes 5ms. So my problem seems to be
solved - i will use either the search or the find variant.
Interestingly, when splitting the same text at newlines
the execution just takes about 13ms. I have no idea why
the split function behaves so differently and this is also
my question for the experts.
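For what it's worth, here is a minimal sketch of how the find variant could be extended to split at every occurrence of a literal keyword instead of only the first one. It is only an illustration under a few assumptions: it is written against present-day Phobos, where std.string.indexOf takes the place of the std.string.find call used in find.d; splitAtKeyword is a made-up helper name, not a library function; and the sample text is invented rather than taken from the GenBank file.

```d
// splitfind.d (hypothetical example, not part of Phobos)
import std.stdio : writefln;
import std.string : indexOf;

// Split text at every occurrence of a literal keyword by searching
// repeatedly, without involving the regexp engine at all.
const(char)[][] splitAtKeyword(const(char)[] text, const(char)[] keyword)
{
    // An empty keyword would never advance the search position,
    // so return the text unsplit in that case.
    if (keyword.length == 0)
        return [text];

    const(char)[][] parts;
    size_t start = 0;
    while (true)
    {
        // indexOf returns the offset of the first match, or -1 if none
        auto rel = indexOf(text[start .. $], keyword);
        if (rel == -1)
            break;
        parts ~= text[start .. start + rel];  // piece before the keyword
        start += rel + keyword.length;        // continue after the keyword
    }
    parts ~= text[start .. $];                // remainder after the last match
    return parts;
}

void main()
{
    // invented sample text, assumption for illustration only
    auto sample = "LOCUS x ORIGIN acgt ORIGIN ttga";
    auto segments = splitAtKeyword(sample, "ORIGIN");
    writefln("segments: %d", segments.length);
}
```

Each piece is a slice of the original text, so nothing is copied apart from the array of slices itself. Whether this would also be fast on the 323 KB file has not been measured here.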
cheers
ml