code review: splitIds from DConf '22 day 3: saving a sort and "getting performance"
kdevel
kdevel at vogtner.de
Thu Aug 4 13:18:40 UTC 2022
At DConf '22 day 3 Robert Schadek presented at around 07:22:00 in
the YT video the function `splitIds`. Given an HTML page from
bugzilla containing a list of issues `splitIds` aims at
extracting all bug-ids referenced within a specific url context:
```
long [] splitIds (string page)
{
enum re = ctRegex!(`"show_bug.cgi\?id=[0-9]+"`);
auto m = page.matchAll (re);
return m
.filter!(it => it.length > 0) // what is
this?
.map!(it => it.front) // whole
match, it[0]
.map!(it => it.find!(isNumber)) // searches
fist number
.map!(it => it.until!(it => !it.isNumber ())) // last number
.filter!(it => !it.empty) // again an
empty check??? why?
.map!(it => it.to!long ())
.uniq // .sort is
missing. IMHO saving at the wrong things?
.array;
}
```
`m` contains all matches. It is a "list of lists" as one would
say in Perl. The "inner lists" contains as first element
("`front`") the string which matches the whole pattern. So my
first question is:
What is the purpose of the first filter call? Since the element
of `m` is a match it cannot have a length of 0.
Then the code – and this took me some time to grasp – zooms into
the match and tries to match the subexpression which starts with
a number and ends with a number. But why does the author not use
regular expression subexpression matching [1]? Is that really
such an arcane discipline?
Next comes a filter call that seems to defend against the empty
string. But there will be no empty string because `[0-9]+` only
matches if there is at least one number.
Then there is something missing: Before the `uniq` call there is
the assumption that bugzilla ordered the bugs already. What
puzzles me is that on the one hand the author puts two find calls
into the chain to guard against empty matches which cannot (?)
happen but on the other hand ignores that the data may not be
sorted. I mean the sortedness is nothing that the author of the
function could control.
This is what my version of `splitIds` looks like:
```
long [] my_splitIds (string page)
{
enum re = ctRegex!(`"show_bug.cgi\?id=([0-9]+)"`);
auto m = page.matchAll (re);
return m
.map!(it => it[1]) // take the subexpression
.map!(it => it.to!long ())
.array
.sort
.uniq
.array
;
}
```
For those who ask how one would write such a function in Perl:
```
sub splitIds ($)
{
my $page = shift;
my @ids = $page =~ m/"show_bug.cgi\?id=([0-9]+)"/g;
my %hash = map { $_ => undef } @ids; # perl has no builtin uniq
sort { $a <=> $b } keys %hash; # force numeric sorting
with <=>
}
```
[1] https://dlang.org/phobos/std_regex.html
More information about the Digitalmars-d-learn
mailing list