Questions about builtin RegExp

Mon Feb 20 16:10:40 PST 2006

Regan Heath wrote:
> Walter Bright <newshound at digitalmars.com> wrote:
>> "Regan Heath" <regan at netwin.co.nz> wrote 
>>
>>> Here's how I'd do it:
>>
>> Yours is a lot of code to do what a regex does.
> 
> This is true, though my code is likely faster.
> 
>> Now recognize a url <g>.
> 
> Nah. You've made your point.. in fact I was secretly trying to help. <g>

DISCLAIMER INSERTED WHEN PROOFREADING:

I'm not attacking you, or anybody's opinion here, I'm just thinking 
aloud -- mostly to sort out my own opinion on this issue!  :-)

> Regex is a good general purpose string parsing facility. I personally 
> find composing a regex can be complicated, likely it's easier with 
> practice. A custom piece of code is probably faster and I find it 
> easier to tweak. In the end, unless it was performance critical or has 
> resisted my initial efforts at composing a regex, I'd probably use a 
> regex.

Heh, interestingly, I have the same feeling about all three!! (I.e. 
composing nontrivial regexes is hard, custom code is faster and easier 
to tweak.)

But I can't help wondering whether I'm wrong on all three!

In other words, writing custom code to do the same as a nontrivial 
regexp might feel like the easier choice at the outset, but the sheer 
number of lines required (for example for the url recognition task) 
makes the code error prone and unobvious.

And I too _feel_ that the custom code would be faster but, on second 
thought, I'd probably have to go through some intensive optimizing 
cycles if I were up against even an average regexp implementation. ;-( 
This regexp stuff is "well understood" and has been polished over 
decades, after all.

As to "easier to tweak": suppose the Boss comes to you 2 months later 
and wants this Url Recognizer (which you had to write in a hurry to 
compete with the regexp guy in the next cubicle) to accept only 
country-specific urls whose name sits directly under the top-level 
domain. You'd be hard put to know where to start tweaking, while the 
other guy gets it right in 30 seconds flat by tweaking his regexp.

(The boss's tweak accepts foo.fi, but neither foo.bar.fi nor foo.com.)
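
Just to make the "30 seconds" concrete, here is a rough sketch of what the tweaked check might look like with a present-day Phobos std.regex. The function name and the pattern are my own assumptions, not anything from the code in this thread: I'm taking "country specific" to mean a two-letter top-level domain, and I only look at the bare host name, not at a whole url.

```d
import std.regex : matchFirst, regex;

// Hypothetical helper, not from the thread: true only when `host` is a single
// label sitting directly under a two-letter (country-code style) domain.
bool isCountryTopLevelHost(string host)
{
    // One label of letters, digits or hyphens, one dot, exactly two letters.
    auto pattern = regex(`^[A-Za-z0-9-]+\.[A-Za-z]{2}$`);
    return !matchFirst(host, pattern).empty;
}
```

Under those assumptions isCountryTopLevelHost("foo.fi") holds, while "foo.bar.fi" (two dots) and "foo.com" (three-letter domain) are rejected. The whole rule lives in one line, which is about the only point I'm trying to make.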

> Here's how I'd do it:
> 
> import std.stdio;
> import std.string;
> import std.c.string; // strchr (explicit, in case std.string does not expose it)
> 
> char[] some_text = "The email address Walter is posting from is newshound@digitalmars.com.  The headers for your message have <news@terrainformatica.com>, so I would assume that is your address.  My address can be found in this HTML: <a href=\"mailto:unknown@simplemachines.org\">my email</a>";
> 
> void main()
> {
>     char[][] res;   
>     res = parse_string(some_text);
>     foreach(int i, char[] r; res)
>         writefln("%d. %s",i+1,r);
> }
> 
> bool valid_email_char(char c)
> {
>     char* special = "<>()[]\\.,;:@\"";
>     if (c == '.') return true;
>     if (c <= 0x1F) return false;
>     if (c == 0x7F) return false;
>     if (c == ' ') return false;
>     if (strchr(special,c)) return false;
>     return true;
> }
> 
> char[][] parse_string(char[] text)
> {
>     char[][] res;
>     char* raw = toStringz(text);
>     char* p;
>     char* e;
>     
>     // For each '@': scan right to the end of the address, then left to its start.
>     for(p = strchr(raw,'@'); p; p = strchr(e,'@')) {
>         for(e = p+1; valid_email_char(*e); e++) {}
>         if (e > raw && *(e-1) == '.') e--; // drop a sentence-ending '.'
>         for(; p > raw && valid_email_char(*(p-1)); p--) {}
>         res ~= p[0..(e-p)]; //add .dup if required
>     }
>     
>     return res;
> }
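
For comparison, roughly the same extraction could be sketched as a single pattern with a present-day Phobos std.regex. Both the function name and the pattern are assumptions of mine, and the pattern is deliberately loose: it does not accept the same character set as valid_email_char above, and it is nowhere near the full address syntax.

```d
import std.regex : matchAll, regex;

// Hypothetical counterpart to parse_string above: collect everything that
// looks like local-part@domain. The pattern is a loose illustration only.
string[] findEmails(string text)
{
    // Local part, '@', then dot-separated labels. Every dot must be followed
    // by another label, so a sentence-ending '.' stays out of the match.
    auto pattern = regex(`[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*`);

    string[] found;
    foreach (m; matchAll(text, pattern))
        found ~= m.hit; // m.hit is the slice of `text` that matched
    return found;
}
```

Whether this is easier to read than the hand-written loop is exactly the question I'm undecided about, but it is certainly shorter.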