std.regexp vs std.regex [Re: RegExp.find() now crippled]

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Mon Nov 15 12:53:51 PST 2010


On 11/15/10 7:55 AM, Steve Teale wrote:
> KennyTM~ Wrote:
>
>> On Nov 15, 10 14:58, Steve Teale wrote:
>>> Some time ago in phobos2, the following:
>>>
>>>      RegExp wsr = RegExp("(\\s+)");
>>>      int p = wsr.find("<thingie att1=\"whatever\">");
>>>      writefln("%s|%s|%s %d",wsr.pre(),  wsr.match(1), wsr.post(), p);
>>>
>>> would print:
>>>
>>> <thingie| |att1="whatever">   7
>>>
>>> Now it prints
>>>
>>> <thingie| |att1="whatever">   1
>>>
>>> The new return value is pretty useless, equivalent to returning a bool. It seems to me that the 'find' verb's subject should be the string, not the RegExp object.
>>>
>>> This looks like a case of the implementation being changed to match the documentation, when in fact it would have been better to change the documentation to match the implementation.
>>>
>>> Either that, or RegExp should have an indexOf method that behaves like string.indexOf.
>>>
>>> Steve
>>>
>>
>> Isn't std.regexp replaced by std.regex? Why are both of them still in
>> Phobos 2?
>>
>> (oh, and std.regex is missing a documented .index (= .src_start) property.)
>
> I guess std.regexp is still there because not all of us necessarily
> want to iterate a range to simply find out the position of the first
> whitespace in a string. Part of the expressiveness of languages is
> that one should be free to use the style that suits, and not have to
> read the documentation every time one uses it. Give me options in
> Phobos by all means.
>
> D2 is not going to succeed by forcing its users to use unfamiliar,
> and maybe not yet very fashionable constructions.
>
> I'm pissed off because this change broke a lot of my code, which I
> had not used for some time, but now have a paying customer for. The
> code did not break because of D language evolution. It broke because
> somebody decided they did not like the style of std.regexp.  All I
> wanted was plain old regular expressions, similar to JavaScript, or
> PHP, or other popular languages, and std.regexp did that pretty well
> at one time.
>
> Steve

I am sorry for the inadvertent change; it wasn't meant to change the 
semantics of existing code. I'm not sure whether one of my unrelated 
64-bit changes messed things up. You may want to file a bug report.

There are a number of good reasons for which I was compelled to split 
std.regex from std.regexp. I'm sure you or others would have found them 
just as compelling if you saw things the same way.

Phobos 1 experimented in std.string and std.regexp with juxtaposing 
APIs of various languages (PHP, Ruby, Python). The reasoning was that 
people familiar with any of those languages could feel right at home 
by using APIs with similar nomenclatures and semantics. The result was 
some strange bedfellows in std.string such as "column" or "capwords" and 
an outright mess in std.regexp. The interface of std.regexp is without a 
doubt the worst I've ever seen, by a long shot. I have never been able 
to use it without poring through the documentation _several times_ and 
without confirming to myself via a small test case that I'm doing the 
right thing.

The simplest problem is this: std.regexp uses the words "exec", "find", 
"match", "search", and "test" - all to mean regular expression matching. 
There is absolutely no logic to how meanings are ascribed to words, and 
there is absolutely no recourse other than rote memorization of various 
arbitrary decisions.

The resulting FrankenAPI is likely unfamiliar to anyone except those 
who've actually spent time learning it, in spite of it trying to be 
familiar to everyone.

So I spawned std.regex in an attempt to sanitize the API (I made minor, 
if any, changes to the engine; I am in fact having significant trouble 
maintaining it). The advantages of std.regex are:

* No more class definition. Nobody is supposed to inherit RegExp anyway 
so it's useless to brand the object as a class.

* Engine is separated from matches, which means that engines can be 
memoized for efficiency. Currently regex() only memoizes the last engine.

* The new engine works with any character size.

* Simpler API: create a regex, call match() against that regex and a 
string, look at the resulting RegexMatch object.
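
To make that last point concrete, here is a minimal sketch of recovering 
the position of the first match from the match object itself. It assumes 
a recent Phobos in which std.regex provides matchFirst; with the std.regex 
discussed in this thread the equivalent call would be match(input, regex(...)). 
The helper firstMatchIndex is a hypothetical name used only for illustration 
and is not part of Phobos.

```d
import std.regex : matchFirst, regex;
import std.stdio : writefln;

// Hypothetical helper (not part of Phobos): offset of the first match of
// `pattern` in `input`, or -1 when nothing matches.
ptrdiff_t firstMatchIndex(string input, string pattern)
{
    auto m = matchFirst(input, regex(pattern));
    // `pre` is the slice of the input before the match, so its length
    // is the offset at which the match starts.
    return m.empty ? -1 : cast(ptrdiff_t) m.pre.length;
}

void main()
{
    auto input = "<thingie att1=\"whatever\">";
    auto m = matchFirst(input, regex(`(\s+)`));
    if (!m.empty)
    {
        // Same fields as the old pre()/match(1)/post() calls, plus the offset.
        writefln("%s|%s|%s %d", m.pre, m[1], m.post, m.pre.length);
    }
}
```

The offset is derived from the length of pre rather than from a dedicated 
index property, so no iteration over the range of matches is needed for 
the "where is the first whitespace" case.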

If this all annoys you more than the old API, I will need to disagree. 
If you have suggestions on how std.regex can be improved, I'm all ears.


Andrei

