[GSOC idea] enhance regular expressions?

Sun Apr 10 13:21:53 PDT 2011

On 02.04.2011 14:58, Peter Chadwick wrote:
> Dmitry Olshansky<dmitry.olsh at gmail.com>  writes:
>
>> On 31.03.2011 22:16,petevik38 at yahoo.com.au  wrote:
>>> Dmitry Olshansky<dmitry.olsh at gmail.com>   writes:
>>>> --- somewhat informal draft ---
>>>>
>>> Hi Dmitry,
>>>
>>> It's good to know that there is interest in improving regular
>>> expressions in the D standard library. I've run into a number of
>>> problems using std.regex myself, so I'm keen so see either fixes for
>>> std.regex or an improved replacement.
>> Thanks for the interest.
>> Indeed there are problems, though the engine itself is quite capable
>> and optimized, I think it could be fixed.
> Great.
>
>>> I've been working on a regular expression library myself based around
>>> Russ Cox article athttp://swtch.com/~rsc/regexp/regexp2.html  and the
>>> std.regex range interface. It includes a backtracking and Thompson
>>> engine. If you are interested it is on github:
>>>
>>> https://github.com/PeteChadwick/pcregexeng
>>>
>>> The ddoc documentation is here:
>>> http://petechadwick.github.com/pcregexeng/
>>>
>> Wow, I'll definitely look into it.
>> Overall it seems solid even though incomplete. From glance it seems
>> that parsing a pattern is somewhat convulted, but I'll defer any
>> critics or decisions until I looked in closer detail.
> I'm open to any criticisms or suggestions you may have.

Ok, I think I got my criticisms sorted out.
Sorry for taking so much time, I've got really involved week, attending 
to a couple events and, in fact, helping organize one.

First things first - I like your structured way of expressing regex VM  
instructions by introducing distinct types for each, but let's examine 
what good does it get you in the end.
Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)&program[pos];
final switch( *instType )
  {
...
case REInst.IChar:
      InstIChar* inst = cast(InstIChar*)&program[pos]; //disadvantage

  writefln( "IChar %s", inst.c ); //benefit
               pos += InstIChar.sizeof;//not used //well to actually 
benefit of introduced types, should it be (*inst).sizeof ? ...

Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
     return 0;
state_pc += InstChar.sizeof;//not used

Which means that the whole typing isn't exploited much, but gets in your 
way far to often.
The idea to avoid all the mess with double casting is to introduce tag 
union:
struct IR{
     uint opcode;
     union{
         struct {
             dchar c;
         }
         struct {
             ubyte[8] bitmap;
         }
//etc..
     }
}
Then the double cast is removed:
auto inst = cast(IR*)&program[pos];
switch(inst.InstType){
...
case REInst.IChar:

  writefln( "IChar %s", inst.c ); pos += InstIChar.sizeof;

}
The only problem that remains is that we've lost the actual size of 
struct which varies based on an opcode value, and  constructors...
Solving both at once :
struct WrapedIR(REInst InstType,T)
{
     enum length = instLength(InstType);
     @property auto bytes(){// see later
         return (cast(byte*)data)[0..length];
     }
     T data;
    alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
     switch(type){
     case REInst.Char: case REInst.IChar:
             return sizeof(uint)+sizeof(dchar);
     ...
     }
}
and having helper function (which may use @property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }

will get us pretty much the same thing (or better) everywhere you create 
instructions:
auto charInst = makeInst!(REInst.Char)();

Which gets us to how you are creating instructions actually:
   static string MakeREInst( string instType, string instName )
     {
         string result =
             "byte["~instType~".sizeof] "~instName~"Buf;"~
             instType~"* "~instName~" = 
cast("~instType~"*)&"~instName~"Buf[0];"~
             "*"~instName~" = "~instType~"();";

         return result;
     }
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)&instNameBuf[0];
*instName = instType();

Even disregarding the suggested approach to IR representation, the 
better alternative I think would be the opposite - stack allocate 
struct, and treat it as array of bytes  when need be.
Isn't this one of prime cases for introducing types in the first place? 
Also, why mixins?
instType instName = instType(); //can't be any  worse then 
mixin(MakeREInst( "instType", "instName" );

Then for getting bytes of any struct use this one-liner (analogous to 
'bytes' of the above IR tag union):
ubyte[] bytesOf(T)(T* t){
     return (cast(byte*)t)[0..T.sizeof];
}
Essentially you need to treat instructions as bytes exactly once: when 
appending or inserting them into VM program.
program ~=  splitInstBuf  //append magically mixed-in buffer
     ...
becomes
program ~=  bytesOf(&splitInst)
     ...
Doesn't look half bad, and compiler wouldn't allow missing '&' if anything.
Look & feel clearly could be enhanced by changing pointer argument to 
ref argument in bytesOf, but I don't think that current DMD inliner 
works well with ref. (and we _do_ need inliner for such functions)

One more unrelated thing is the way you do the insertion of some IR 
instructions  (this one from  parseRegex, there are more):
program =
program[0..progConcatStart] // gets a slice of program, but since it's 
followed by other data it has capacity of 0
                ~ splitInstBuf //and so this always allocates new array!
                 ~ program[progConcatStart..$]; //and this might or 
might not allocate
So no matter how innocent  it looks, it's  1-2 allocation _every_ time.
To solve this troublesome consequence you need something like this (not 
actually tested ):
program.length += splitInstBuf.length;
memmove(
&program[progConcatStart+splitInstBuf.length],
&program[progConcatStart],
     program[0].sizeof*(program.length-progConcatStart)
);
program[progConcatStart..progConcatStart+splitBuf.length] = splitBuf[0..$];

...or simply cheat ;) and use std.array.insert:
insert(program,progConcatStart,splitBuf);

To further improve things also check on how to exploit Appender in 
Phobos, it should be faster for continuous appending.

-- 
Dmitry Olshansky