[GSOC idea] enhance regular expressions?
Dmitry Olshansky
dmitry.olsh at gmail.com
Sun Apr 10 13:21:53 PDT 2011
On 02.04.2011 14:58, Peter Chadwick wrote:
> Dmitry Olshansky<dmitry.olsh at gmail.com> writes:
>
>> On 31.03.2011 22:16,petevik38 at yahoo.com.au wrote:
>>> Dmitry Olshansky<dmitry.olsh at gmail.com> writes:
>>>> --- somewhat informal draft ---
>>>>
>>> Hi Dmitry,
>>>
>>> It's good to know that there is interest in improving regular
>>> expressions in the D standard library. I've run into a number of
>>> problems using std.regex myself, so I'm keen so see either fixes for
>>> std.regex or an improved replacement.
>> Thanks for the interest.
>> Indeed there are problems, though the engine itself is quite capable
>> and optimized, I think it could be fixed.
> Great.
>
>>> I've been working on a regular expression library myself based around
>>> Russ Cox article athttp://swtch.com/~rsc/regexp/regexp2.html and the
>>> std.regex range interface. It includes a backtracking and Thompson
>>> engine. If you are interested it is on github:
>>>
>>> https://github.com/PeteChadwick/pcregexeng
>>>
>>> The ddoc documentation is here:
>>> http://petechadwick.github.com/pcregexeng/
>>>
>> Wow, I'll definitely look into it.
>> Overall it seems solid even though incomplete. From glance it seems
>> that parsing a pattern is somewhat convulted, but I'll defer any
>> critics or decisions until I looked in closer detail.
> I'm open to any criticisms or suggestions you may have.
Ok, I think I got my criticisms sorted out.
Sorry for taking so much time, I've got really involved week, attending
to a couple events and, in fact, helping organize one.
First things first - I like your structured way of expressing regex VM
instructions by introducing distinct types for each, but let's examine
what good does it get you in the end.
Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)&program[pos];
final switch( *instType )
{
...
case REInst.IChar:
InstIChar* inst = cast(InstIChar*)&program[pos]; //disadvantage
writefln( "IChar %s", inst.c ); //benefit
pos += InstIChar.sizeof;//not used //well to actually
benefit of introduced types, should it be (*inst).sizeof ? ...
Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
return 0;
state_pc += InstChar.sizeof;//not used
Which means that the whole typing isn't exploited much, but gets in your
way far to often.
The idea to avoid all the mess with double casting is to introduce tag
union:
struct IR{
uint opcode;
union{
struct {
dchar c;
}
struct {
ubyte[8] bitmap;
}
//etc..
}
}
Then the double cast is removed:
auto inst = cast(IR*)&program[pos];
switch(inst.InstType){
...
case REInst.IChar:
writefln( "IChar %s", inst.c ); pos += InstIChar.sizeof;
}
The only problem that remains is that we've lost the actual size of
struct which varies based on an opcode value, and constructors...
Solving both at once :
struct WrapedIR(REInst InstType,T)
{
enum length = instLength(InstType);
@property auto bytes(){// see later
return (cast(byte*)data)[0..length];
}
T data;
alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
switch(type){
case REInst.Char: case REInst.IChar:
return sizeof(uint)+sizeof(dchar);
...
}
}
and having helper function (which may use @property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }
will get us pretty much the same thing (or better) everywhere you create
instructions:
auto charInst = makeInst!(REInst.Char)();
Which gets us to how you are creating instructions actually:
static string MakeREInst( string instType, string instName )
{
string result =
"byte["~instType~".sizeof] "~instName~"Buf;"~
instType~"* "~instName~" =
cast("~instType~"*)&"~instName~"Buf[0];"~
"*"~instName~" = "~instType~"();";
return result;
}
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)&instNameBuf[0];
*instName = instType();
Even disregarding the suggested approach to IR representation, the
better alternative I think would be the opposite - stack allocate
struct, and treat it as array of bytes when need be.
Isn't this one of prime cases for introducing types in the first place?
Also, why mixins?
instType instName = instType(); //can't be any worse then
mixin(MakeREInst( "instType", "instName" );
Then for getting bytes of any struct use this one-liner (analogous to
'bytes' of the above IR tag union):
ubyte[] bytesOf(T)(T* t){
return (cast(byte*)t)[0..T.sizeof];
}
Essentially you need to treat instructions as bytes exactly once: when
appending or inserting them into VM program.
program ~= splitInstBuf //append magically mixed-in buffer
...
becomes
program ~= bytesOf(&splitInst)
...
Doesn't look half bad, and compiler wouldn't allow missing '&' if anything.
Look & feel clearly could be enhanced by changing pointer argument to
ref argument in bytesOf, but I don't think that current DMD inliner
works well with ref. (and we _do_ need inliner for such functions)
One more unrelated thing is the way you do the insertion of some IR
instructions (this one from parseRegex, there are more):
program =
program[0..progConcatStart] // gets a slice of program, but since it's
followed by other data it has capacity of 0
~ splitInstBuf //and so this always allocates new array!
~ program[progConcatStart..$]; //and this might or
might not allocate
So no matter how innocent it looks, it's 1-2 allocation _every_ time.
To solve this troublesome consequence you need something like this (not
actually tested ):
program.length += splitInstBuf.length;
memmove(
&program[progConcatStart+splitInstBuf.length],
&program[progConcatStart],
program[0].sizeof*(program.length-progConcatStart)
);
program[progConcatStart..progConcatStart+splitBuf.length] = splitBuf[0..$];
...or simply cheat ;) and use std.array.insert:
insert(program,progConcatStart,splitBuf);
To further improve things also check on how to exploit Appender in
Phobos, it should be faster for continuous appending.
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list