prefix match of a regex and optimized dirEntries for regex search
Dmitry Olshansky via Digitalmars-d
digitalmars-d at puremagic.com
Wed Jun 18 12:37:11 PDT 2014
18-Jun-2014 21:38, Timothee Cour via Digitalmars-d пишет:
> I made a simple modification to std.regex to allow an option to prefix
> match a regex.
> Formally, if L(R) is the language recognized by a regex R, the language
> recognized by prefix matching of R is:
>
> L(p(R)) = prefix(L(R)) = {u : uv in L(R) for some v}
> Trying to come up (by hand or algorithmically) with a regex R' such that
> L(R') L(p(R)) is awkward and inefficient, eg:
>
So essentially it's just a set of all partial matches?
> R='hello';
> R'=`|h|he|hell|hello` = `(h(e(l(l(o)?)?)?)?)?`;
>
> However thinking in terms of state machine this is much easier and
> efficient.
>
> It looks like this:
> assert("hel".match(`hello\d+`.regex("p")); //p for prefix match
>
Well, since "p" would apply to operation rather then regex I think
adding a partialMatch function would be simpler.
Plus it may need to return extended information like where in the
pattern it stopped, so the return type/API may need to be different from
normal match.
> If there's interest in adding this I can prepare a pull request and we
> can discuss more.
>
Sure partial matching wold be nice to have and any help in this
direction is warmly welcome.
Would be cool if you could take a look at
https://github.com/D-Programming-Language/phobos/pull/2216
It's nothing special except splitting up std.regex into a package but
it's a stepping stone of going forward with it.
> Example use case:
> I wrote a function to search a file given a regex, and it is optimized
> to prune directories early on if they fail to prefix match the regex, eg:
I'm not sure I follow - a normal regex match would do essentially the
same by first checking the prefix and failing on wrong ones. Where is
optimization?
What could be interesting is to limit matching to not go beyond say 80
code units and report if it's (partial) match of given pattern.
>
> dirEntriesOptimized(`abc/folder_\d+/\w+\.cpp`)
> when encountering `abc/bad_subfolder/` it will not recurse on this as it
> fails the prefix regex match.
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list