prefix match of a regex and optimized dirEntries for regex search

Thu Jun 19 04:34:00 PDT 2014

19-Jun-2014 11:36, Timothee Cour via Digitalmars-d пишет:
> On Wed, Jun 18, 2014 at 12:37 PM, Dmitry Olshansky via Digitalmars-d
> <digitalmars-d at puremagic.com <mailto:digitalmars-d at puremagic.com>> wrote:
>
>     18-Jun-2014 21:38, Timothee Cour via Digitalmars-d пишет:
>
>         I made a simple modification to std.regex to allow an option to
>         prefix
>         match a regex.
>         Formally, if L(R) is the language recognized by a regex R, the
>         language
>         recognized by prefix matching of R  is:
>
>         L(p(R)) = prefix(L(R)) = {u : uv in L(R) for some v}
>         Trying to come up (by hand or algorithmically) with a regex R'
>         such that
>         L(R') L(p(R)) is awkward and inefficient, eg:
>
>     So essentially it's just a set of all partial matches?
>
>
> Not sure, that depends on your definition of partial matches; how do you
> formally define partial matches?
>
> There's a subtlety, let's call my proposal prefixMatch:
> "hel".prefixMatch("^hello") //true
> "".prefixMatch("^hello") //true
> "mel".prefixMatch("^hello") //false
>
> But if the regex doesn't start with `^` then it is not useful:
>
> "whatever".prefixMatch("hello") //true
> because the scanner will try each position in the input string and fail,
> except for the last one where it'll succeed trivially ("" is a prefix of
> "hello").

Well, I think partialMatch is only defined in the current position. 
Unlike normal match we can't provide default meaningful action e.g. if a 
trivial "" result means search next or just report trivial partial match.
>
> In other words, this is more useful for one shot matching.

Yes, it makes that much more sense for one shot matching IMO.

> But we can also consider partial partial matches if we provide
> additional constraints given via a lambda:
>
> "blabla hel blabla".partialMatch!(a=>a.hit.length>=3) ("hello")
> => finds a match ("hel") because the lambda succeeds on this match.
>
> partialMatch could also be used to find the longest partial match
> (either in terms of length of matching string or length of the regex):
>
> auto longestPartialMatch(R)(string input, R regex){
> string longest;
> while(true){
> auto m=partialMatch!(a=>a.hit.length>longest.length) (input,regex);
> if(m) longest=a.hit; else return longest;
> }

Seems doable as helpers on top, but this kind of thing could be made 
faster if built-in in std.regex like you suggest. I also can't see 
anything better then lambda at the moment.

>     Would be cool if you could take a look at
>     https://github.com/D-__Programming-Language/phobos/__pull/2216
>     <https://github.com/D-Programming-Language/phobos/pull/2216>
>
>     It's nothing special except splitting up std.regex into a package
>     but it's a stepping stone of going forward with it.
>
>
> That's a very welcome change, this module is too big currently. That
> seems like it would be the 1st phobos module to take advantage of
> package feature?
>

There is competition of std.container!
https://github.com/D-Programming-Language/phobos/pull/2174
Truth be told it's the most obvious candidate, as containers have 
nothing to do with each other.

>
>
>         Example use case:
>         I wrote a function to search a file given a regex, and it is
>         optimized
>         to prune directories early on if they fail to prefix match the
>         regex, eg:
>
>
>     I'm not sure I follow - a normal regex match would do essentially
>     the same by first checking the prefix and failing on wrong ones.
>     Where is optimization?
>
>
> The optimization is avoiding unnecessary file system access by pruning
> early on directories that we can prove will not contain any file which
> will match the regex. Without optimization, you have to consider (and
> reject) `abc/bad_subfolder/file_1`. With optimization, you never have to
> consider that file (nor ask the file system to list it), as
> `abc/bad_subfolder/` fails the prefix regex test (see my example). A
> normal regex needs to be applied to the whole path, and you need prefix
> matching to reject partial paths (ie, directories).

Now I see - you don't have the full input while traversing directories, 
cool idea.

>
>     What could be interesting is to limit matching to not go beyond say
>     80 code units and report if it's (partial) match of given pattern.
>
>
> That'd be nice; it needs to be as generic as possible; I propose using a
> user defined lambda (see above).

Well, why not but I'd suggest to focus on getting the partialMatch at 
the current position as a good generic building block to extend from.

-- 
Dmitry Olshansky