Search a file skiping whitespace

Sat Jul 16 14:46:56 PDT 2011

On 17.07.2011 1:28, Johann MacDonagh wrote:
> On 7/16/2011 5:07 PM, Dmitry Olshansky wrote:
>>
>> Let's drill down to the problem through this barrage of crap:
>>
>> the problem statement is
>>
>> Error: cannot implicitly convert expression (haystack) of type dchar[]
>> to string
>>
>> So (apparently) the problem is that after array(filter!(... you get
>> array of dchars (unicode codepoints)as a result of filtering string
>> (which is UTF-8 under the hood btw) while you are going to search an
>> UTF-8 string.
>> And UTF-8 string is (once again) is not random accessible in sense of
>> codepoints (it's needs an UTF decode though it's clearly not needed in
>> your case).
>> The simplest workaround I can think of is convert needle to dstring:
>> auto needle = boyerMooreFinder(to!dstring(args[1])); //found in std.conv
>>
>>
>
> Nope, that doesn't work.
>
> Here's what he should do:
>
> import std.algorithm;
> import std.array;
> import std.file;
> import std.stdio;
>
> void main(string[] args) {
>     foreach (string name; dirEntries(".", SpanMode.shallow)) {
>         if (name[$-3 .. $] == "txt") {
>             writeln(name);
>             string text = readText(name);
>             auto haystack = filter!("a >= '0' && a <='9'")(text);
>             auto result = find(haystack, args[1]);
>             writeln(result);
>         }
>     }
> }
>
> If you call array() on the filter you're already doing O(n) 
> operations, meaning you're not going to get any performance benefit by 
> being able to do Boyer-Moore.
>

Actually with brute force you do O(mn) comparisons in general while 
there are Boyer-Moor implementations that will do O(n). Now with an 
alphabet of 0..9 I'd say you won't get a lot of speed up with 
Boyer-Moore, in any case I'd suggest OP to at least test both. In this 
case  however my bet is on  bruteforce (given an extra allocation on 
each string with B-M).

> Correct me if I'm wrong, but I believe find() will do Boyer-Moore if 
> it knows its haystack is random access. This is the power of templated 
> programming. find! can generate an efficient find algorithm if it 
> knows the range is random access, or default to a less efficient one 
> if not.

No, the problem of BoyerMoor search is there is some preparatory 
overhead to construct lookup table, apparently that table is stored in 
that BoyerMoorFinder. Frankly, std.algorithm might add at least one 
another sting searching algorithm e.g.
http://monge.univ-mlv.fr/~lecroq/string/

-- 
Dmitry Olshansky