Find Semantically Correct Word Splits in UTF-8 Strings

monarch_dodra via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Oct 1 09:42:46 PDT 2014


On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
> On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
>> I'm looking for a way to make my algorithm
>>
>
> Update:
>
>     S[] findMeaningfulWordSplit(S)(S word,
>                                    HLang[] langs = [])
>         if (isSomeString!S)
>     {
>         for (size_t i = 1; i + 1 < word.length; i++)
>         {
>             const first = word.takeExactly(i).to!string;
>             const second = word.dropExactly(i).to!string;
>             if (this.canMeanSomething(first, langs) &&
>                 this.canMeanSomething(second, langs))
>             {
>                 return [first,
>                         second];
>             }
>         }
>         return typeof(return).init;
>     }
>
> works but unfortunately has O(word.length^^2) time complexity.
>
> Do we need a new InputRange algorithm for this?

This seems like a textbook case for a trie:
http://en.wikipedia.org/wiki/Trie

Once your "dictionary" is built, it can find *all* the splits in 
O(word.length).

...but I don't know how well it adapts to the range interface. 
Should still work though.
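
For illustration, here is a rough, self-contained sketch of the idea 
(not your actual code: the node layout, the extra reversed-word trie 
and all the names below are my own assumptions, and your 
HLang/canMeanSomething machinery is left out). One forward walk marks 
every prefix that is a dictionary word, one walk over the reversed 
word against a trie of reversed dictionary words marks every suffix, 
and the split points are simply the indices where both halves are 
marked:

    import std.conv : to;
    import std.range : retro;
    import std.stdio : writeln;

    // Sketch only: a char-keyed trie over raw UTF-8 code units.
    final class TrieNode
    {
        TrieNode[char] children; // edges keyed by code unit
        bool isWord;             // a dictionary word ends at this node
    }

    void insert(TrieNode root, string word)
    {
        auto node = root;
        foreach (char c; word)
        {
            auto next = c in node.children;
            if (next is null)
            {
                auto child = new TrieNode;
                node.children[c] = child;
                node = child;
            }
            else
                node = *next;
        }
        node.isWord = true;
    }

    // All indices i (0 < i < word.length, counted in code units) such
    // that word[0 .. i] and word[i .. $] are both dictionary words.
    size_t[] findSplits(TrieNode forward, TrieNode backward, string word)
    {
        immutable n = word.length;
        auto prefixIsWord = new bool[n + 1];
        auto suffixIsWord = new bool[n + 1];

        // Forward walk: mark every prefix that is a word.
        auto node = forward;
        foreach (i, char c; word)
        {
            auto next = c in node.children;
            if (next is null) break;
            node = *next;
            prefixIsWord[i + 1] = node.isWord;
        }

        // Walk the reversed word against the reversed-word trie:
        // mark every suffix that is a word.
        node = backward;
        foreach (j, char c; word.retro.to!string)
        {
            auto next = c in node.children;
            if (next is null) break;
            node = *next;
            suffixIsWord[n - 1 - j] = node.isWord;
        }

        size_t[] splits;
        foreach (i; 1 .. n)
            if (prefixIsWord[i] && suffixIsWord[i])
                splits ~= i;
        return splits;
    }

    void main()
    {
        auto forward = new TrieNode;
        auto backward = new TrieNode;
        foreach (w; ["car", "pet", "carpet", "rot"])
        {
            insert(forward, w);
            insert(backward, w.retro.to!string);
        }
        writeln(findSplits(forward, backward, "carpet")); // [3]: "car" ~ "pet"
    }

Building the two tries costs O(total dictionary size) once; each 
query is then linear in the word's length (in code units).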

Also, Aho–Corasick might be relevant depending on what you want 
to do, for example, finding all substrings that happen to be a 
word, regardless of what comes before or after:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
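
Just to make that concrete, here is a naive sketch of the output 
such a search would give you (the dictionary and the 
allWordOccurrences name are made up for the example; this is not 
Aho-Corasick itself, which adds failure links to a trie over the 
dictionary so the same matches fall out of a single pass over the 
input):

    import std.algorithm.searching : startsWith;
    import std.stdio : writeln;
    import std.typecons : Tuple, tuple;

    // Naive stand-in for what Aho-Corasick would compute: every
    // (start index, word) pair where a dictionary word occurs as a
    // substring of `text`.
    Tuple!(size_t, string)[] allWordOccurrences(string[] dictionary, string text)
    {
        Tuple!(size_t, string)[] hits;
        foreach (i; 0 .. text.length)           // every start position (code units)
            foreach (w; dictionary)
                if (text[i .. $].startsWith(w)) // the scan Aho-Corasick avoids
                    hits ~= tuple(i, w);
        return hits;
    }

    void main()
    {
        auto dict = ["car", "pet", "carpet", "arp"];
        writeln(allWordOccurrences(dict, "carpets"));
        // matches: (0, "car"), (0, "carpet"), (1, "arp"), (3, "pet")
    }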

