Find Semantically Correct Word Splits in UTF-8 Strings

monarch_dodra via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Oct 1 09:44:23 PDT 2014


On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
> I'm looking for a way to make my algorithm
>
>     S[] findWordSplit(S)(S word,
>                          HLang[] langs = [])
>     {
>         for (size_t i = 1; i + 1 < word.length; i++)
>         {
>             const first = word[0..i];
>             const second = word[i..$];
>             if (this.canMeanSomething(first, langs) &&
>                 this.canMeanSomething(second, langs))
>             {
>                 return [first,
>                         second];
>             }
>         }
>         return typeof(return).init;
>     }
>
> work correctly when S is a (UTF-8) string, without first having 
> to encode word to a dstring, even lazily.
>
> Currently this algorithm works as
>
> "carwash" => ["car", "wash"]
>
> and I would like it to work correctly and efficiently in my 
> native language as well, as in
>
> "biltvätt" => ["bil", "tvätt"]
>
> :)

Out of curiosity, why exactly isn't it working in your "native 
language"? If you avoid decoding in your "canMeanSomething", you 
should encounter no problems.
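For what it's worth, the pitfall with the byte-indexed loop is that a 
split index can land in the middle of a multi-byte code point ("ä" in 
"biltvätt" is two bytes in UTF-8), producing slices that are not valid 
UTF-8 on their own. A minimal sketch of one way around that, stepping 
the index with std.utf.stride so every slice falls on a code-point 
boundary (inDictionary here is a hypothetical stand-in for your 
canMeanSomething, and the word list is made up for illustration):

```d
import std.utf : stride;

// Hypothetical stand-in for canMeanSomething: a toy dictionary lookup.
bool inDictionary(const(char)[] s)
{
    import std.algorithm.searching : canFind;
    return ["bil", "tvätt", "car", "wash"].canFind(s);
}

const(char)[][] findWordSplit(const(char)[] word)
{
    // stride returns the byte length of the code point starting at
    // index i, so i only ever advances across whole code points and
    // both slices stay valid UTF-8. No decoding to dstring needed.
    for (size_t i = stride(word, 0); i < word.length; i += stride(word, i))
    {
        const first = word[0 .. i];
        const second = word[i .. $];
        if (inDictionary(first) && inDictionary(second))
            return [first, second];
    }
    return null;
}

void main()
{
    assert(findWordSplit("carwash") == ["car", "wash"]);
    assert(findWordSplit("biltvätt") == ["bil", "tvätt"]);
}
```

That said, the slices produced by plain byte indexing would simply 
never match a dictionary entry, so skipping the invalid split points 
with stride is mainly a matter of efficiency, not correctness.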

