Find Semantically Correct Word Splits in UTF-8 Strings

Wed Oct 1 04:06:23 PDT 2014

I'm looking for a way to make my algorithm

     S[] findWordSplit(S)(S word,
                          HLang[] langs = [])
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word[0..i];
             const second = word[i..$];
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

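work correctly when S is a (UTF-8) string, lazily, without first
encoding word to a dstring. As written, i counts code units, so
word[0..i] and word[i..$] can cut a multi-byte character such as
ä in half.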

Currently this algorithm works for ASCII input such as

"carwash" => ["car", "wash"]

and I would like it to work correctly and efficiently in my native
language as well, for example

"biltvätt" => ["bil", "tvätt"]
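
A rough sketch of the direction I have in mind, in case it helps frame
the question: step i by whole code points using std.utf.stride, so that
every slice starts and ends on a character boundary and no dstring is
ever built. This is untested against my real code. The free function
canMeanSomething below is a hypothetical stand-in with a hard-coded word
list, the langs parameter is left out, and the loop bound is simplified
to i < word.length.

```d
import std.algorithm.searching : canFind;
import std.utf : stride;

// Hypothetical stand-in for the real dictionary lookup.
bool canMeanSomething(const(char)[] s)
{
    return ["car", "wash", "bil", "tvätt"].canFind(s);
}

S[] findWordSplit(S)(S word)
{
    if (word.length == 0)
        return null;

    // stride(word, i) is the number of code units in the code point
    // starting at index i, so i always lands on a code point boundary.
    for (size_t i = stride(word, 0); i < word.length; i += stride(word, i))
    {
        auto first = word[0 .. i];   // slices share memory with word
        auto second = word[i .. $];
        if (canMeanSomething(first) && canMeanSomething(second))
            return [first, second];
    }
    return null;
}
```

With the stand-in word list, findWordSplit("biltvätt") should give
["bil", "tvätt"] and a word with no valid split should give an empty
array. Whether stepping with stride is the most efficient option is
exactly what I'm unsure about.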

:)