Find Semantically Correct Word Splits in UTF-8 Strings

monarch_dodra via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Oct 1 10:09:56 PDT 2014


On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
> On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
>> I'm looking for a way to make my algorithm
>>
>
> Update:
>
>     S[] findMeaningfulWordSplit(S)(S word,
>                                    HLang[] langs = [])
>         if (isSomeString!S)
>     {
>         for (size_t i = 1; i + 1 < word.length; i++)
>         {
>             const first = word.takeExactly(i).to!string;

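Does that even work? takeExactly(i) takes i *code points* (a string 
is iterated as a range of decoded dchars), whereas word.length counts 
*code units*. With non-ASCII input, i can run past the number of code 
points the string actually holds.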
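
To make the mismatch concrete, here is a minimal sketch (my own example, assuming the name is stored as UTF-8 with a precomposed "ö") that compares the two counts:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // "Nordlöw", with the ö written as an escape so the source
    // file's encoding does not matter.
    string word = "Nordl\u00F6w";

    writeln(word.length);     // number of UTF-8 code units
    writeln(word.walkLength); // number of code points (decoded dchars)
}
```

The two numbers differ because ö occupies two code units in UTF-8, so an index bounded by word.length can overshoot the number of code points.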

Something like:

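for (auto second = str; !second.empty; second.popFront())
{
     // popFront drops one whole code point, so second.length
     // (in code units) always lands on a code point boundary.
     auto first = str[0 .. $ - second.length];
     ...
}
// Special case: handle the final split (str, str[$ .. $]) here,
// or adapt the loop to cover it.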

That would also be Unicode-correct, without increasing the original 
complexity.
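
For reference, a self-contained sketch of that loop. The names Split and codePointSplits are hypothetical helpers invented for this example, not part of Phobos or of the original code, and the dictionary lookup from the original function is left out:

```d
import std.range.primitives : empty, popFront;
import std.stdio : writefln;

// Hypothetical helper type: one way of cutting a string in two.
struct Split
{
    string first;
    string second;
}

// Hypothetical helper: returns every split of `str` that falls on a
// code point boundary, including the two splits with an empty half.
Split[] codePointSplits(string str)
{
    Split[] result;
    for (auto second = str; !second.empty; second.popFront())
    {
        // `second` shrinks by one whole code point per iteration, so
        // slicing by its code unit length never cuts a code point.
        auto first = str[0 .. $ - second.length];
        result ~= Split(first, second);
    }
    // The loop ends before the last split (whole string, empty tail).
    result ~= Split(str, str[$ .. $]);
    return result;
}

void main()
{
    foreach (s; codePointSplits("l\u00F6w"))
        writefln(`"%s" | "%s"`, s.first, s.second);
}
```

Note that the first iteration yields an empty first half; skip it (and the final special case) if only non-empty parts are wanted. This splits on code points only, so a base character followed by a combining mark could still be separated.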

