Latest string_token Code

Ben Hanson Ben.Hanson at tfbplc.co.uk
Tue Jun 22 08:59:13 PDT 2010


== Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
> On 06/22/2010 08:13 AM, Ben Hanson wrote:
> > Here's the latest with naming convention (hopefully) followed. I've implemented my
> > own squeeze() function and used sizeof in the memmove calls.
> I suggest you to look into using the range primitives (empty, front,
> back, popFront, and popBack) with strings of any width. Your code
> assumes that all characters have the same width and therefore will
> behave erratically on UTF-8 and UTF-16 encodings.
> In the particular case of squeeze(), you may want to use uniq instead,
> which works on any forward range and will therefore decode characters
> properly:
> http://www.digitalmars.com/d/2.0/phobos/std_algorithm.html#uniq
> Andrei

OK, thanks.

Don't forget these are regular expressions though. I was wondering whether people
really want to pass regular expressions UTF encoded, but I suppose it could
happen. It's certainly a good idea to get used to using UTF compatible functions
anyway.
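
For illustration, here is a minimal sketch of the difference Andrei describes, assuming a UTF-8 encoded source file and a Phobos with std.algorithm.uniq and std.range.walkLength. The helper name squeezeDecoded is hypothetical and is not part of the string_token code. It only shows that iterating a string through the range interface yields whole code points, whereas .length counts code units.

```d
import std.algorithm : equal, uniq;
import std.range : walkLength;

// Hypothetical helper (not from the posted code): collapse runs of
// identical code points by iterating the string as a range, which
// decodes UTF-8 one code point at a time.
auto squeezeDecoded(string s)
{
    return uniq(s);
}

void main()
{
    string s = "aääb";

    // .length counts UTF-8 code units: 'ä' occupies two bytes.
    assert(s.length == 6);

    // Walking the range counts decoded code points instead.
    assert(walkLength(s) == 4);

    // The two adjacent 'ä' characters collapse into one.
    assert(equal(squeezeDecoded(s), "aäb"));
}
```

A squeeze that compares adjacent code units would not treat the two 'ä' characters as a run, since each one is a two-byte sequence.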

Is there any support for Unicode continuation characters yet? Do you agree that
(ideally) Unicode text should be normalised before searching?

Regards,

Ben

