[phobos] UTF-8 string slicing
benjamin at precisionsoftware.us
Sun Aug 21 21:00:26 PDT 2011
On 08/19/2011 08:40 PM, Jonathan M Davis wrote:
> On Friday, August 19, 2011 19:58:34 Benjamin Shropshire wrote:
>> On 08/18/2011 02:21 AM, unDEFER wrote:
>>> D language specification says that it supports UTF-8 strings, but I
>>> can't find how to slice a UTF-8 string by character index rather than
>>> by byte offset. Why is there no simple slice function in std.utf like
>>> the attached code?
>> BTW: your code is flawed. Feed it some of the stuff near the end of this
>> post and it will fail:
>> tl;dr: your code doesn't slice on characters but on something called
>> (IIRC) code points. If you start worrying about diacritics (and many
>> end users will want you to), you need to do a bunch more processing.
> His code works as well as slicing a dstring does - save for the efficiency
> issues. There is no way in Phobos at present to deal with graphemes. All of
> the string processing in Phobos deals with code points. For the most part,
> this works great, but it is true that it isn't complete. I expect that we'll
> get grapheme support eventually (I believe that Dmitry has done some work on a
> grapheme range for the updates that he's been doing to std.regex for GSoC, so
> we may get it from there). But for now, none of the string processing in D
> worries about graphemes - just code points.
My thought on that subject is: I can see good reason to index on proper
characters (get the 4th char in the word), good reason to index to a
character (or sometimes a code point) near some byte position, and there
is clearly good reason to iterate through code points. But I don't see
much value in asking for a random Nth code point that can't be had via
something that has fewer problems and/or is cheaper.
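
To make the distinction concrete, here is a minimal sketch in D of slicing
by code point index. sliceCodePoints is a hypothetical helper (it is not in
Phobos); the sketch only assumes std.utf.toUTFindex and
std.range.walkLength. It shows both the cost (every index lookup walks the
string) and what happens when a diacritic is stored as a separate combining
code point:

```d
import std.range : walkLength;
import std.utf : toUTFindex;

/// Hypothetical helper, not part of Phobos: slice s by code point
/// indices [from, to) and return a view into the original UTF-8 bytes.
string sliceCodePoints(string s, size_t from, size_t to)
{
    // toUTFindex maps a code point index to a byte offset by walking the
    // string from the start, so each call is O(n).
    immutable lo = toUTFindex(s, from);
    immutable hi = toUTFindex(s, to);
    return s[lo .. hi];
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x'.
    // A reader sees two characters; the string holds three code points.
    string s = "e\u0301x";

    assert(s.length == 4);      // UTF-8 code units (bytes)
    assert(s.walkLength == 3);  // code points

    // Asking for the "first character" by code point drops the accent.
    assert(sliceCodePoints(s, 0, 1) == "e");

    // The remainder now begins with a dangling combining mark.
    assert(sliceCodePoints(s, 1, 3) == "\u0301x");
}
```

Both slices are valid UTF-8, but the first has separated the base letter
from its accent, which is the problem with treating code points as
characters.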
> - Jonathan M Davis