[phobos] UTF-8 string slicing

Jonathan M Davis jmdavisProg at gmx.com
Fri Aug 19 20:40:06 PDT 2011


On Friday, August 19, 2011 19:58:34 Benjamin Shropshire wrote:
> On 08/18/2011 02:21 AM, unDEFER wrote:
> > Hello!
> > 
> > The D language specification says that it supports UTF-8 strings, but I
> > can't find how to slice a UTF-8 string by character index rather than by
> > byte offset. Why is there no simple slice function in std.utf like the
> > attached code?
> 
> BTW: your code is flawed. Feed it some of the stuff near the end of this
> post and it will fail:
> 
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
> 
> tl;dr: your code doesn't slice on characters but on something called (IIRC)
> code points. If you start worrying about diacritics (and many end users
> will want you to), you need to do a bunch more processing.
> 
> http://en.wikipedia.org/wiki/Diacritic

His code works as well as slicing a dstring does - save for the efficiency
issues. There is no way in Phobos at present to deal with graphemes. All of
the string processing in Phobos deals with code points. For the most part,
this works great, but it is true that it isn't complete. I expect that we'll
get grapheme support eventually (I believe that Dmitry has done some work on a
grapheme range for the updates that he's been doing to std.regex for GSoC, so
we may get it from there). But for now, none of the string processing in D
worries about graphemes - just code points.
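
To make the code point vs. grapheme distinction concrete, here is a minimal
sketch. The helper name sliceCodePoints is hypothetical (it is not a Phobos
function and not the code attached to the original message); it only assumes
std.utf.toUTFindex, which maps a code point index to a code unit index. The
same visible text "café" is built twice: once with a precomposed 'é', and once
with a plain 'e' followed by a combining acute accent.

```d
import std.range : walkLength;
import std.utf : toUTFindex;

// Hypothetical helper (not part of Phobos): slice a UTF-8 string
// by code point index instead of by byte offset.
string sliceCodePoints(string s, size_t from, size_t to)
{
    // Convert code point indices to code unit (byte) offsets.
    immutable lo = toUTFindex(s, from);
    immutable hi = toUTFindex(s, to);
    return s[lo .. hi];
}

void main()
{
    string precomposed = "caf\u00E9";  // 'é' as a single code point
    string combining   = "cafe\u0301"; // 'e' followed by U+0301 (combining acute)

    // walkLength on a string counts code points, not bytes.
    assert(precomposed.walkLength == 4);
    assert(combining.walkLength == 5);

    // Slicing by code point keeps the precomposed 'é' intact...
    assert(sliceCodePoints(precomposed, 3, 4) == "\u00E9");

    // ...but separates the base letter from its combining accent.
    assert(sliceCodePoints(combining, 3, 4) == "e");
}
```

Both strings display as four characters, yet the second holds five code
points, so a slice taken on code point boundaries can drop the accent. Keeping
the two together would require grapheme-aware processing.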

- Jonathan M Davis

