[phobos] UTF-8 string slicing

Jonathan M Davis jmdavisProg at gmx.com
Sat Aug 20 16:51:05 PDT 2011


On Saturday, August 20, 2011 13:11:44 unDEFER wrote:
> Big thanks, Jonathan!
> You give me very clearly explanations.
> But what you mean by "strings of char and wchar ... have no length
> property" if "string.length" really works? Is it a bug?

All arrays have a length property. It returns the number of elements in the 
array. The issue is std.range.hasLength, which is what is used with range-
based functions in template constraints and static ifs. hasLength is true for 
all arrays _except_ for arrays of char and wchar. This is because strings are 
ranges of dchar - of code points - whereas they are arrays of code units, and 
in UTF-8 and UTF-16, there can be more than one code unit per code point. In 
the general case, calling length on an array of char or wchar isn't going to 
give you the number of code points in the array. So, it's normally 
incorrect to use length with arrays of char and wchar in range-based 
functions.
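
To make the distinction concrete, here is a minimal sketch of what the 
compiler reports for the different string types. It assumes a Phobos where 
narrow strings are still auto-decoded to dchar by the range primitives, and 
the variable name s is only for illustration.

```d
import std.range : ElementType, hasLength;

// As a range, a string yields dchar (code points)...
static assert(is(ElementType!string == dchar));
// ...but as an array, its elements are char (UTF-8 code units).
static assert(is(typeof(string.init[0]) == immutable(char)));

// Narrow strings: the array has .length, but hasLength is false.
static assert(!hasLength!string);   // immutable(char)[]
static assert(!hasLength!wstring);  // immutable(wchar)[]

// Here one element is one code point, so hasLength is true.
static assert(hasLength!dstring);   // immutable(dchar)[]
static assert(hasLength!(int[]));

void main()
{
    string s = "Привет";
    // The array property still exists and counts UTF-8 code units.
    assert(s.length == 12);
}
```

So string.length compiling is not a bug. It is just answering a different 
question than range-based code is asking.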

string str = "hello world";
assert(str.length == walkLength(str));

This works because the string contains only ASCII characters, each of which 
fits in one code unit. Whereas this doesn't:

auto str = "Привет";
assert(str.length == walkLength(str)); // fails: 12 code units, 6 code points

since the characters are more than one code unit each. walkLength uses the 
length property if hasLength is true, but otherwise it iterates over the whole 
range and counts how many elements there are. So, in range-based 
functions, we use walkLength, not length, unless it is a section of code where 
we know that the range has a length property and that using it directly is 
correct (based on the template constraint and/or static ifs that the block of 
code is in).
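
As a rough sketch of that pattern, a range-based function might branch on 
hasLength like this. The name countElements is hypothetical and not part of 
Phobos. It does essentially what walkLength already does for you, so real 
code would normally just call walkLength.

```d
import std.range : hasLength, isInputRange, walkLength;

/// Hypothetical helper: number of elements in any input range.
size_t countElements(R)(R range)
if (isInputRange!R)
{
    static if (hasLength!R)
    {
        // hasLength guarantees that length is the element count.
        return range.length;
    }
    else
    {
        // No usable length (e.g. string or wstring): walk the range.
        return walkLength(range);
    }
}

void main()
{
    assert(countElements([1, 2, 3]) == 3);
    assert(countElements("hello world") == 11);
    assert(countElements("Привет") == 6);   // code points, not 12 code units
    assert(countElements("Привет"d) == 6);  // dstring takes the length branch
}
```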

- Jonathan M Davis

