[phobos] UTF-8 string slicing

Fri Aug 19 13:22:24 PDT 2011

On Friday, August 19, 2011 03:07 unDEFER wrote:
> On Fri, 19 Aug 2011 06:53:37 +0400, Jonathan M Davis <jmdavisProg at gmx.com>
> wrote:
> > Hmmm. Such a function isn't entirely a bad idea, but it also makes me a
> > bit nervous. Slicing is efficient. The slice function that you suggest
> > is not. I mean, it's efficient enough for what it's doing, but it's not
> > O(1) like slicing is, so having a slice function could be a bit
> > misleading.
> 
> I know that it is not efficient, but that just raises the question of why
> D has decided not to support 8-bit encodings. Only they make operations
> like this efficient.
> 
> > Once drop has been merged in, you'll be able to do this
> > auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);
> > to get the same effect. It may be worth adding such a function though.
> 
> I'm sorry, but looks like there is no "drop()" function.
> Anyway, thank you. I really don't understand how takeExactly works, but it
> works. For newbies it is really not obvious that std.range works fine with
> UTF-8 strings.

I said "once drop has been merged in, you'll be able to..." It's not in yet. 
There's a pull request for it (which was merged in this morning actually), and 
it's going to be in before the next release, but it's not in yet.

std.range most definitely works with UTF-8 strings. _All_ strings are 
considered ranges of dchar. And as ranges, strings of char and wchar are not 
considered sliceable or random access, and they have no length property (as 
none of that works when multiple elements in the array make up a single 
element in the range).
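
A quick way to see this is to ask the range traits directly. The following is only a sketch, assuming a Phobos where std.range exposes ElementType, hasLength, hasSlicing and isRandomAccessRange. Note that the array itself still has .length and can be sliced by code unit; it is only the range view of it that gives those up.

```d
import std.range : ElementType, hasLength, hasSlicing, isRandomAccessRange;

// As ranges, narrow strings present decoded code points, not code units.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// So the range traits report no length, no slicing and no random access...
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// ...whereas dstring qualifies for all three.
static assert(hasLength!dstring);
static assert(hasSlicing!dstring);
static assert(isRandomAccessRange!dstring);

void main() {}
```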

std.range.take creates a range with up to n elements of the range that it's 
given. It's not the same type as the original range, since it's lazy and takes 
elements from the original range only as you iterate it (it would take fewer 
than n elements from the range if there were fewer than n elements in the 
range, otherwise it takes n elements).
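
As a small illustration of take on a UTF-8 string (a sketch only; the sample text is made up, and the accented characters are written as escapes so that the byte counts are unambiguous):

```d
import std.algorithm : equal;
import std.range : take;

void main()
{
    string str = "h\u00E9llo w\u00F6rld"; // "héllo wörld"

    // take is lazy: it wraps str and decodes code points only as it is iterated.
    auto firstFive = take(str, 5);
    static assert(!is(typeof(firstFive) == string)); // a wrapper type, not a string
    assert(equal(firstFive, "h\u00E9llo"));

    // Asking for more elements than exist simply yields what is there.
    assert(equal(take("hi", 5), "hi"));
}
```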

std.range.takeExactly takes exactly n elements from the range, and if the 
range defines a length property, then it returns the exact same type. I was 
thinking that it managed to return the exact same type for strings as well, in 
spite of the fact that it has no length property, but it does not appear that 
it does. So, if you need the type to be string specifically, as opposed to a 
generic range of dchar, then takeExactly isn't going to work. You could call 
std.array.array on it to get a string again, but that's creating a new string, 
which obviously isn't as efficient.
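
To make the type difference concrete, here is a rough sketch. The extra std.conv.to!string call is an addition for the sake of the example: it ensures that the copy ends up as a string whatever element type std.array.array produces for a range of dchar in a given Phobos version.

```d
import std.algorithm : equal;
import std.array : array;
import std.conv : to;
import std.range : takeExactly;

void main()
{
    string str = "h\u00E9llo w\u00F6rld"; // "héllo wörld"

    // For a string, takeExactly hands back a lazy wrapper rather than a string.
    auto piece = takeExactly(str, 5);
    static assert(!is(typeof(piece) == string));
    assert(equal(piece, "h\u00E9llo"));
    assert(piece.length == 5); // the wrapper knows how many elements it holds

    // Getting a real string back means allocating and copying.
    string copy = to!string(array(piece));
    assert(copy == "h\u00E9llo");
    assert(copy.ptr !is str.ptr); // new memory, not a view into str
}
```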

I would point out though that what's generally done when someone needs random 
access to a string is to use dstring. So, if you're really looking to take 
slices out of the middle of a string like that, it's better to just use 
dstring. It _is_ sliceable and has a length property, because each element in 
an array of dchar is a dchar, unlike arrays of char and wchar, where multiple 
elements are required to make a dchar.
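
A minimal sketch of the dstring approach, assuming the text starts out as a UTF-8 string and is converted once with std.conv.to (the sample text is made up):

```d
import std.conv : to;

void main()
{
    string str = "h\u00E9llo w\u00F6rld"; // "héllo wörld"
    dstring dstr = to!dstring(str);       // one O(n) conversion up front

    assert(str.length == 13);  // UTF-8 code units
    assert(dstr.length == 11); // code points

    // Every element is a whole code point, so slicing by index is O(1).
    dstring slice = dstr[6 .. 9];
    assert(slice == "w\u00F6r"d);
}
```

The trade-off is memory: each element of a dstring takes four bytes regardless of the character.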

> > Certainly
> > auto s = slice(firstIndex, lastIndex);
> > is cleaner. If we add it though, then we should probably give it a
> > different name. Maybe sliceByElementType? That does seem a bit long
> > though, if accurate.

That would make sense if we restricted it to strings, but if we added the 
function, it would be useful for any range which didn't define a length 
property, so we wouldn't be making it string-specific, and so subString 
wouldn't make any sense as a function name. Though, come to think of it, for 
any type of range other than an array of char or wchar, such a function would 
not be able to return the original type, so its value is certainly less in 
the general case.
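
For illustration, such a generic function might look roughly like the sketch below. The name sliceByElement is hypothetical and is not part of Phobos, and the sketch assumes a Phobos that already has std.range.drop. The result is a lazy range rather than the original type, which is exactly the limitation described above.

```d
import std.algorithm : equal, filter;
import std.range : drop, isInputRange, takeExactly;

// Hypothetical helper, not part of Phobos: walks past `first` elements
// (O(first) for ranges without random access), then lazily exposes the
// next `last - first` elements.
auto sliceByElement(R)(R range, size_t first, size_t last)
if (isInputRange!R)
{
    assert(first <= last);
    return takeExactly(drop(range, first), last - first);
}

void main()
{
    // Works on a UTF-8 string, counting code points rather than bytes.
    assert(equal(sliceByElement("h\u00E9llo w\u00F6rld", 6, 9), "w\u00F6r"));

    // Works equally on any other range without a length property.
    auto evens = filter!(a => a % 2 == 0)([1, 2, 3, 4, 5, 6, 7, 8]);
    assert(equal(sliceByElement(evens, 1, 3), [4, 6]));
}
```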

Regardless, given the inefficiencies involved, I think that we should be 
discouraging taking random slices of strings or wstrings. There's no reason to 
make it so that you can't do it, but including a function in Phobos to do it 
makes it overly easy IMHO. Someone who needs to be taking slices from the 
middle of strings like that really should be using dstrings in most cases. If 
it's a bit ugly to slice the middle of a string, that's probably a good thing.

As Sean pointed out, std.utf.toUTFindex (which should probably be renamed to 
toUTFIndex to be properly camelcased, but I don't know if we'll fix that or 
not) will give you the index into the string that you need. It takes a code 
point index and returns the corresponding code unit index (toUCSindex goes 
the other way).

auto firstIndex = str.toUTFindex(7);
auto lastIndex = firstIndex + str[firstIndex .. $].toUTFindex(8);
auto slice = str[firstIndex .. lastIndex];

should give you the equivalent of str[7 .. 15] if str were a dstring. You 
could also do it as

auto slice = str[str.toUTFindex(7) .. str.toUTFindex(15)];

which would be clearer, but it would also be less efficient.
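
Here is a small runnable sketch of the two-step approach. It uses std.utf.toUTFindex, which maps a code point index to a code unit index, and the sample string and indices are made up for the example. The second lookup starts from firstIndex rather than from the beginning of the string, which is where the efficiency gain comes from, and the result is a true slice of the original string with no copy.

```d
import std.conv : to;
import std.utf : toUTFindex;

void main()
{
    string str = "h\u00E9llo w\u00F6rld"; // "héllo wörld", 11 code points, 13 bytes

    // Code point 6 starts at byte 7, because the accented e takes two bytes.
    auto firstIndex = str.toUTFindex(6);
    // Walk only the remainder of the string to find the end of the slice.
    auto lastIndex = firstIndex + str[firstIndex .. $].toUTFindex(3);
    assert(firstIndex == 7 && lastIndex == 11);

    auto slice = str[firstIndex .. lastIndex];
    assert(slice == "w\u00F6r");

    // Same text as slicing the dstring form by code point index.
    assert(slice == to!string(to!dstring(str)[6 .. 9]));

    // The slice is a view into str, not a new allocation.
    assert(slice.ptr is str.ptr + firstIndex);
}
```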

So, we _might_ add a slicing function to Phobos, but I'm skeptical of the 
wisdom of making it that easy to slice a string or wstring like that given how 
inefficient it is. std.utf already makes it possible in as efficient a manner 
as is possible - just not in as concise a way - and if you're really taking 
slices out of the middle of a string, you really should be doing it with 
dstrings. It's far more efficient that way.

- Jonathan M Davis