[phobos] UTF-8 string slicing
Jonathan M Davis
jmdavisProg at gmx.com
Fri Aug 19 13:22:24 PDT 2011
On Friday, August 19, 2011 03:07 unDEFER wrote:
> On Fri, 19 Aug 2011 06:53:37 +0400, Jonathan M Davis <jmdavisProg at gmx.com>
> > Hmmm. Such a function isn't entirely a bad idea, but it also makes me a
> > bit nervous. Slicing is efficient. The slice function that you suggest is
> > not. I mean, it's efficient enough for what it's doing, but it's not O(1)
> > like slicing is, so having a slice function could be a bit misleading.
> I know that it is not efficient, but that just raises the question of why D
> has decided not to support 8-bit encodings. Only they make operations like
> this efficient.
> > Once drop has been merged in, you'll be able to do this
> > auto s = takeExactly(drop(str, firstIndex), lastIndex - firstIndex);
> > to get the same effect. It may be worth adding such a function though.
> I'm sorry, but it looks like there is no "drop()" function.
> Anyway, thank you. I really don't understand how takeExactly works, but it
> works. For newbies it is really not obvious that std.range works fine with
> UTF-8 strings.
I said "once drop has been merged in, you'll be able to..." It's not in yet.
There's a pull request for it (which was merged in this morning actually), and
it's going to be in before the next release, but it's not in yet.
std.range most definitely works with UTF-8 strings. _All_ strings are
considered ranges of dchar. And as ranges, strings of char and wchar are not
considered sliceable or random access, and they have no length property (as
none of that works when multiple elements in the array make up a single
element in the range).
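To make that concrete, here is a small sketch of what the range traits report for the different string types. It assumes a Phobos where strings are auto-decoded as ranges of dchar (which is what is being described here), and the sample text is just an illustration.

```d
import std.range; // ElementType, hasLength, hasSlicing, isRandomAccessRange, walkLength

// As a range, a narrow string yields decoded dchars, not chars or wchars.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// As a range, string has no length, no slicing, and no random access.
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// dstring keeps all three, since one array element is one code point.
static assert(hasLength!dstring);
static assert(hasSlicing!dstring);
static assert(isRandomAccessRange!dstring);

unittest
{
    string s = "héllo";
    assert(s.length == 6);     // array length counts UTF-8 code units
    assert(s.walkLength == 5); // walking the range counts code points, O(n)
}
```

Note that the array itself still has .length and can still be sliced with s[a .. b]; it's only the range traits that pretend otherwise, because those operations work in code units.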
std.range.take creates a range with up to n elements of the range that it's
given. It's not the same type as the original range, since it's lazy and takes
elements from the original range only as you iterate it (it would take fewer
than n elements from the range if there were fewer than n elements in the
range; otherwise it takes exactly n elements).
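A minimal sketch of take on a UTF-8 string, with sample strings of my own choosing:

```d
import std.algorithm : equal;
import std.range : take;

unittest
{
    string s = "héllo wörld";

    // take wraps the string lazily; the result is not a string itself.
    auto firstFive = take(s, 5);
    static assert(!is(typeof(firstFive) == string));

    // It yields the first five code points, not the first five bytes.
    assert(equal(firstFive, "héllo"));

    // With fewer than n elements available, it just yields what is there.
    assert(equal(take("hi", 5), "hi"));
}
```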
std.range.takeExactly takes exactly n elements from the range, and if the
range defines a length property, then it returns the exact same type. I was
thinking that it managed to return the exact same type for strings as well, in
spite of the fact that it has no length property, but it does not appear that
it does. So, if you need the type to be string specifically, as opposed to a
generic range of dchar, then takeExactly isn't going to work. You could call
std.array.array on it to get a string again, but that's creating a new string,
which obviously isn't as efficient.
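Here is a sketch of that copying approach. The helper name copySlice is hypothetical, not anything in Phobos. It also assumes that std.array.array on a range of dchar produces a dchar[] (which, as far as I know, is what current Phobos does), so it re-encodes with std.utf.toUTF8 to end up with a string.

```d
import std.array : array;
import std.range : drop, takeExactly;
import std.utf : toUTF8;

/// Hypothetical helper: copies code points [first, last) of str into a new
/// string. str must contain at least `last` code points.
string copySlice(string str, size_t first, size_t last)
{
    // Lazy view over the requested code points; not a string.
    auto lazySlice = takeExactly(drop(str, first), last - first);
    static assert(!is(typeof(lazySlice) == string));

    // array() allocates a dchar[]; toUTF8 allocates again for the string.
    return toUTF8(array(lazySlice));
}

unittest
{
    assert(copySlice("héllo wörld", 6, 11) == "wörld");
}
```

So you get a string back, but at the cost of walking the string and allocating, where a real slice would do neither.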
I would point out though that what's generally done when someone needs random
access to a string is to use dstring. So, if you're really looking to take
slices out of the middle of a string like that, it's better to just use
dstring. It _is_ sliceable and has a length property, because each element in
an array of dchar is a dchar, unlike arrays of char and wchar, where multiple
elements are required to make a dchar.
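For instance (a sketch with a made-up sample string), converting once with std.conv.to and then slicing by code point:

```d
import std.conv : to;

unittest
{
    string s = "héllo wörld";

    // One O(n) conversion up front...
    dstring d = to!dstring(s);

    // ...after which length, indexing and slicing are all per code point.
    assert(d.length == 11);
    assert(d[1] == 'é');
    assert(d[6 .. 11] == "wörld"d);
}
```

The trade-off is memory: every element of a dstring is 4 bytes, even for plain ASCII text.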
> > Certainly
> > auto s = slice(firstIndex, lastIndex);
> > is cleaner. If we add it though, then we should probably give it a
> > different name. Maybe sliceByElementType? That does seem a bit long
> > though, if accurate.
That would make sense if we restricted it to strings, but if we added the
function, it would be useful for any range which didn't define a length
property, so we wouldn't be making it string-specific, and so subString
wouldn't make any sense as a function name. Though, come to think of it, for
any type of range other than an array of char or wchar, such a function would
not be able to return the original type, so its value is certainly less in
the general case.
Regardless, given the inefficiencies involved, I think that we should be
discouraging taking random slices of strings or wstrings. There's no reason to
make it so that you can't do it, but including a function in Phobos to do it
makes it overly easy IMHO. Someone who needs to be taking slices from the
middle of strings like that really should be using dstrings in most cases. If
it's a bit ugly to slice the middle of a string, that's probably a good thing.
As Sean pointed out, std.utf.toUTFindex (which should probably be renamed to
toUTFIndex to be properly camelcased, but I don't know if we'll fix that or
not) will give you the index into the string that you need.
auto firstIndex = str.toUTFindex(7);
auto lastIndex = firstIndex + str[firstIndex .. $].toUTFindex(8);
auto slice = str[firstIndex .. lastIndex];
should give you the equivalent of str[7 .. 15] if str were a dstring. You
could also do it as
auto slice = str[str.toUTFindex(7) .. str.toUTFindex(15)];
which would be clearer, but it would also be less efficient, since the second
call walks the first 7 code points all over again.
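Wrapped up as a function, that might look something like the sketch below. The name sliceByCodePoints is hypothetical. It uses std.utf.toUTFindex, which, as far as I know, is the std.utf function that maps a code point index to a code unit index (toUCSindex goes in the opposite direction). It assumes that both indices are within the string.

```d
import std.utf : toUTFindex;

/// Hypothetical helper: the slice of str covering code points [first, last).
/// Still O(n) in the number of code points walked, but it returns a real
/// slice of the original string and allocates nothing.
string sliceByCodePoints(string str, size_t first, size_t last)
{
    assert(first <= last);

    // Code unit offset of code point number `first`.
    immutable start = str.toUTFindex(first);

    // Continue from `start` rather than from the beginning of the string,
    // then shift the result so that it is relative to str again.
    immutable end = start + str[start .. $].toUTFindex(last - first);

    return str[start .. end];
}

unittest
{
    string s = "héllo wörld";
    assert(sliceByCodePoints(s, 6, 11) == "wörld");
    assert(sliceByCodePoints(s, 1, 5) == "éllo");
}
```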
So, we _might_ add a slicing function to Phobos, but I'm skeptical of the
wisdom of making it that easy to slice a string or wstring like that given how
inefficient it is. std.utf already makes it possible in as efficient a manner
as is possible - just not in as concise a way - and if you're really taking
slices out of the middle of a string, you really should be doing it with
dstrings. It's far more efficient that way.
- Jonathan M Davis