Ranges
Jonathan M Davis
jmdavisProg at gmx.com
Sat Mar 12 16:05:37 PST 2011
On Saturday 12 March 2011 14:02:00 Jonas Drewsen wrote:
> Hi,
>
> I'm working a bit with ranges atm. but there are definitely some
> things that are not clear to me yet. Can anyone tell me why the char
> arrays cannot be copied but the int arrays can?
>
> import std.stdio;
> import std.algorithm;
>
> void main(string[] args) {
>
> // This works
> int[] a1 = [1,2,3,4];
> int[] a2 = [5,6,7,8];
> copy(a1, a2);
>
> // This does not!
> char[] a3 = ['1','2','3','4'];
> char[] a4 = ['5','6','7','8'];
> copy(a3, a4);
>
> }
>
> Error message:
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> does not match any function template declaration
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> cannot deduce template function from argument types !()(char[],char[])
Character arrays / strings are not exactly normal. And there's a very good
reason for it: Unicode.
In Unicode, a character is generally a single code point (there are also
graphemes which involve combining code points to add accents and superscripts
and whatnot to create a single character, but we'll ignore that in this
discussion - it's complicated enough as it is). Depending on the encoding, that
code point may be made up of one - or more - code units. UTF-8 uses 8-bit code
units. UTF-16 uses 16-bit code units. And UTF-32 uses 32-bit code units. char is
a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is a UTF-32 code unit.
UTF-32 is the _only_ one of those three which _always_ has one code unit per
code point.
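To make the code unit counts concrete, here is a small sketch. The two code points (U+00E9 and U+1F600) are just examples that I picked; they're written as escapes so that the encoding of the source file doesn't matter.

```d
import std.stdio;

void main()
{
    // U+00E9 is a single code point in every encoding.
    string  s8  = "\u00E9";   // UTF-8
    wstring s16 = "\u00E9"w;  // UTF-16
    dstring s32 = "\u00E9"d;  // UTF-32

    // .length counts code units, not code points.
    assert(s8.length == 2);
    assert(s16.length == 1);
    assert(s32.length == 1);

    // U+1F600 lies outside the Basic Multilingual Plane.
    assert("\U0001F600".length == 4);   // four UTF-8 code units
    assert("\U0001F600"w.length == 2);  // a UTF-16 surrogate pair
    assert("\U0001F600"d.length == 1);  // one UTF-32 code unit

    writeln("code unit counts are as expected");
}
```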
With an array of integers you can index it and slice it and be sure that
everything that you're doing is valid. If you look at a single element, you know
that it's a valid int. If you slice it, you know that every int in there is
valid. If you're dealing with a dstring or dchar[], then the same still holds.
A dstring or dchar[] is an array of UTF-32 code units. Every code point is a
single code unit, so every element in the array is a valid code point. You can
take an arbitrary element in that array and know that it's a valid code point.
You can slice it wherever you want and you still have a valid dstring or
dchar[]. The same does _not_ hold for char[] and wchar[].
char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In
both of those encodings, a single code point can require multiple code units
(up to 4 in UTF-8 and up to 2 in UTF-16). So, for instance, a code point in a
char[] could have 4 code units. That means that _4_ elements of that char[] make
up a _single_ code point. You'd need _all_ 4 of those elements to create a
single, valid character. So, you _can't_ just
take an arbitrary element in a char[] or wchar[] and expect it to be valid. You
_can't_ just slice it anywhere. The resulting array stands a good chance of
being invalid. You have to slice on code point boundaries - otherwise you could
slice characters in half and end up with an invalid string. So, unlike other
arrays, it just doesn't work to treat char[] and wchar[] as random access ranges
of their element type. What the programmer cares about is characters - dchars -
not chars or wchars.
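Here's a rough sketch of what slicing in the wrong place does. It assumes std.utf.validate, which throws a UTFException when a string isn't valid, and the sample string is just one that I made up.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // The second character (U+00E9) takes two UTF-8 code units.
    string s = "h\u00E9llo";
    assert(s.length == 6);

    // Slicing on code point boundaries gives valid UTF-8.
    validate(s[0 .. 1]);  // just "h"
    validate(s[0 .. 3]);  // "h" plus both code units of U+00E9

    // Slicing between the two code units of U+00E9 does not.
    assertThrown!UTFException(validate(s[0 .. 2]));
}
```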
So, the way this is handled is that char[], wchar[], and dchar[] are all treated
as ranges of dchar. In the case of dchar[], this is nothing special. You can
index it and slice it as normal. So, it is a random access range. However, in
the case of char[] and wchar[], that means that when you're iterating over them,
you're not dealing with a single element of the array at a time. front
returns a dchar, and popFront() pops off however many elements made up front.
It's like with foreach. If you iterate a char[] with auto or char, then each
individual element (code unit) is given:
foreach(c; myStr) {}
But if you iterate over with dchar, then each code point is given as a dchar:
foreach(dchar c; myStr) {}
If you were to try and iterate over a char[] by char, then you would be looking
at code units rather than code points which is _rarely_ what you want. If you're
dealing with anything other than pure ASCII, you _will_ have bugs if you do
that. You're supposed to use dchar with foreach and character arrays. That way,
each value you process is a valid character. Ranges do the same, only you don't
give them an iteration type, so they're _always_ iterating over dchar.
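As a quick illustration of the difference between the two foreach loops, something like this counts the iterations of each. countCodeUnits and countCodePoints are hypothetical helpers that I made up for the example; they aren't in Phobos.

```d
// Hypothetical helpers, named only for this illustration.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)        // c is inferred as char: one code unit per step
        ++n;
    return n;
}

size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)  // foreach decodes one code point per step
        ++n;
    return n;
}

void main()
{
    string myStr = "h\u00E9llo";  // 5 characters, 6 UTF-8 code units
    assert(countCodeUnits(myStr) == 6);
    assert(countCodePoints(myStr) == 5);
}
```

For pure ASCII the two counts are the same, which is why this kind of bug tends to hide until someone feeds in non-ASCII text.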
So, when you're using a range of char[] or wchar[], you're really using a range
of dchar. These ranges are bidirectional. They can't be sliced, and they can't
be indexed (since doing so would likely be invalid). This generally works very
well. It's exactly what you want in most cases. The problem is that the range
that you're iterating over is then effectively of a different type than the
original char[] or wchar[].
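You can check this with the traits in std.range. This is only a sketch, and it assumes a Phobos which decodes char[] and wchar[] in the way described here; the sample string is made up.

```d
import std.range;

void main()
{
    // As ranges, all three array types have dchar elements.
    static assert(is(ElementType!(char[])  == dchar));
    static assert(is(ElementType!(wchar[]) == dchar));
    static assert(is(ElementType!(dchar[]) == dchar));

    // char[] is bidirectional, but not random access or sliceable.
    static assert( isBidirectionalRange!(char[]));
    static assert(!isRandomAccessRange!(char[]));
    static assert(!hasSlicing!(char[]));
    static assert( isRandomAccessRange!(dchar[]));

    // front decodes a whole code point, and popFront removes
    // every code unit which made up that code point.
    char[] s = "\u00E9t\u00E9".dup;  // 3 code points, 5 code units
    assert(s.length == 5);
    assert(s.front == '\u00E9');
    s.popFront();
    assert(s.length == 3);           // two code units were popped
    assert(s.back == '\u00E9');
}
```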
You can't just take two ranges of dchar of the same length and necessarily have
them fit in the same char[] or wchar[]. They have the same length, because they
have the same number of code points. However, they could have a different number
of code _units_, so the lengths of the actual arrays could differ. So, you can't
just take an arbitrary dchar range and copy it to another arbitrary dchar range.
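For instance, the two strings in this sketch (both made up for the example) have the same number of code points but need arrays of different lengths. walkLength from std.range counts the elements of the range, so for a string it counts code points.

```d
import std.range : walkLength;

void main()
{
    string a = "abc";
    string b = "\u00E9\u00E9\u00E9";

    // As ranges of dchar, both have three elements.
    assert(a.walkLength == 3);
    assert(b.walkLength == 3);

    // As arrays of char, they do not have the same length.
    assert(a.length == 3);
    assert(b.length == 6);
}
```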
The way that this is dealt with in the case of a function like copy is that what
you're copying _to_ must be an output range. char[] and wchar[] are _not_ output
ranges of dchar, because the number of code units per code point varies. So,
they don't work with copy. You need to use a dchar[] as the output range if you
want to use strings with copy.
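So, a version of your example which should compile might look something like this. It's only a sketch: the one real change is that a4 is a dchar[] rather than a char[].

```d
import std.algorithm : copy;

void main()
{
    char[]  a3 = ['1', '2', '3', '4'];
    dchar[] a4 = "5678"d.dup;  // the target is now an array of dchar

    // copy returns whatever part of the target was not filled.
    auto rest = copy(a3, a4);

    assert(a4 == "1234"d);
    assert(rest.length == 0);
}
```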
Now, in some cases, it might be possible to special case some of the range
functions to treat char[] and wchar[] as arrays instead of ranges (in the case
of copy, that's probably possible if both arguments are of the same type), but
that can't be done in the general case. You could open an enhancement request
for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are of
the same type.
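In the meantime, if what you actually want is to copy the code units from one char[] to another char[] of the same length, then plain array slice assignment avoids the range functions entirely. This is a workaround sketch rather than something that copy itself does:

```d
void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    char[] a4 = ['5', '6', '7', '8'];

    // Element-wise array copy of the code units.
    // Both slices must have the same length.
    a4[] = a3[];

    assert(a4 == "1234");
}
```

Since the whole source array is copied, no code point gets split, so a4 is valid as long as a3 was.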
- Jonathan M Davis