Ranges

Sat Mar 12 23:20:11 PST 2011

Hi Jonathan,

    Thank you very much for your in-depth answer!

    It should indeed go into a FAQ somewhere, I think. I did know about the 
code point/code unit stuff but had no idea that ranges of char are handled 
using dchar internally. This makes sense but is an easy pitfall for 
newcomers trying to use std.{algorithm,array,range} for char[].

Thanks
Jonas

On 13/03/11 01.05, Jonathan M Davis wrote:
> On Saturday 12 March 2011 14:02:00 Jonas Drewsen wrote:
>> Hi,
>>
>>      I'm working a bit with ranges atm. but there are definitely some
>> things that are not clear to me yet. Can anyone tell me why the char
>> arrays cannot be copied but the int arrays can?
>>
>> import std.stdio;
>> import std.algorithm;
>>
>> void main(string[] args) {
>>
>>     // This works
>>     int[] a1 = [1,2,3,4];
>>     int[] a2 = [5,6,7,8];
>>     copy(a1, a2);
>>
>>     // This does not!
>>     char[] a3 = ['1','2','3','4'];
>>     char[] a4 = ['5','6','7','8'];
>>     copy(a3, a4);
>>
>> }
>>
>> Error message:
>>
>> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
>> (isInputRange!(Range1)&&  isOutputRange!(Range2,ElementType!(Range1)))
>> does not match any function template declaration
>>
>> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
>> (isInputRange!(Range1)&&  isOutputRange!(Range2,ElementType!(Range1)))
>> cannot deduce template function from argument types !()(char[],char[])
>
> Character arrays / strings are not exactly normal. And there's a very good
> reason for it: Unicode.
>
> In Unicode, a character is generally a single code point (there are also
> graphemes which involve combining code points to add accents and superscripts
> and whatnot to create a single character, but we'll ignore that in this
> discussion - it's complicated enough as it is). Depending on the encoding, that
> code point may be made up of one - or more - code units. UTF-8 uses 8-bit code
> units. UTF-16 uses 16-bit code units. And UTF-32 uses 32-bit code units. char is
> a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is a UTF-32 code unit.
> UTF-32 is the _only_ one of those three which _always_ has one code unit per
> code point.
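
To make the code unit counts above concrete, here is a minimal sketch. The
helper names utf8Units and utf16Units are hypothetical, and it assumes a
current Phobos where std.conv.to transcodes between string types:

```d
import std.conv : to;

/// Hypothetical helper: number of UTF-8 code units (chars) needed for `text`.
size_t utf8Units(dstring text) { return text.to!string.length; }

/// Hypothetical helper: number of UTF-16 code units (wchars) needed for `text`.
size_t utf16Units(dstring text) { return text.to!wstring.length; }

void main()
{
    // Each dstring below holds exactly one code point (one UTF-32 code unit).
    dstring ascii  = "a"d;
    dstring accent = "\u00E9"d;     // e with acute accent
    dstring clef   = "\U0001D11E"d; // musical G clef, outside the BMP

    assert(ascii.length == 1 && accent.length == 1 && clef.length == 1);

    assert(utf8Units(ascii)  == 1 && utf16Units(ascii)  == 1);
    assert(utf8Units(accent) == 2 && utf16Units(accent) == 1);
    assert(utf8Units(clef)   == 4 && utf16Units(clef)   == 2);
}
```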
>
> With an array of integers you can index it and slice it and be sure that
> everything that you're doing is valid. If you look at a single element, you know
> that it's a valid int. If you slice it, you know that every int in there is
> valid. If you're dealing with a dstring or dchar[], then the same still holds.
>
> A dstring or dchar[] is an array of UTF-32 code units. Every code point is a
> single code unit, so every element in the array is a valid code point. You can
> take an arbitrary element in that array and know that it's a valid code point.
> You can slice it wherever you want and you still have a valid dstring or
> dchar[]. The same does _not_ hold for char[] and wchar[].
>
> char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In
> both of those encodings, a single code point can require multiple code units.
> So, for instance, a code point in UTF-8 could have 4 code units. That means
> that _4_ elements of that char[] make up a _single_ code point. You'd need _all_
> 4 of those elements to create a single, valid character. So, you _can't_ just
> take an arbitrary element in a char[] or wchar[] and expect it to be valid. You
> _can't_ just slice it anywhere. The resulting array stands a good chance of
> being invalid. You have to slice on code point boundaries - otherwise you could
> slice characters in half and end up with an invalid string. So, unlike other
> arrays, it just doesn't work to treat char[] and wchar[] as random access ranges
> of their element type. What the programmer cares about is characters - dchars -
> not chars or wchars.
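
The slicing problem described above can be sketched as follows. The helper
isValidUtf8 is hypothetical; it only wraps std.utf.validate, which throws a
UTFException on malformed input:

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "ete" with both e's accented: 3 code points, 5 UTF-8 code units.
    string s = "\u00E9t\u00E9";
    assert(s.length == 5);

    assert(isValidUtf8(s));
    assert(isValidUtf8(s[0 .. 2]));  // slice ends on a code point boundary
    assert(!isValidUtf8(s[0 .. 1])); // cuts the first code point in half
    assert(!isValidUtf8(s[1 .. 3])); // starts in the middle of a code point
}
```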
>
> So, the way this is handled is that char[], wchar[], and dchar[] are all treated
> as ranges of dchar. In the case of dchar[], this is nothing special. You can
> index it and slice it as normal. So, it is a random access range. However, in
> the case of char[] and wchar[], that means that when you're iterating over them,
> you're not dealing with a single element of the array at a time. front
> returns a dchar, and popFront() pops off however many elements made up front.
> It's like with foreach. If you iterate a char[] with an inferred type or with
> char, then each individual element (code unit) is given:
>
> foreach(c; myStr) {}
>
> But if you iterate over with dchar, then each code point is given as a dchar:
>
> foreach(dchar c; myStr) {}
>
> If you were to try and iterate over a char[] by char, then you would be looking
> at code units rather than code points which is _rarely_ what you want. If you're
> dealing with anything other than pure ASCII, you _will_ have bugs if you do
> that. You're supposed to use dchar with foreach and character arrays. That way,
> each value you process is a valid character. Ranges do the same, only you don't
> give them an iteration type, so they're _always_ iterating over dchar.
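
The difference between the two foreach forms and the range primitives can be
sketched like this. The three counting helpers are hypothetical names, and
the sketch assumes a Phobos where narrow strings are still decoded to dchar
by front and popFront:

```d
import std.range : ElementType, empty, front, popFront;

/// Hypothetical helper: iterate by code unit (char).
size_t countUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

/// Hypothetical helper: iterate by code point (foreach decodes to dchar).
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Hypothetical helper: iterate with the range primitives.
size_t countByRange(string s)
{
    size_t n;
    for (; !s.empty; s.popFront())
        ++n;
    return n;
}

void main()
{
    string s = "a\u00E9"; // 'a' plus an accented e: 3 code units, 2 code points

    assert(countUnits(s) == 3);
    assert(countPoints(s) == 2);
    assert(countByRange(s) == 2);

    // As a range, a string has dchar elements, not char elements.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'a');
}
```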
>
> So, when you're using a range of char[] or wchar[], you're really using a range
> of dchar. These ranges are bi-directional. They can't be sliced, and they can't
> be indexed (since doing so would likely be invalid). This generally works very
> well. It's exactly what you want in most cases. The problem is that the range
> that you're iterating over is then effectively of a different type than the
> original char[] or wchar[].
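
The range categories described above can be checked at compile time. This is
only a sketch against the std.range traits as I understand them in a current
Phobos; nothing here is specific to the original test program:

```d
import std.range;

// char[] and wchar[] present themselves as ranges of dchar...
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));

// ...that are bidirectional, but have no indexing, slicing or length
// as far as the range traits are concerned.
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!hasLength!(char[]));

// dchar[] is an ordinary random access range.
static assert(isRandomAccessRange!(dchar[]));
static assert(hasSlicing!(dchar[]));
static assert(hasLength!(dchar[]));

void main() {}
```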
>
> You can't just take two ranges of dchar of the same length and necessarily have
> them fit in the same char[] or wchar[]. They have the same length, because they
> have the same number of code points. However, they could have a different number
> of code _units_, so the lengths of the actual arrays could differ. So, you can't
> just take an arbitrary dchar range and copy it to another arbitrary dchar range.
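
A small sketch of that mismatch, using std.range.walkLength to count code
points; the helper name sameShape is hypothetical:

```d
import std.range : walkLength;

/// Hypothetical helper: true if `a` and `b` have the same number of code
/// points AND the same number of code units.
bool sameShape(string a, string b)
{
    return a.walkLength == b.walkLength && a.length == b.length;
}

void main()
{
    string plain    = "abc";                // 3 code points, 3 code units
    string accented = "\u00E9\u00E9\u00E9"; // 3 code points, 6 code units

    // Same length when seen as ranges of dchar...
    assert(plain.walkLength == 3);
    assert(accented.walkLength == 3);

    // ...but the underlying arrays differ in size, so one would not fit
    // into the storage of the other.
    assert(plain.length == 3);
    assert(accented.length == 6);
    assert(!sameShape(plain, accented));
}
```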
>
> The way that this is dealt with in the case of a function like copy is that what
> you're copying _to_ must be an output range. char[] and wchar[] are _not_ output
> ranges, because of their differing number of code units per code point. So, they
> don't work with copy. You need to use a dchar[] as the output range if you want
> to use strings with copy.
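
Following that suggestion, the failing part of the test program could be
written with a dchar[] target. This is only a sketch: copyIntoDchars is a
hypothetical helper, and it is exercised here with the ASCII-only input from
the original question:

```d
import std.algorithm : copy;

/// Hypothetical helper: copy `source` into a freshly allocated dchar[].
dchar[] copyIntoDchars(char[] source)
{
    // A code point never needs more dchars than it needs chars, so
    // source.length elements are always enough room.
    auto target = new dchar[](source.length);

    // copy returns the part of the target that was not written to.
    auto rest = copy(source, target);
    return target[0 .. $ - rest.length];
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    dchar[] a4 = copyIntoDchars(a3);
    assert(a4 == "1234"d);
}
```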
>
> Now, in some cases, it might be possible to special case some of the range
> functions to treat char[] and wchar[] as arrays instead of ranges (in the case
> of copy, that's probably possible if both arguments are of the same type), but
> that can't be done in the general case. You could open an enhancement request
> for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are of
> the same type.
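
Until something like that exists, treating both char[] values as plain arrays
can be done by hand with a slice assignment, which copies code units and
never goes through the range primitives. A minimal sketch; copyUnits is a
hypothetical helper and it assumes both arrays have the same length:

```d
/// Hypothetical helper: copy code units from `source` into `target`.
/// Both slices must have the same length.
void copyUnits(const(char)[] source, char[] target)
{
    assert(source.length == target.length, "lengths must match");
    target[] = source[]; // element-wise array copy, no decoding involved
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    char[] a4 = ['5', '6', '7', '8'];

    copyUnits(a3, a4);
    assert(a4 == "1234");
}
```

Since whole arrays are copied, the target ends up valid UTF-8 whenever the
source was.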
>
> - Jonathan M Davis