Proposal for fixing dchar ranges

Mon Mar 10 15:15:45 PDT 2014

On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin  
<john.loughran.colvin at gmail.com> wrote:

> On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
>> I proposed this inside the long "major performance problem with  
>> std.array.front" thread, and I've also proposed it before, a long time ago.
>>
>> But it seems to be getting no attention buried in that thread, not even  
>> negative attention :)
>>
>> An idea to fix all the problems I see with char[] being treated  
>> specially by phobos: introduce an actual string type, with char[] as  
>> backing, that is a dchar range, that actually dictates the rules we  
>> want. Then, make the compiler use this type for literals.
>>
>> e.g.:
>>
>> struct string {
>>    immutable(char)[] representation;
>>    this(immutable(char)[] data) { representation = data; }
>>    ... // dchar range primitives
>> }
>>
>> Then, a char[] array is simply an array of char.
>>
>> points:
>>
>> 1. No more issues with foreach(c; "cassé"), it iterates via dchar
>> 2. No more issues with "cassé"[4], it is a static compiler error.
>> 3. No more awkward ASCII manipulation using ubyte[].
>> 4. No more phobos schizophrenia saying char[] is not an array.
>> 5. No more special casing char[] array templates to fool the compiler.
>> 6. Any other special rules we come up with can be dictated by the  
>> library, and not ignored by the compiler.
>>
>> Note, std.algorithm.copy(string1, mutablestring) will still  
>> decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.  
>> Using std.algorithm.copy(string1.representation,  
>> mutablestring.representation) will avoid the issues.
>>
>> I imagine only code that is currently UTF ignorant will break, and that  
>> code is easily 'fixed' by adding the 'representation' qualifier.
>>
>> -Steve
>
> just to check I understand this fully:
>
> in this new scheme, what would this do?
>
> auto s = "cassé".representation;
> foreach(i, c; s) write(i, ':', c, ' ');
> writeln(s);
>
> Currently - without the .representation - I get
>
> 0:c 1:a 2:s 3:s 4:e 5:̠6:`
> cassé
>
> or, to spell it out a bit more:
> 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
> cassé

The plan is for foreach on s to iterate by char, and foreach on "cassé"  
to iterate by dchar.

What this means is that the accent will be iterated separately from the e,  
and likely gets put onto the colon after 5. However, the half code-units  
that have no meaning on their own (xCC and x81) would not be iterated.
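
To make that concrete, here is a rough sketch of what such a wrapper could look like. The name String and the use of std.utf.decode and std.utf.stride for the range primitives are my assumptions for illustration only; the real type would be called string and be known to the compiler.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical stand-in for the proposed string type.
struct String
{
    immutable(char)[] representation;

    // Input range primitives: the range yields dchar, not char.
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // decode one code point
    }

    void popFront()
    {
        // Skip all code units of the first code point.
        representation = representation[stride(representation, 0) .. $];
    }
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent).
    auto s = String("casse\u0301");

    size_t n;
    foreach (dchar c; s) // iterates a copy of s via empty/front/popFront
        ++n;

    writeln(n);                       // code points
    writeln(s.representation.length); // code units
}
```

There is deliberately no opIndex here, so s[4] does not compile, while s.representation[4] is still an ordinary array index.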

In your code above, using .representation would be equivalent to what you  
get now without .representation (i.e. iteration over char), and omitting  
.representation would be equivalent to this on today's compiler (except  
faster):

foreach(i, dchar c; s)
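
A small program for today's compiler that shows both iteration modes side by side. The helper names countCodeUnits and countCodePoints are made up for this example, and the literal spells the accent as an explicit U+0301 so the result does not depend on how the source file is normalized.

```d
import std.stdio : writef, writeln;

// Number of iterations when walking the UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Number of iterations when the compiler decodes to code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent).
    string s = "casse\u0301";

    // By char: the accent appears as two separate code units.
    foreach (i, char c; s)
        writef("%s:%02X ", i, cast(ubyte) c);
    writeln(); // 0:63 1:61 2:73 3:73 4:65 5:CC 6:81

    // By dchar: i is the index of the first code unit of each code point.
    foreach (i, dchar c; s)
        writef("%s:U+%04X ", i, cast(uint) c);
    writeln(); // 0:U+0063 1:U+0061 2:U+0073 3:U+0073 4:U+0065 5:U+0301

    writeln(countCodeUnits(s), " code units, ", countCodePoints(s), " code points");
}
```

Note that the accent is still a separate element in the dchar loop; decoding gets you code points, not graphemes.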

-Steve