Proposal for fixing dchar ranges

Tue Mar 11 01:57:43 PDT 2014

On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer 
wrote:
> On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin 
> <john.loughran.colvin at gmail.com> wrote:
>
>> On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
>> wrote:
>>> I proposed this inside the long "major performance problem 
>>> with std.array.front" thread, and I've also proposed it 
>>> before, a long time ago.
>>>
>>> But it seems to be getting no attention buried in that thread, 
>>> not even negative attention :)
>>>
>>> An idea to fix all the problems I see with char[] being 
>>> treated specially by Phobos: introduce an actual string type, 
>>> with char[] as backing, that is a dchar range and actually 
>>> dictates the rules we want. Then, make the compiler use this 
>>> type for literals.
>>>
>>> e.g.:
>>>
>>> struct string {
>>>   immutable(char)[] representation;
>>>   this(immutable(char)[] data) { representation = data; }
>>>   ... // dchar range primitives
>>> }
>>>
>>> Then, a char[] array is simply an array of char.
>>>
>>> points:
>>>
>>> 1. No more issues with foreach(c; "cassé"), it iterates via 
>>> dchar
>>> 2. No more issues with "cassé"[4], it is a static compiler 
>>> error.
>>> 3. No more awkward ASCII manipulation using ubyte[].
>>> 4. No more phobos schizophrenia saying char[] is not an array.
>>> 5. No more special casing char[] array templates to fool the 
>>> compiler.
>>> 6. Any other special rules we come up with can be dictated by 
>>> the library, and not ignored by the compiler.
>>>
>>> Note, std.algorithm.copy(string1, mutablestring) will still 
>>> decode/encode, but it's more explicit. It's EXPLICITLY a 
>>> dchar range. Using std.algorithm.copy(string1.representation, 
>>> mutablestring.representation) avoids the decode/encode step.
>>>
>>> I imagine only code that is currently UTF ignorant will 
>>> break, and that code is easily 'fixed' by adding the 
>>> 'representation' qualifier.
>>>
>>> -Steve
>>
>> just to check I understand this fully:
>>
>> in this new scheme, what would this do?
>>
>> auto s = "cassé".representation;
>> foreach(i, c; s) write(i, ':', c, ' ');
>> writeln(s);
>>
>> Currently - without the .representation - I get
>>
>> 0:c 1:a 2:s 3:s 4:e 5:̠6:`
>> cassé
>>
>> or, to spell it out a bit more:
>> 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
>> cassé
>
> The plan is for foreach on s to iterate by char, and foreach on 
> "cassé" to iterate by dchar.
>
> What this means is the accent will be iterated separately from 
> the e, and likely gets put onto the colon after 5. However, the 
> individual code units that have no meaning on their own (xCC 
> and x81) would not be iterated.
>
> In your above code, using .representation would be equivalent 
> to what it is now without .representation (i.e. over char), and 
> without .representation would be equivalent to this on today's 
> compiler (except faster):
>
> foreach(i, dchar c; s)
>
> -Steve
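
To make the char versus dchar distinction concrete, here is a minimal sketch of what today's compiler already does when you pick the foreach element type explicitly. The helper names countCodeUnits and countCodePoints are made up for illustration, and the literal assumes the accent is a separate combining character (U+0301), which is what the xCC x81 bytes above suggest.

```d
import std.stdio : writeln;

// Hypothetical helper: counts UTF-8 code units, i.e. what iterating
// .representation would give under the proposal.
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)   // no decoding, one step per byte
        ++n;
    return n;
}

// Hypothetical helper: counts code points, i.e. what iterating the
// proposed string type itself would give.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)  // the compiler decodes UTF-8 here
        ++n;
    return n;
}

void main()
{
    // Assumption: "cassé" written as 'e' plus combining acute U+0301.
    string s = "casse\u0301";
    writeln(countCodeUnits(s), " code units, ",
            countCodePoints(s), " code points");
}
```

With that literal the first count is 7 and the second is 6: the accent is one code point stored as two code units, and it is still separate from the e in both views.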

Awesome, let's do this :)
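
For anyone following along, here is a rough sketch of what such a type might look like as a library struct today. It is only my reading of the proposal, not Steve's actual design: the name String is hypothetical (the proposal calls it string, which would clash with the current alias), and the choice of std.utf.decode and std.utf.stride for the range primitives is my assumption.

```d
import std.range : isForwardRange, ElementType;
import std.utf : decode, stride;

// Hypothetical stand-in for the proposed type.
struct String
{
    immutable(char)[] representation;

    this(immutable(char)[] data) { representation = data; }

    // dchar range primitives: every element is a decoded code point.
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // throws on invalid UTF-8
    }

    void popFront()
    {
        // Skip all code units of the first code point.
        representation = representation[stride(representation, 0) .. $];
    }

    @property String save() { return this; }
}

static assert(isForwardRange!String);
static assert(is(ElementType!String == dchar));

void main()
{
    // Assumption: the accent is a combining U+0301 after a plain 'e'.
    auto s = String("casse\u0301");

    size_t codePoints = 0;
    foreach (c; s)            // c is a dchar, no annotation needed
        ++codePoints;
    assert(codePoints == 6);

    // The raw code units are still reachable, explicitly.
    assert(s.representation.length == 7);

    // No opIndex, so indexing is rejected at compile time (point 2).
    static assert(!__traits(compiles, s[4]));
}
```

The struct deliberately has no opIndex, length, or alias this, so anything that wants array behaviour has to go through .representation.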