Proposal for fixing dchar ranges
John Colvin
john.loughran.colvin at gmail.com
Tue Mar 11 01:57:43 PDT 2014
On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer
wrote:
> On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin
> <john.loughran.colvin at gmail.com> wrote:
>
>> On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer
>> wrote:
>>> I proposed this inside the long "major performance problem
>>> with std.array.front" thread, and I've also proposed it
>>> before, a long time ago.
>>>
>>> But it seems to be getting no attention buried in that thread,
>>> not even negative attention :)
>>>
>>> An idea to fix all the problems I see with char[] being
>>> treated specially by phobos: introduce an actual string type,
>>> with char[] as backing, that is a dchar range, that actually
>>> dictates the rules we want. Then, make the compiler use this
>>> type for literals.
>>>
>>> e.g.:
>>>
>>> struct string {
>>> immutable(char)[] representation;
>>> this(immutable(char)[] data) { representation = data; }
>>> ... // dchar range primitives
>>> }
>>>
>>> Then, a char[] array is simply an array of char.
>>>
>>> points:
>>>
>>> 1. No more issues with foreach(c; "cassé"), it iterates via
>>> dchar
>>> 2. No more issues with "cassé"[4], it is a static compiler
>>> error.
>>> 3. No more awkward ASCII manipulation using ubyte[].
>>> 4. No more phobos schizophrenia saying char[] is not an array.
>>> 5. No more special casing char[] array templates to fool the
>>> compiler.
>>> 6. Any other special rules we come up with can be dictated by
>>> the library, and not ignored by the compiler.
>>>
>>> Note, std.algorithm.copy(string1, mutablestring) will still
>>> decode/encode, but it's more explicit. It's EXPLICITLY a
>>> dchar range. Using std.algorithm.copy(string1.representation,
>>> mutablestring.representation) will avoid the issues.
>>>
>>> I imagine only code that is currently UTF ignorant will
>>> break, and that code is easily 'fixed' by adding the
>>> 'representation' qualifier.
>>>
>>> -Steve
>>
>> just to check I understand this fully:
>>
>> in this new scheme, what would this do?
>>
>> auto s = "cassé".representation;
>> foreach(i, c; s) write(i, ':', c, ' ');
>> writeln(s);
>>
>> Currently - without the .representation - I get
>>
>> 0:c 1:a 2:s 3:s 4:e 5:̠6:`
>> cassé
>>
>> or, to spell it out a bit more:
>> 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
>> cassé
>
> The plan is for foreach on s to iterate by char, and foreach on
> "cassé" to iterate by dchar.
>
> What this means is the accent will be iterated separately from
> the e, and likely gets put onto the colon after 5. However, the
> individual code units that have no meaning on their own (xCC
> and x81) would not be iterated.
>
> In your above code, using .representation would be equivalent
> to what it is now without .representation (i.e. over char), and
> without .representation would be equivalent to this on today's
> compiler (except faster):
>
> foreach(i, dchar c; s)
>
> -Steve
Awesome, let's do this :)
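
For anyone skimming the thread, here is a rough sketch of how I read the proposal. It is only an illustration under my own assumptions, not Steve's actual design: the name String, the use of std.utf.decode and std.utf.stride for the range primitives, and the demo in main are all mine. The literal spells "cassé" with a combining accent (U+0301), matching the byte dump above.

```d
import std.range : walkLength;
import std.stdio : writef, writeln;
import std.utf : decode, stride;

// Hypothetical wrapper type; named String here so that it does not
// shadow the built-in string alias.
struct String
{
    immutable(char)[] representation;

    this(immutable(char)[] data) { representation = data; }

    // dchar range primitives: decode one code point at a time.
    @property bool empty() const { return representation.length == 0; }

    @property dchar front()
    {
        size_t index = 0;
        return decode(representation, index);
    }

    void popFront()
    {
        // Skip all the code units that make up the first code point.
        representation = representation[stride(representation, 0) .. $];
    }
}

void main()
{
    // "cassé" written as 'e' followed by a combining acute accent.
    auto s = String("casse\u0301");

    // Iterating the wrapper yields code points (dchar).
    foreach (dchar c; s)
        writef("U+%04X ", cast(uint) c);
    writeln();

    // Iterating .representation yields the raw UTF-8 code units.
    foreach (char c; s.representation)
        writef("%02X ", cast(ubyte) c);
    writeln();

    assert(walkLength(s) == 6);           // six code points
    assert(s.representation.length == 7); // seven code units
}
```

The point of the sketch is that the decoding lives in the library type, so char[] itself can go back to being a plain array of char.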