Go and generic programming on reddit, also touches on D

Mon Sep 19 08:03:15 PDT 2011

On 09/19/2011 04:43 PM, Steven Schveighoffer wrote:
> On Mon, 19 Sep 2011 10:24:33 -0400, Timon Gehr <timon.gehr at gmx.ch> wrote:
>
>> On 09/19/2011 04:02 PM, Steven Schveighoffer wrote:
>>>
>>> So I think it's not only limiting to require x.length to be $, it's very
>>> wrong in some cases.
>>>
>>> Also, think of a string. It has no length (well technically, it does,
>>> but it's not the number of elements), but it has a distinct end point. A
>>> properly written string type would fail to compile if $ was s.length.
>>>
>>
>> But you'd have to compute the length anyways in the general case:
>>
>> str[0..$/2];
>>
>> Or am I misunderstanding something?
>>
>
> That's half the string in code units, not code points.
>
> If string was properly implemented, this would fail to compile. $ is not
> the length of the string range (meaning the number of code points). The
> given slice operation might actually create an invalid string.

Programmers have to be aware of that if they want efficient code that 
deals with Unicode. I think having random access to the code units and 
being able to iterate per code point is fine, because it gives you the 
best of both worlds. Manually decoding a string and slicing it at 
positions that were remembered to be safe has been good enough for me; 
at least it is efficient.
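
A minimal sketch of that approach, assuming std.utf.decode from Phobos 
(codeUnitIndexOf is a hypothetical helper name, not a library function). 
It decodes forward once and returns a code-unit index that is known to 
sit on a code point boundary, so slicing there is safe:

```d
import std.utf : decode;

// Hypothetical helper: returns the code-unit index at which the n-th
// code point of s starts, or s.length if s has fewer than n code points.
size_t codeUnitIndexOf(string s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        decode(s, i); // decodes one code point and advances i past it
        --n;
    }
    return i;
}
```

With that, s[0 .. codeUnitIndexOf(s, 2)] is an O(1) slice holding the 
first two code points, whereas a blind s[0 .. $/2] may cut a multi-byte 
sequence in half.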

>
> It's tricky, because you want fast slicing, but only certain slices are
> valid. I once created a string type that used a char[] as its backing,
> but actually implemented the limitations that std.range tries to enforce
> (but cannot). It's somewhat of a compromise. If $ was mapped to
> s.length, it would fail to compile, but I'm not sure what I *would* use
> for $. It actually might be the code units, which would not make the
> above line invalid.
>
> -Steve

Well, it would have to be consistent for a string type that "does it 
right". Either the string is indexed with code units or it is indexed 
with code points, and the other option should be provided. Dollar should 
just be the length of what is used for indexing/slicing here, and having 
that be different from length makes for a somewhat awkward interface imho.
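
As a rough illustration of the consistent variant (a sketch only; 
UnitString and codePointCount are made-up names, and this is not the 
type Steve describes), here is a wrapper that indexes by code unit, so 
that $ and length agree, and offers the code point view separately:

```d
import std.range : walkLength;

// Hypothetical wrapper: indexing, slicing, length and $ all count
// UTF-8 code units; code points are available through a separate call.
struct UnitString
{
    string data;

    // Indexing and slicing are by code unit.
    char opIndex(size_t i) { return data[i]; }
    UnitString opSlice(size_t lo, size_t hi) { return UnitString(data[lo .. hi]); }

    // length and $ use the same unit as indexing.
    @property size_t length() { return data.length; }
    size_t opDollar() { return length; }

    // The other view: number of code points (decodes the whole string).
    size_t codePointCount() { return data.walkLength; }
}
```

Here s[0 .. $/2] compiles and means "half the code units", which can 
still split a multi-byte sequence; the type only guarantees that $ and 
length are the same number.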

Btw, D double-quoted string literals let you define invalid byte 
sequences with e.g. octal escapes:
string s="\377";

What would be use cases for that? Shouldn't \377 map to the extended 
ASCII charset instead and yield the same code point that would be given 
in C double-quoted strings?
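
To make the difference concrete, a small sketch assuming 
std.utf.validate from Phobos (isValidUtf8 is a hypothetical helper 
name). It contrasts the raw code unit written as "\377" with the code 
point U+00FF written as "\u00FF":

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// One raw code unit with value 0xFF; not a valid UTF-8 sequence.
immutable string rawByte = "\377";

// The code point U+00FF, stored as two UTF-8 code units.
immutable string codePoint = "\u00FF";
```

So the octal escape inserts a code unit verbatim, while \u inserts an 
encoded code point; only the latter passes validation.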