[review] new string type

Wed Dec 1 13:44:42 PST 2010

On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad  
<public at kyllingen.nospamnet> wrote:

> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>
>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>> <jmdavisProg at gmx.com> wrote:
>>
>> [...]
>>
>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>> index operator.
>>
>> Indexing is still O(1).
>>
>>> 5. Slicing (other than a full slice) is no longer O(1), which violates
>>> the
>>> guarantees of the slicing operator.
>>
>> Slicing is still O(1).
>>
>> [...]
>
> It feels extremely weird that the indices refer to code units and not
> code points.  If I write
>
>   auto str = mystring("hæ?");
>   writeln(str[1], " ", str[2]);
>
> I expect it to print "æ ?", not "æ æ" like it does now.

I don't think any implementation can do that without giving up O(1)
indexing, because UTF-8 and UTF-16 are variable-width encodings.  The
only way to get both is to store code points directly, i.e. use dchar[].
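
To make the trade-off concrete, here is a minimal sketch using only the
built-in string types (not the proposed type).  It assumes the source file
is saved as UTF-8 and uses "hæ" as sample data:

```d
import std.stdio : writeln;

void main()
{
    string  s = "hæ";    // UTF-8: 'h' is 1 code unit, 'æ' (U+00E6) is 2
    dstring d = "hæ"d;   // UTF-32: exactly one code unit per code point

    // Indexing a string is O(1) but yields raw code units.
    assert(s[1] == 0xC3);   // first byte of 'æ'
    assert(s[2] == 0xA6);   // second byte of 'æ'

    // Indexing a dstring is O(1) and yields whole code points,
    // at the cost of 4 bytes per character.
    assert(d[1] == 'æ');

    writeln(s.length, " code units vs. ", d.length, " code points");
}
```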

But your point is well taken.  I think what I'm going to do is throw an
exception when the index does not point at the start of a code point.
While also surprising, it doesn't result in "extra data".  I feel it's
probably very rare to access hard-coded indexes like that, or to walk the
characters with an index-based for-loop, unless you are sure of the data
in the string.
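
Roughly, the throwing behavior could look like the sketch below.  This is
not the code of the proposed type: CodePointView is a made-up name, and it
simply leans on std.utf.decode, which throws a UTFException when asked to
decode from the middle of a sequence.  The index is still a code-unit
offset, so the lookup stays O(1):

```d
import std.exception : assertThrown;
import std.utf : decode, UTFException;

// Hypothetical wrapper, for illustration only.
struct CodePointView
{
    string data;

    // i is a code-unit offset.  Returns the code point starting at i,
    // or throws UTFException if i is not the start of a code point.
    dchar opIndex(size_t i)
    {
        size_t pos = i;             // decode advances its index argument
        return decode(data, pos);
    }
}

void main()
{
    auto str = CodePointView("hæ");
    assert(str[0] == 'h');
    assert(str[1] == 'æ');                // offset 1 starts 'æ'
    assertThrown!UTFException(str[2]);    // offset 2 is inside 'æ'
}
```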

> On a side note:  It seems to me that the only reason to have char, wchar,
> and dchar as separate types in the language is that arrays of said types
> are UTF-encoded strings.  If a type such as the proposed one were to
> become the default string type in D, it might as well wrap an array of
> ubyte/ushort/uint, since direct user manipulation of the underlying array
> will generally only happen in the rare cases when one wants to deal
> directly with code units.

I'd still want a char[] array for easy manipulation and eventual  
printing.  Wrapping a ubyte[] with a string just to print would be  
strange.  My goal in this exercise is to try and give control of what to  
deal with (code-points vs. code-units) back to the user.  Right now, the  
library forces you to view them as code-points, and the compiler forces  
you to view them as code-units (except via foreach).
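
A small sketch of that split, using only built-in strings and Phobos
(walkLength is used here just as one example of a range function that sees
a string as code points; the sample string is arbitrary):

```d
import std.range : walkLength;

void main()
{
    string s = "hæ";

    // Compiler view: length and indexing work on code units.
    assert(s.length == 3);

    // Library view: range functions treat a string as code points.
    assert(s.walkLength == 2);

    // foreach is the exception: the element type picks the view.
    size_t units, points;
    foreach (char c; s)  ++units;    // iterates code units
    foreach (dchar c; s) ++points;   // decodes to code points
    assert(units == 3);
    assert(points == 2);
}
```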

But it is an interesting idea (removing small char types).

-Steve