[review] new string type

Lars T. Kyllingstad public at kyllingen.NOSPAMnet
Wed Dec 1 23:09:51 PST 2010


On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

> On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
> <public at kyllingen.nospamnet> wrote:
> 
>> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>>
>>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>>> <jmdavisProg at gmx.com> wrote:
>>>
>>> [...]
>>>
>>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>>> index operator.
>>>
>>> Indexing is still O(1).
>>>
>>>> 5. Slicing (other than a full slice) is no longer O(1), which
>>>> violates the guarantees of the slicing operator.
>>>
>>> Slicing is still O(1).
>>>
>>> [...]
>>
>> It feels extremely weird that the indices refer to code units and not
>> code points.  If I write
>>
>>   auto str = mystring("hæ?");
>>   writeln(str[1], " ", str[2]);
>>
>> I expect it to print "æ ?", not "æ æ" like it does now.
> 
> I don't think it's possible to do that with any implementation without
> making indexing not O(1).  This just isn't possible, unless you want to
> use dchar[].
> 
> But your point is well taken.  I think what I'm going to do is throw an
> exception when accessing an invalid index.  While also surprising, it
> doesn't result in "extra data".  I feel it's probably very rare to just
> access hard-coded indexes like that unless you are sure of the data in
> the string.  Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a
random-access range, something which narrow strings are not.  In fact,
the distinction between random access and bidirectional range access
for strings is in many ways the reason we're having this discussion.
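
To make the distinction concrete, here is a small sketch using the range
traits in std.range.  It only checks how Phobos classifies the built-in
string types; it says nothing about the proposed string type itself.

```d
import std.range : isBidirectionalRange, isRandomAccessRange;

// Narrow strings (UTF-8, UTF-16) are bidirectional ranges of code points,
// but not random-access ranges.
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);

// A dstring holds one code point per element, so it is random access.
static assert( isRandomAccessRange!dstring);

void main() {}
```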

How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead
adding a characterAt(i) function that retrieves the i-th code point, and
which is not required to be O(1)?  Then, anyone who wants O(1) indexing
is forced to use string_t!dchar or just plain ol' arrays, both of
which have clear, predictable indexing semantics.
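
A rough sketch of what such a function could look like, written here as
a free function over a plain string rather than as a member of the
proposed type.  The name characterAt and the use of enforce for bounds
errors are only assumptions for illustration.  It decodes from the start
of the string, so the cost grows linearly with the index.

```d
import std.exception : enforce;
import std.utf : decode;

// Hypothetical helper: returns the i-th code point of a UTF-8 string.
// It walks the string from the beginning, so it is O(n), not O(1).
dchar characterAt(string s, size_t i)
{
    size_t pos = 0;                 // current offset in code units
    foreach (_; 0 .. i)
    {
        enforce(pos < s.length, "code point index out of range");
        decode(s, pos);             // skips one code point, advancing pos
    }
    enforce(pos < s.length, "code point index out of range");
    return decode(s, pos);          // decodes the code point starting at pos
}

void main()
{
    import std.stdio : writeln;

    auto s = "hæ?";
    // Indices count code points here, not code units.
    writeln(characterAt(s, 1), " ", characterAt(s, 2));
}
```

With something like this, the indices in my earlier example refer to
code points, and anyone calling it can see from the name that it is not
an ordinary O(1) index operation.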

I think it's great that you're doing this, by the way!  I haven't made up 
my mind yet about whether I want char[] or a separate string type, but it 
is great to have an actual implementation of the latter at hand when 
debating it.

-Lars

