[review] new string type

Thu Dec 2 13:18:52 PST 2010

On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad  
<public at kyllingen.nospamnet> wrote:

> On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
>
>> On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
>> <public at kyllingen.nospamnet> wrote:
>>
>>> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>>>
>>>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>>>> <jmdavisProg at gmx.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>>>> index operator.
>>>>
>>>> Indexing is still O(1).
>>>>
>>>>> 5. Slicing (other than a full slice) is no longer O(1), which
>>>>> violates the
>>>>> guarantees of the slicing operator.
>>>>
>>>> Slicing is still O(1).
>>>>
>>>> [...]
>>>
>>> It feels extremely weird that the indices refer to code units and not
>>> code points.  If I write
>>>
>>>   auto str = mystring("hæ?");
>>>   writeln(str[1], " ", str[2]);
>>>
>>> I expect it to print "æ ?", not "æ æ" like it does now.
>>
>> I don't think it's possible to do that with any implementation without
>> making indexing not O(1).  This just isn't possible, unless you want to
>> use dchar[].
>>
>> But your point is well taken.  I think what I'm going to do is throw an
>> exception when accessing an invalid index.  While also surprising, it
>> doesn't result in "extra data".  I feel it's probably very rare to just
>> access hard-coded indexes like that unless you are sure of the data in
>> the string.  Or to use a for-loop to access characters, etc.
>
> As soon as you add opIndex(), your interface becomes that of a random-
> access range, something which narrow strings are not.  In fact, the
> distinction between random access and bidirectional range access for
> strings is in many ways the reason we're having this discussion.
>
> How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead
> adding a characterAt(i) function that retrieves the i'th code point, and
> which is not required to be O(1)?  Then, if someone wants O(1) indexing
> they are forced to use string_t!dchar or just plain ol' arrays, both of
> which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation.  It just doesn't work  
well.  It seems to be awkward at first thought, but the more I think about  
it, the more I think it's right.  When do you ever depend on specific  
indexes in a string being valid, or to be incrementing always by 1?

> I think it's great that you're doing this, by the way!  I haven't made up
> my mind yet about whether I want char[] or a separate string type, but it
> is great to have an actual implementation of the latter at hand when
> debating it.

Thanks :)

-Steve