[review] new string type

Fri Dec 3 05:59:52 PST 2010

On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad  
<public at kyllingen.nospamnet> wrote:

> On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:
>
>> On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
>> <public at kyllingen.nospamnet> wrote:
>>
>>> On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
>>>
>>>> On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
>>>> <public at kyllingen.nospamnet> wrote:
>>>>
>>>>> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>>>>>
>>>>>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>>>>>> <jmdavisProg at gmx.com> wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>>>>>> index operator.
>>>>>>
>>>>>> Indexing is still O(1).
>>>>>>
>>>>>>> 5. Slicing (other than a full slice) is no longer O(1), which
>>>>>>> violates the
>>>>>>> guarantees of the slicing operator.
>>>>>>
>>>>>> Slicing is still O(1).
>>>>>>
>>>>>> [...]
>>>>>
>>>>> It feels extremely weird that the indices refer to code units and not
>>>>> code points.  If I write
>>>>>
>>>>>   auto str = mystring("hæ?");
>>>>>   writeln(str[1], " ", str[2]);
>>>>>
>>>>> I expect it to print "æ ?", not "æ æ" like it does now.
>>>>
>>>> I don't think it's possible to do that with any implementation without
>>>> making indexing not O(1).  This just isn't possible, unless you want
>>>> to use dchar[].
>>>>
>>>> But your point is well taken.  I think what I'm going to do is throw
>>>> an exception when accessing an invalid index.  While also surprising,
>>>> it doesn't result in "extra data".  I feel it's probably very rare to
>>>> just access hard-coded indexes like that unless you are sure of the
>>>> data in the string.  Or to use a for-loop to access characters, etc.
>>>
>>> As soon as you add opIndex(), your interface becomes that of a random-
>>> access range, something which narrow strings are not.  In fact, the
>>> distinction between random access and bidirectional range access for
>>> strings is in many ways the reason we're having this discussion.
>>>
>>> How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead
>>> adding a characterAt(i) function that retrieves the i'th code point,
>>> and which is not required to be O(1)?  Then, if someone wants O(1)
>>> indexing they are forced to use string_t!dchar or just plain ol'
>>> arrays, both of which have clear, predictable indexing semantics.
>>
>> Then substring (slicing) becomes an O(n) operation.  It just doesn't
>> work well.
>
> What I meant wast that opSlice() should be disabled in the same way as
> opIndex().  What you have now is
>
>     struct string_t(T)
>     {
>         dchar opIndex(size_t);               // Must be O(1)
>         string_t opSlice(size_t, size_t);    // Must be O(1)
>     }
>
> and what I'm suggesting is
>
>     struct string_t(T)
>     {
>         dchar character(size_t);             // May be O(n)
>         string_t substring(size_t, size_t);  // May be O(n)
>
>         static if (is(T == dchar))
>         {
>             alias character opIndex;         // Must be O(1)
>             alias substring opSlice;         // Must be O(1)
>         }
>     }

O(n) slicing and indexing, called by different names, is unacceptable.  It  
is possible to have O(1) indexing and slicing, I don't see why we should  
disallow it.  If nothing else, it cripples this implementation to the  
point where it won't be accepted, simply because it performs worse than  
the current solution.

Now, if one wants to have that feature in *addition* to O(1)  
indexing/slicing, I don't think that's an issue.

>
>> It seems to be awkward at first thought, but the more I
>> think about it, the more I think it's right.  When do you ever depend on
>> specific indexes in a string being valid, or to be incrementing always
>> by 1?
>
> That is a good question indeed.  But I'm still not convinced. :)
>
> Another thing to consider is that by having opIndex() in there, your
> string satisfies isRandomAccessRange.  Then, some algorithms which work
> with both bidirectional and random access may choose to go with the
> latter.  This is a quote from the std.algorithm.find() docs:
>
>     Specializations taking advantage of bidirectional or
>     random access (where present) may accelerate search [...]

I just looked it up, isRandomAccess requires hasLength, which my string  
type does not.  So my string type is not a true random access range (this  
was intentional).

-Steve