[review] new string type
Steven Schveighoffer
schveiguy at yahoo.com
Thu Dec 2 13:18:52 PST 2010
On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
<public at kyllingen.nospamnet> wrote:
> On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
>
>> On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
>> <public at kyllingen.nospamnet> wrote:
>>
>>> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>>>
>>>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>>>> <jmdavisProg at gmx.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>>>> index operator.
>>>>
>>>> Indexing is still O(1).
>>>>
>>>>> 5. Slicing (other than a full slice) is no longer O(1), which
>>>>> violates the
>>>>> guarantees of the slicing operator.
>>>>
>>>> Slicing is still O(1).
>>>>
>>>> [...]
>>>
>>> It feels extremely weird that the indices refer to code units and not
>>> code points. If I write
>>>
>>> auto str = mystring("hæ?");
>>> writeln(str[1], " ", str[2]);
>>>
>>> I expect it to print "æ ?", not "æ æ" like it does now.
>>
>> I don't think it's possible to do that with any implementation without
>> making indexing not O(1). This just isn't possible, unless you want to
>> use dchar[].
>>
>> But your point is well taken. I think what I'm going to do is throw an
>> exception when accessing an invalid index. While also surprising, it
>> doesn't result in "extra data". I feel it's probably very rare to just
>> access hard-coded indexes like that unless you are sure of the data in
>> the string. Or to use a for-loop to access characters, etc.
>
> As soon as you add opIndex(), your interface becomes that of a random-
> access range, something which narrow strings are not. In fact, the
> distinction between random access and bidirectional range access for
> strings is in many ways the reason we're having this discussion.
>
> How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead
> adding a characterAt(i) function that retrieves the i'th code point, and
> which is not required to be O(1)? Then, if someone wants O(1) indexing
> they are forced to use string_t!dchar or just plain ol' arrays, both of
> which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation, which just doesn't work
well. Indexing by code unit seems awkward at first, but the more I think
about it, the more I think it's right. When do you ever depend on specific
indexes in a string being valid, or on indexes always incrementing by 1?

> I think it's great that you're doing this, by the way! I haven't made up
> my mind yet about whether I want char[] or a separate string type, but it
> is great to have an actual implementation of the latter at hand when
> debating it.

Thanks :)

-Steve
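
To make the trade-off above concrete, here is a minimal sketch in D of what code-point-based access costs on a UTF-8 string. It is not taken from the string type under review: codeUnitOffset and sliceByCodePoint are hypothetical helpers, and characterAt is only one guess at how the function proposed in the quoted message might be written. Only std.utf.stride and std.utf.decode are Phobos functions.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical helper: code unit offset of the i'th code point, found by
// walking from the start of the string, so the cost grows with i (O(n)).
size_t codeUnitOffset(string s, size_t i)
{
    size_t pos = 0;
    foreach (_; 0 .. i)
        pos += stride(s, pos); // code units used by the code point at pos
    return pos;
}

// Hypothetical characterAt, as proposed in the thread: O(n), not O(1).
dchar characterAt(string s, size_t i)
{
    size_t pos = codeUnitOffset(s, i);
    return decode(s, pos); // decodes one code point and advances pos
}

// Slicing by code point index needs the same linear walk for both ends;
// the final slice on code unit offsets is itself O(1).
string sliceByCodePoint(string s, size_t from, size_t to)
{
    return s[codeUnitOffset(s, from) .. codeUnitOffset(s, to)];
}

void main()
{
    string s = "hæ!";                    // 3 code points, 4 UTF-8 code units
    writeln(s.length);                   // 4
    writeln(characterAt(s, 2));          // !
    writeln(sliceByCodePoint(s, 1, 3));  // æ!
}
```

Slicing with code unit offsets stays O(1), but those offsets are not guaranteed to land on a code point boundary, which is the case the invalid-index exception discussed earlier would have to handle.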