[review] new string type

Fri Dec 3 11:52:51 PST 2010

On Friday, December 03, 2010 05:13:50 Lars T. Kyllingstad wrote:
> On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:
> > On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
> > <public at kyllingen.nospamnet> wrote:
> >> On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
> >>> On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
> >>> <public at kyllingen.nospamnet> wrote:
> >>>> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
> >>>>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
> >>>>> <jmdavisProg at gmx.com> wrote:
> >>>>> 
> >>>>> [...]
> >>>>> 
> >>>>>> 4. Indexing is no longer O(1), which violates the guarantees of the
> >>>>>> index operator.
> >>>>> 
> >>>>> Indexing is still O(1).
> >>>>> 
> >>>>>> 5. Slicing (other than a full slice) is no longer O(1), which
> >>>>>> violates the guarantees of the slicing operator.
> >>>>> 
> >>>>> Slicing is still O(1).
> >>>>> 
> >>>>> [...]
> >>>> 
> >>>> It feels extremely weird that the indices refer to code units and not
> >>>> code points.  If I write
> >>>> 
> >>>>   auto str = mystring("hæ?");
> >>>>   writeln(str[1], " ", str[2]);
> >>>> 
> >>>> I expect it to print "æ ?", not "æ æ" like it does now.
> >>> 
> >>> I don't think it's possible to do that with any implementation without
> >>> making indexing not O(1).  This just isn't possible, unless you want
> >>> to use dchar[].
> >>> 
> >>> But your point is well taken.  I think what I'm going to do is throw
> >>> an exception when accessing an invalid index.  While also surprising,
> >>> it doesn't result in "extra data".  I feel it's probably very rare to
> >>> just access hard-coded indexes like that unless you are sure of the
> >>> data in the string.  Or to use a for-loop to access characters, etc.
> >> 
> >> As soon as you add opIndex(), your interface becomes that of a random-
> >> access range, something which narrow strings are not.  In fact, the
> >> distinction between random access and bidirectional range access for
> >> strings is in many ways the reason we're having this discussion.
> >> 
> >> How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead
> >> adding a characterAt(i) function that retrieves the i'th code point,
> >> and which is not required to be O(1)?  Then, if someone wants O(1)
> >> indexing they are forced to use string_t!dchar or just plain ol'
> >> arrays, both of which have clear, predictable indexing semantics.
> > 
> > Then substring (slicing) becomes an O(n) operation.  It just doesn't
> > work well.
> 
> What I meant was that opSlice() should be disabled in the same way as
> opIndex().

A string type without slicing (which must be O(1)) is DOA, without question.
Slicing is _far_ too useful to lose. Indexing into strings is fairly rare because
it's generally a stupid idea, but slicing happens all the time. If nothing else,
slicing is _the_ way to get a substring.
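
As a rough sketch of what that looks like with a plain built-in string (not
the proposed string type): you find a code-unit offset once, then slice at it
without copying or decoding. The helper name firstWord is made up for
illustration, and the example string just extends the "hæ?" from earlier in
the thread.

```d
import std.string : indexOf;
import std.utf : stride;

// Hypothetical helper: returns the text before the first space, or all of s.
string firstWord(string s)
{
    auto i = s.indexOf(' ');          // O(n) search yields a code-unit offset
    return i < 0 ? s : s[0 .. i];     // O(1) slice: no copy, no decoding
}

void main()
{
    string s = "hæ? world";
    assert(s.length == 10);           // length counts UTF-8 code units
    assert(stride(s, 1) == 2);        // 'æ' spans two code units, s[1] and s[2]

    string w = firstWord(s);
    assert(w == "hæ?");
    assert(w.ptr is s.ptr);           // the slice aliases the original buffer
}
```

The search is linear, but the slice itself is constant time, and since the
offset comes from a search rather than a hard-coded number it always lands on
a code point boundary.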

- Jonathan M Davis