[review] new string type (take 2)

Fri Jan 14 05:06:45 PST 2011

On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk <stevenw at acres.com.au>  
wrote:

> On 14/01/11 02:25, Steven Schveighoffer wrote:
>  > On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw at acres.com.au>
>  > wrote:
>  >
>  >>
>  >> I like the direction you're taking but have some quibbles about
>  >> details. Specifically, I'd go for a more complete separation into
>  >> random-access code-unit ranges and bidirectional code-point ranges:
>  >
>  > Thanks for taking the time. I will respond to your points, but please
>  > make your rebuttals to the new thread I'm about to create with an
>  > updated string type.
>  >
>  >> I don't see a need for _charStart, opIndex, opSlice and codeUnits. If
>  >> the underlying T[] can be returned by a property, then these can be
>  >> done through the code-unit array, which is random-access.
>  >
>  > But that puts extra pain on the user for not much reason. Currently,
>  > strings slice in one operation, you are proposing that we slice in
>  > three operations:
>  >
>  > 1. get the underlying array
>
> myString vs myString.data
>
>  > 2. slice it
>
> Same for both.
>
>  > 3. reconstruct a string based on the slice.
>
> myOtherString = find(myString, 'x');
> vs
> myOtherString = find(myString.data, 'x');
>
> You may see extra pain.  I see extra control.  The user is making it  
> explicit at what level (code-unit/code-point/grapheme/whatever) of range  
> he/she wants the called algorithm to be working on.

Exactly, and that is what my string type allows.  You can work at the
code-point level (and probably the grapheme level, discussion in
progress), or you can work at the code-unit level.  Restricting the user
to the code-unit level alone seems more painful to me, not less.
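
To make the two levels concrete, here is a rough sketch using D's built-in
string and the Phobos range primitives, not the proposed type (whose API is
still under discussion).  It assumes a Phobos in which front on a narrow
string decodes a whole code point; the sample text is made up.

```d
import std.range : front, walkLength;

void main()
{
    // U+00E9 takes two UTF-8 code units, so this string holds
    // 3 code points stored in 4 code units.
    string s = "a\u00E9b";

    // Code-unit view: plain array length and indexing.
    assert(s.length == 4);
    assert(s[1] == 0xC3);          // first unit of the two-unit sequence

    // Code-point view: the range primitives decode as they go.
    assert(walkLength(s) == 3);
    assert(s[1 .. $].front == '\u00E9');
}
```

Both views read the same four code units; the only difference is what
counts as one element.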

>  > Plus, if you remove opIndex, you are restricting the usefulness of the
>  > range. Note that this string type already will decode dchars out of
>  > the front and back, why not just give that ability to the middle of
>  > the string?
>
> Because at the code-point level it *isn't* a random-access range and the  
> index makes no sense at the code-point level, only at the code-unit  
> level.  It's encouraging the confusion of 2 distinctly different  
> abstractions or "views" of the same data.  All the slicing and indexing  
> you're artificially putting in the code-point range is already available  
> in the code-unit range, and its only benefit is to allow the user to  
> save typing ".data".

I respectfully disagree.  A stream built on fixed-size units but with
variable-length elements, where you can find the start of an element in
O(1) time from an arbitrary index, absolutely provides random access.  It
just doesn't provide length.
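
As a sketch of why that lookup is O(1) for UTF-8 (charStart here is a
hypothetical free function written for illustration, not the _charStart
from the proposal): continuation units always have the bit pattern
10xxxxxx, so from any index you step back over at most three of them in
well-formed text.

```d
// Hypothetical helper: back up from an arbitrary code-unit index to the
// first unit of the code point that contains it.  Assumes valid UTF-8.
size_t charStart(const(char)[] s, size_t i)
{
    // Continuation units look like 10xxxxxx; a lead unit does not.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    // Layout in code units: 'a' at 0, U+00E9 at 1..2, U+20AC at 3..5, 'b' at 6.
    string s = "a\u00E9\u20ACb";

    assert(charStart(s, 0) == 0);
    assert(charStart(s, 2) == 1);  // second unit of U+00E9
    assert(charStart(s, 4) == 3);  // middle unit of U+20AC
    assert(charStart(s, 5) == 3);  // last unit of U+20AC
    assert(charStart(s, 6) == 6);
}
```

The loop bound does not depend on the string's length, which is the whole
point: indexing stays cheap even though counting elements does not.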

You are also forgetting the main reason why a string type is better than
the array: it's possible to slice a code-unit array using indexes that
create an invalid range.  With my type it is not possible to do that (it
throws an exception).  We want the basic user to use strings properly (and
to be told about their errors at the site of the error), and if an
advanced user wants more control, they can drop down to the code-unit
level by accessing the data property.
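
A minimal sketch of the difference, again with the built-in string.
checkedSlice and isBoundary are hypothetical stand-ins for what the type's
opSlice would do, and the exception type and message are my assumptions;
validate and UTFException are from std.utf.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// True when index i does not fall inside a multi-unit UTF-8 sequence.
bool isBoundary(string s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

// Hypothetical checked slice: reject bounds that would split a code point.
string checkedSlice(string s, size_t lo, size_t hi)
{
    if (lo > hi || hi > s.length || !isBoundary(s, lo) || !isBoundary(s, hi))
        throw new Exception("slice bounds split a code point");
    return s[lo .. hi];
}

void main()
{
    string s = "a\u00E9b"; // U+00E9 occupies code units 1 and 2

    // The raw array accepts the slice; the damage only shows up later,
    // for example when something validates or decodes the result.
    auto bad = s[0 .. 2];
    assertThrown!UTFException(validate(bad));

    // The checked slice reports the problem at the call site.
    assertThrown!Exception(checkedSlice(s, 0, 2));
    assert(checkedSlice(s, 1, 3) == "\u00E9");
}
```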

> - other Steve

hehe, you can be Steve' :)

-Steve