[review] new string type (take 2)

Sat Jan 15 14:04:13 PST 2011

On 1/15/11 2:32 PM, Steven Schveighoffer wrote:
> On Sat, 15 Jan 2011 13:19:34 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>
>> On 1/15/11 11:15 CST, Steven Schveighoffer wrote:
>>> On Fri, 14 Jan 2011 12:03:28 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> wrote:
>>>
>>>> On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
>>>>> On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk
>>>>> <stevenw at acres.com.au>
>>>>> wrote:
>>>>>
>>>>>> Because at the code-point level it *isn't* a random-access range and
>>>>>> the index makes no sense at the code-point level, only at the
>>>>>> code-unit level. It's encouraging the confusion of 2 distinctly
>>>>>> different abstractions or "views" of the same data. All the slicing
>>>>>> and indexing you're artificially putting in the code-point range is
>>>>>> already available in the code-unit range, and its only benefit is to
>>>>>> allow the user to save typing ".data".
>>>>>
>>>>> I respectfully disagree. A stream built on fixed-sized units, but with
>>>>> variable length elements, where you can determine the start of an
>>>>> element in O(1) time given a random index absolutely provides
>>>>> random-access. It just doesn't provide length.
>>>>
>>>> I equally respectfully disagree. I think random access is defined as
>>>> accessing the ith element in O(1) time. That's not the case here.
>>>
>>> I'd actually go as far as saying that any container you can do a quick
>>> lookup given an index (O(lgn) or better) is random-access. For example,
>>> a dictionary file of english words sorted alphabetically can look up any
>>> word definition in O(lgn) time, but you can't look up the nth word in
>>> O(lgn) time.
>>
>> Here you are trying to equate a dictionary (map/hashtable) with an
>> otherwise unstructured variable length encoded array. I don't think
>> there is much similarity.
>
> I see a lot of similarity. Both have sparsely populated indexes. Both
> cannot determine their length unless you save it separately. Both are
> encoded on an array of fixed-length elements. Both store their data
> sorted by index.
>
> The difference is that the index is a size_t for one and a string for
> the other.

I think this is way off. We're comparing a highly structured topology 
with a simple linear encoding.

>>> What do you call this type of container if not random access? Indexed
>>> maybe? Whatever term you call it, I'd call it that, and say opIndex
>>> belongs for that container.
>>>
>>> -Steve
>>
>> I'm not sure how you can call it, but it's definitely not random access.
>
> The former complaint was "because this != random access, opIndex doesn't
> belong."
>
> If we can get past the terminology of what this is, I think at least
> opSlice belongs and I still think opIndex belongs. Can you not see the
> value of 'give me the element at index X' giving you back a full element
> and not just a code-unit?

I do see the value. I think it's essential to reckon that (1) that is 
not random access, (2) that allows people to write wrong code because 
they believe they have random access.

> If opIndex doesn't belong unless it's a random access container (i.e.
> array), then we should get rid of the operator for hashes and sorted trees.

The existence of opIndex for hashes and ordered trees is not being 
contended, and throws red herrings into the discussion.

It's all very simple. At the end of the day, a string type must make a 
Boolean decision: the user is aware of an underlying variable-length 
encoding, or not.

Allow me to enumerate the ways I know that hide the variable encoding 
100% from the user:

1. Expose a bidirectional range.

Now, for a variety of reasons, it is efficient to have access to the 
index in the supporting structure for the encoding. That's great, but it 
should be understood that offering such access allows people who don't 
understand the underlying variable-length encoding to write wrong code.

Bottom line: you can't abstract away the variable-length encoding and 
offer indexes at the same time.

Andrei