Why Strings as Classes?

Fri Aug 29 15:12:17 PDT 2008

 "Steven Schveighoffer" <schveiguy at yahoo.com> wrote in message 
news:g983f5$2ns2$1 at digitalmars.com...
> "Nick Sabalausky" wrote
>> Ok, so you want foo() to be able to tell if the collection has fast or 
>> slow indexing. What are you suggesting that foo() does when the 
>> collection does have slow indexing?
>
> No, I don't want to be able to tell.  I don't want to HAVE to be able to 
> tell.

You're missing the point. Since, as you say below, you want foo to not be 
callable with the collection since it doesn't implement opIndex, your answer 
is clearly "#1, The program should fail to compile because foo's 
implementation uses [] and the slow-indexing collection doesn't implement 
[]".

> In my ideal world, the collection does not implement opIndex unless it is 
> fast, so there is no issue.  i.e. you cannot call foo with a linked list.
>
> I'm really tired of this argument, you understand my point of view, I 
> understand yours.

..(line split for clarity)..
> To you, the syntax sugar is more important than the complexity guarantees.

Not at all. And to that effect, I've already presented a way that we can 
have both syntactic sugar and, when desired, complexity guarantees. In fact, 
the method I presented actually provides more protection against poor 
complexity than your method (Since the guarantee doesn't break when faced 
with code from people with my viewpoint on [], which as you admit below is 
neither more right nor more wrong than your viewpoint on []). Just because I 
don't agree with your method of implementing complexity guarantees, doesn't 
mean I don't think they can be valuable.

> To me, what the syntax intuitively means should be what it does.

I absolutely agree that "What the syntax intuitively means should be what it 
does". Where we disagree is on "what the [] syntax intuitively means".

> So I'll develop my collections library and you develop yours, fair enough? 
> I don't think either of us is right or wrong in the strict sense of the 
> terms.
>
> To be fair, I'll answer your other points as you took the time to write 
> them.  And then I'm done.  I can't really be any clearer as to what I 
> believe is the best design.
>
>> 1. Should it fail to compile because foo's implementation uses [] and the 
>> slow-indexing collection doesn't implement []?
>
> No, foo will always compile because opIndex should always be fast, and 
> then I can specify the complexity of foo without worry.
>
> Using an O(n) lookup operation should be more painful because it requires 
> more time.  It makes users use it less.
>
>> 2. Should foo revert to an alternate branch of code that doesn't use []?
>>
>> This behavior can be implemented via interfaces like I described. The 
>> benefit of that is that [] can still serve as the shorthand it's intended 
>> for (see below) and you never need to introduce the inconsistency of 
>> "Gee, how do I get the Nth element of a collection?" "Well, on some 
>> collections it's getNth(), and on other collections it's []."
>
> I believe that you shouldn't really ever be calling getNth on a link-list, 
> and if you are, it should be a red flag, like a cast.
>
> Furthermore [] isn't always equivalent to getNth, see below.
>

Addressed below...

>>>> As for the risk that could create of accidentially sending a linked 
>>>> list to a "search" (ie, a "search for an element which contains data 
>>>> X") that uses [] internally instead of iterators (but then, why 
>>>> wouldn't it just use iterators anyway?): I'll agree that in a case like 
>>>> this there should be some mechanism for automatic choosing of an 
>>>> algorithm, but that mechanism should be at a separate level of 
>>>> abstraction. There would be a function "search" that, through either 
>>>> RTTI or template constraints or something else, says "does collection 
>>>> 'c' implement ConstantTimeForewardDirectionIndexing?" or better yet IMO 
>>>> "does the collection have attribute ForewardDirectionIndexingComplexity 
>>>> that is set equal to Complexity.Constant?", and based on that passes 
>>>> control to either IndexingSearch or IteratorSearch.
>>>
>>> To me, this is a bad design.  It's my opinion, but one that is shared 
>>> among many people.  You can do stuff this way, but it is not intuitive. 
>>> I'd much rather reserve opIndex to only quick lookups, and avoid the 
>>> possibility of accidentally using it incorrectly.
>>>
>>
>> Preventing a collection from ever being used in a function that would 
>> typically perform poorly on that collection just smacks of premature 
>> optimization. How do you, as the collection author, know that the 
>> collection will never be used in a way such that *occasional* use in 
>> certain specific sub-optimal a manner might actually be necessary and/or 
>> acceptable?
>
> It's not premature optimization, it's not offering a feature that has 
> little or no use.  It's like any contract for any object, you only want to 
> define the interface for which your object is designed.  A linked list 
> should not have an opIndex because it's not designed to be indexed.
>

Addressed below...

> If I designed a new car with which you could steer each front wheel 
> independently, would that make you buy it?  It's another feature that the 
> car has that other cars don't.  Who cares if it's useful, its another 
> *feature*!  Sometimes a good design is not that a feature is included but 
> that a feature is *not* included.
>

So, in other words, it sounds like you're saying that in my scenario above, 
you think that a linked list should not be usable, even if it is faster in 
the greater context (Without actually saying so directly). Or do you claim 
that the scenario can never happen?

>> If you omit [] then you've burnt the bridge (so to speak) and your only 
>> recourse is to add a standardized "getNth()" to every single collection 
>> which clutters the interface, hinders integration with third-party 
>> collections and algorithms, and is likely to still suffer from idiots who 
>> think that "get Nth element" is always better than O(n) (see below).
>
> I'd reserve getNth for linked lists only, if I implemented it at all.  It 
> is a useless feature.  The only common feature for all containers should 
> be iteration, because 'iterate next element' is always an O(1) operation 
> (amortized in the case of trees).
>
>>> In general, I'd say if you are using lists and frequently looking up the 
>>> nth value in the list, you have chosen the wrong container for the job.
>>>
>>
>> If you're frequently looking up random elements in a list, then yes, 
>> you're probably using the wrong container. But that's beside the point. 
>> Even if you only do it once: If you have a collection with a natural 
>> order, and you want to get the nth element, you should be able to use the 
>> standard "get element at index X" notation, [].
>
> I respectfully disagree.  For the reasons I've stated above.
>
>> I don't care how many people go around using [] and thinking they're 
>> guaranteed to get a cheap computation from it. In a language that 
>> supports overloading of [], the [] means "get the element at key/index 
>> X". Especially in a language like D where using [] on an associative 
>> array can trigger an unbounded allocation and GC run. Using [] in D (and 
>> various other languages) can be expensive, period, even in the standard 
>> lib (assoc array). So looking at a [] and thinking "guaranteed cheap", is 
>> incorrect, period. If most people think 2+2=5, you're not going to 
>> redesign arithmetic to work around that mistaken assumption.
>
> Your assumption is that 'get the Nth element' is the only expectation for 
> opIndex interface.  My assumption is that opIndex implies 'get an element 
> efficiently' is an important part of the interface.  We obviously 
> disagree, and as I said above, neither of us is right or wrong, strictly 
> speaking. It's a matter of what is intuitive to you.
>
> Part of the problems I see with many bad designs is the author thinks they 
> see a fit for an interface, but it's not quite there.  They are so excited 
> about fitting into an interface that they forget the importance of leaving 
> out elements of the interface that don't make sense.  To me this is one of 
> them.  An interface is a fit IMO if it fits exactly.  If you have to do 
> things like implement functions that throw exceptions because they don't 
> belong, or break the contract that the interface specifies, then either 
> the interface is too specific, or you are not implementing the correct 
> interface.
>

(From the above "Addressed below..."'s)

I fully agree that leaving the wrong things out of an interface is just as 
important as putting the right things in. But I don't think that's 
applicable here.

An array can do anything a linked list can do (even insert). A linked list 
can do anything an array can do (even sort). They are both capable of the 
same exact set of basic operations: insert, delete, get at position, get 
position of, append, iterate, etc). The only thing that ever differs is how 
well each type of collection scales on each of those basic operations. The 
*whole point* of having both arrays and linked lists is that they provide 
different performance tradeoffs, not that they "implement different 
interfaces", because obviously they're all capable of doing the same things. 
It's the performance tradeoffs that are the whole point of "array vs linked 
list". But it's rarely as simple as just looking at the basic operations 
individually...

Its rare that a collection would ever be used for just one basic operation. 
What's the point sorting a collection if you're never going to insert 
anything into it? What's the point of inserting data if you're never going 
to retrieve any? In most cases, you're going to be doing multiple types of 
operations on the collection, therefore the choice of collection becomes 
"Which set of tradeoffs are the most worthwhile for my overall usage 
patters?"

You can speculate and analyze all you want about the usage patterns and the 
appropriate tradeoffs, and that's good, you certainly should. But it 
ultimately comes down to the real word tests: profiling. And if you're 
profiling, you're going to want to compare the performance of different 
types of collections. And if you're going to do that, why should you prevent 
yourself from making it a one-line change ("Vector myBunchOfStuff" <-> "List 
myBunchOfStuff"), just because the fear of someone using an array for an 
insert-intensive purpose, or a list for a random-access-intensive purpose, 
drove you to design your code in a way that forces a single change of type 
to (in many cases) be an all-out refactoring - and it'll be the type of 
refactoring that no automatic refactoring tool is going to do for you.

And suppose you do successfully find that optimal container, through your 
method or mine. Then a program feature/requirement is changed/added/removed, 
and all of a sudden, the usage patterns have changed! Now you get to do it 
all again! Major refactor then profile or change a line then profile?

You're looking at guaranteeing the performance of very narrow slices of a 
program. I'll agree that can be useful in some cases (hence, my proposal for 
how to implement performance guarantees). But in many cases, that's 
effectively a "taken out of context" fallacy and can lead to trouble.

>>>> If you've got a linked list, and you want to get element N, are you 
>>>> *really* going to go reaching for a function named "search"? How often 
>>>> do you really see a generic function named "search" or "find" that 
>>>> takes a numeric index as a the "to be found" parameter instead of 
>>>> something to be matched against the element's value? I would argue that 
>>>> that would be confusing for most people. Like I said in a different 
>>>> post farther down, the implementation of a "getAtIndex()" is obviously 
>>>> going to work like a search, but from "outside the box", what you're 
>>>> asking for is not the same.
>>>
>>> If you are indexing into a tree, it is considered a binary search, if 
>>> you are indexing into a hash, it is a search at some point to deal with 
>>> collisions.  People don't think about indexing as being a search, but in 
>>> reality it is.  A really fast search.
>>>
>>
>> It's implemented as a search, but I'd argue that the input/output 
>> specifications are different. And yes, I suppose that does put it into a 
>> bit of a grey area. But I wouldn't go so far as to say that, to the 
>> caller, it's the same thing, because there are differences. If you want 
>> get an element based on it's position in the collection, you call one 
>> function. If you want to get an element based on it's content instead of 
>> it's position, that's another function. If you want to get the position 
>> of an element based on it's content or it's identity, that's one or two 
>> more functions (depending, of course, if the element is a value type or 
>> reference type, respectively).
>
> I disagree.  I view the numeric index of an ordered container as a 'key' 
> into the container.  A keyed container has the ability to look up elements 
> quickly with the key.
>
> Take a quick look at dcollections' ArrayList.  It implements the Keyed 
> interface, with uint as the key.  I have no key for LinkList, because I 
> don't see a useful key.
>
>>> And I don't think search would be the name of the member function, it 
>>> should be something like 'getNth', which returns a cursor that points to 
>>> the element.
>>>
>>
>> Right, and outside of pure C, [] is the shorthand for and the 
>> standardized name for "getNth". If someone automatically assumes [] to be 
>> a simple lookup, chances are they're going to make the same assumption 
>> about anything named along the lines of "getNth". After all, that's what 
>> [] does, it gets the Nth.
>
> I view [] as "getByIndex", index being a value that offers quick access to 
> elements.  There is no implied 'get the nth element'.  Look at an 
> associative array.  If I had a string[string] array, what would you expect 
> to get if you passed an integer as the index?
>

You misunderstand. I'm well aware of the sequentially-indexed array vs 
associative array issues. I was just using "sequentially-indexed array" 
terminology to avoid cluttering the explanations with more general terms 
that would have distracted from bigger points. By "getNth", what I was 
getting at was "getByPosition". Maybe I should have been saying 
"getByPosition" from the start, my mistake. As you can see, I still consider 
the key of an associative array to be it's position. I'll explain why:

An associative array is the dynamic/runtime equivalent of a 
static/compiletime named variable (After all, in many dynamic languages, 
like PHP (not that I like PHP), named variables literally are keys into an 
implicit associative array). In a typical static or dynamic language, all 
variables are essentially made up of two parts: The raw data and a label. 
The label, obviously, is what's used to refer to the data. The label can be 
one of two things, an identifier or (in a non-sandboxed language) a 
dereferenced memory address.

So, borrowing the usual pointer metaphor of "memory as a series of labeled 
boxes", we can have the data "7" in the 0xA04D6'th "box" which is also 
labeled with the identifier "myInt". The memory address, obviously, is the 
position of the data. The identifier is another way to to refer the same 
position. "CPU: Where should I put this 7?" "High-level Code: In the 
location labeled with the identifier myInt".

The data of a variable corresponds to an element of any collection (array, 
assoc array, list). The memory addresses not only correspond to, but 
literally are sequential indicies into the array of addressable memory (ie, 
the key/position in a sequentially-indexed array). The identifier 
corresponds to the key of an associative array or other such collection. 
"CPU: Where, within the assoc array, should I put this 7?" "High-level Code: 
In the assoc array's box/element labeled myInt"

(With a linked list, of course, there's nothing that corresponds to the key 
of an assoc array, but it does have a natural sequential order.)

Maybe I can explain the "sorting" distinction I see a little bit better with 
our terminology hopefully now in closer sync: For any collection, each 
element has a concept of position (index/key/nth/whatever) and a concept of 
data. A collection is a series of "boxes". On the outside of each box is a 
label (position/index/key/nth/whatever). On the inside of each box is data. 
If the collection's base type is a reference type, then this "inside data" 
is, of course, a pointer/reference to more data somewhere else. There are 
two basic conceptual operations: "outside label -> inside data", and "inside 
data -> outside label".

The "inside data -> outside label" is always a search (although if the 
inside data contains a cached copy of it's outside label, then that's 
somewhat of a grey area. Personally, I would count it as a "cached search": 
usable just like a search, but faster).

The "outside label -> inside data" is, of course, our disputed 
"getAtPosition". In a linked list, it's a grey area similar to hat I called 
a "cached search" above. It's usable like an ordinary "getAtPosition", but 
slower. Sure, the implementation is done via a search algoritm, but if you 
call it a search that means that for a linked list, "getAtPosition" and 
search are the same thing (for whatever that implies, I don't have time to 
go any further on that ATM, so take it as you will).

I do understand though, that you're defining "index" and "search" 
essentially as "fast" and "slow" versions (respectively) of "X" -> "Y" 
regardless of which of X or Y is "outside label" and which is "inside data". 
Personally, I find that awkward and somewhat less useful since that means 
"index" and "search" each have multiple "input vs. output" behaviors (Ie, 
there's still the question of "Am I giving the outside position and getting 
the inside data, or vice versa?").