C/C++ style Crashes?

Steve Horne stephenwantshornenospam100 at aol.com
Mon Jan 15 19:16:46 PST 2007


On Mon, 15 Jan 2007 17:54:57 +0100, Sebastian Biallas
<groups.5.sepp at spamgourmet.com> wrote:

>a) Where do you need extremely large lists? Lists have a search time in
>O(n). Ok, you have a O(1) insertion/deletion here, but only at one
>place, since your iterators get void after an insertion/deletion
>(contrary to normal double linked lists)

Depends what you're storing in the list. Consider a word processor
that is expected to handle large documents, for instance. You keep a
few iterators for key current positions, but almost all accesses are
sequential. Jumping to a specific page number is not, but that can
afford a delay, or else you can keep a mapping of page numbers to
start position iterators separately. A huge array of characters would
cause problems with inserts and deletes, but a list handles this
easily.
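As a minimal sketch of the idea (names are illustrative, and I'm using std::list purely for brevity): inserts and erases are O(1) at any held position, and cursors elsewhere in the document stay valid.

```cpp
#include <cassert>
#include <list>

// Hypothetical text buffer: a doubly linked list of characters.
// Iterators held for "key current positions" (cursors) remain valid
// across inserts and erases elsewhere in the document.
struct Buffer {
    std::list<char> text;

    // Insert a character before the cursor position; O(1).
    std::list<char>::iterator insert(std::list<char>::iterator cursor, char c) {
        return text.insert(cursor, c);
    }

    // Erase the character at the cursor; O(1). Returns the next position.
    std::list<char>::iterator erase(std::list<char>::iterator cursor) {
        return text.erase(cursor);
    }
};
```

A huge array would have to shuffle everything after the edit point; here the cost of an edit is independent of document size.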

There are other approaches, but few as simple or elegant, since they tend
to require two data structures working together (for example, a linked
list of short strings).

>b) They don't work in a GC environment or require special handling.

It's called opting out of garbage collection, and certainly that would
be required in this case. Rather handy, therefore, that D allows it.

>Are they really used somewhere out of the IOCCC? I'd be curious to see a
>real world example.

The obfuscated C contest is about obfuscated code style. These lists
can be implemented as cleanly and clearly as any other data structure
- the code is far simpler than, for instance, red-black balanced
binary trees (as used in the STL).
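For a concrete example - assuming the lists in question are XOR-linked lists, which is what the GC and IOCCC remarks suggest - here is a hypothetical sketch showing that the technique can be written quite readably:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical XOR-linked list of ints: each node stores the XOR of its
// neighbours' addresses, halving the link overhead of a doubly linked
// list. A tracing GC cannot recognise these links as pointers, hence
// the need to opt out of collection. (Cleanup omitted for brevity.)
struct XNode {
    int value;
    std::uintptr_t link;   // addr(prev) XOR addr(next)
};

inline std::uintptr_t addr(XNode* p) {
    return reinterpret_cast<std::uintptr_t>(p);  // null maps to 0 here
}

struct XorList {
    XNode* head = nullptr;
    XNode* tail = nullptr;

    void push_back(int v) {
        XNode* n = new XNode{v, addr(tail)};   // prev = old tail, next = null
        if (tail) tail->link ^= addr(n);       // splice into old tail's link
        else      head = n;
        tail = n;
    }

    // Walk forward: next = node->link XOR addr(prev).
    template <typename F>
    void for_each(F f) const {
        XNode* prev = nullptr;
        XNode* cur  = head;
        while (cur) {
            f(cur->value);
            XNode* next = reinterpret_cast<XNode*>(cur->link ^ addr(prev));
            prev = cur;
            cur  = next;
        }
    }
};
```

The link field is the XOR of the two neighbouring addresses, which is exactly why a tracing collector cannot follow it - but as code, it is no more obscure than any other linked structure.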

So consider the word processor example. One alternative approach would
be to use a huge vector of characters, but that wouldn't work well for
insertions and deletions - the larger the document, the longer it
takes to resize the array. Another approach is to use a list of short
arrays of characters, but then you're mixing two data structure
approaches. Issues like merging overly small arrays and splitting
overly long ones would complicate the code - by your logic, making it
a candidate for the obfuscated C contest itself.

It is possible to create a tree data structure that implements a
vector-style container with efficient inserts and deletes anywhere,
and using a multiway tree keeps the overheads relatively small and
helps with cache performance. But you won't find support for this data
structure in any existing language or standard library that I know of,
so we are back in creating-a-new-core-data-structure territory, with
the usual bloat management issues. It is also a significantly more
complex data structure to implement than the list approach already
described (trust me, I've done exactly this in C++).
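As a hypothetical sketch of the core trick - using a plain binary tree with subtree counts rather than a multiway tree, and skipping rebalancing, purely for brevity - each node records its subtree size, so element number i can be found, and an insert made at any index, in time proportional to the tree height:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counted tree: subtree sizes turn a search tree into a
// vector-like container indexed by position. (No rebalancing here; a
// real implementation would use a balanced, multiway variant.)
struct CNode {
    char value;
    std::size_t size = 1;        // number of nodes in this subtree
    CNode *left = nullptr, *right = nullptr;
    explicit CNode(char v) : value(v) {}
};

std::size_t size_of(CNode* n) { return n ? n->size : 0; }

// Insert value so that it becomes element number `index` (0-based).
CNode* insert_at(CNode* root, std::size_t index, char v) {
    if (!root) return new CNode(v);
    std::size_t left_size = size_of(root->left);
    if (index <= left_size)
        root->left = insert_at(root->left, index, v);
    else
        root->right = insert_at(root->right, index - left_size - 1, v);
    ++root->size;
    return root;
}

// Fetch element number `index` by descending with the subtree counts.
char at(CNode* root, std::size_t index) {
    std::size_t left_size = size_of(root->left);
    if (index < left_size) return at(root->left, index);
    if (index == left_size) return root->value;
    return at(root->right, index - left_size - 1);
}
```

A production version would keep the tree balanced (and multiway, for cache behaviour), but the rank-indexing logic stays the same.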

I've never needed to write a word processor myself (I've written text
layout code, but it was non-interactive, and I've written a text
editor, but that was back in the days of DOS 3). If you ever need to
do so, though, I'd recommend that you give serious consideration to
the list approach I described. Of the options that can actually handle
the job efficiently, it may well be the simplest and most elegant one
available.

>> In addition, pointer casting is useful for managing template bloat.
>> The basic approach is to write type-unsafe non-template code, and then
>> use a template to add a typesafe wrapper. 
>
>This works with normal pointer casting (no need for arithmetic). Hey,
>this was even the normal usage of Java-Containers before they had the
>template stuff.

Yes - that was the explanation leading into the point, not the point
itself.
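For concreteness, a hypothetical sketch of that basic approach - a void*-based core compiled once, plus a thin typesafe template wrapper (all names illustrative):

```cpp
#include <cassert>
#include <vector>

// Non-template core, compiled once: stores untyped pointers.
class StackCore {
    std::vector<void*> items;
public:
    void push(void* p)  { items.push_back(p); }
    void* pop()         {                       // precondition: not empty
        void* p = items.back();
        items.pop_back();
        return p;
    }
    bool empty() const  { return items.empty(); }
};

// Thin typesafe wrapper: only these trivial inline casts are
// instantiated per element type, which limits template bloat.
template <typename T>
class Stack {
    StackCore core;
public:
    void push(T* p)     { core.push(p); }
    T* pop()            { return static_cast<T*>(core.pop()); }
    bool empty() const  { return core.empty(); }
};
```

All the real logic lives in the non-template core; each instantiation of the wrapper adds only the casts.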

>> But in many cases, more
>> flexibility is needed than can be handled using simple pointer
>> casting.
>> 
>> For example, consider a library to handle flexible tree data
>> structures. The container being implemented may or may not have a key
>> field. It may or may not have a data field. It may or may not have
>> subtree summary data stored within nodes. The non-template part needs
>> a way to access fields such that the position within a node struct (of
>> an unknown datatype) is determined at run-time.
>
>I don't think it's a good idea to trade portability for a few cycles.
>And this might be even slower, since you confuse the compiler
>(alias-analysis etc).

It should be clear that we're not talking about everyday apps
programming here. To the extent that it would be used in an
application at all, the word processor example above is probably
relevant - it would implement a core data structure in a large
application, but most developers working on the app wouldn't touch the
internals of that data structure.

The portability issues are real but not hard to handle. For getting
the offset of a specific field, for instance, the standard C header
<stddef.h> provides an offsetof macro for precisely this kind of work.
The template wrapper simply defines the struct it needs and uses
offsetof to extract the needed offsets - technicalities such as
alignment and padding are handled automatically. The biggest issue is
making sure that offsets are held in integers of the right size, and
even that isn't much of a hassle - if you can't find an
integer-the-same-size-as-a-pointer type in the language or library
(C99's uintptr_t, for instance), you simply define your own using
conditional compilation or template specialisation or whatever.
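A hypothetical sketch of what that looks like (illustrative names): the generic, non-template code sees nodes as raw bytes plus runtime offsets, while the wrapper derives those offsets from the concrete struct it defines.

```cpp
#include <cassert>
#include <cstddef>

// Layout descriptor filled in by the typesafe wrapper; the generic
// code below knows nothing about the concrete node type.
struct NodeLayout {
    std::size_t key_offset;
    std::size_t data_offset;   // could be absent in a key-only container
};

// Generic, non-template accessor: read an int key out of an untyped
// node, using an offset determined at run time.
int get_key(const void* node, const NodeLayout& layout) {
    return *reinterpret_cast<const int*>(
        static_cast<const char*>(node) + layout.key_offset);
}

// Template wrapper: defines the concrete node struct and derives the
// offsets with offsetof, so alignment and padding are handled for us.
template <typename Data>
struct Node {
    int  key;
    Data data;

    static NodeLayout layout() {
        return NodeLayout{ offsetof(Node, key), offsetof(Node, data) };
    }
};
```

Note that offsetof is only guaranteed for standard-layout types, which these simple node structs are; the generic core never needs to be recompiled per type.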

As for the 'a few cycles' thing, remember that we are talking about
defining core data structures here - things that will be accessed a
great deal, often in inner loops. The performance of core data
structures is important enough that Walter makes a big deal of the few
cycles per access saved by D's handling of hash tables (using
binary trees to handle collisions, as opposed to other common
implementations that use linked lists), and rightly so. If you have an
application that does millions of accesses, those few cycles per
access can really start to accumulate.

In fact one of the reasons why scripting languages have their core
data structures built into the language is simply that if they were
implemented using the scripting languages themselves, the performance
would be far too poor to be practical. Having efficient operations on
core data structures is extremely important.

It is of course true that, in general, programmers shouldn't fuss
about code optimisation too much - choice of algorithm and data
structure, yes, but fiddling with implementation details definitely
not. Key operations on core data structures are often an exception to
this rule, however.

And yes, there is a real chance that doing this may be slower for some
operations - very much slower. That is why I said...

: The most efficient way to handle this depends on what kind of
: operation is being implemented. A single data structure may have two
: or more ways to access the same field, just to ensure that all
: operations can be handled efficiently - many accesses will probably be
: done as part of small functions supplied by the typesafe wrapper, to
: keep the call overheads down in inner loops.

... and also why I said these techniques are needed to manage bloat
(rather than completely eliminate it). The compiler can handle these
small inner-loop functions very well, but they create bloat since each
instantiation of the template wrapper creates a new set of these small
functions. Therefore, there is an issue of balance.

So it is absolutely true that these are not things to mess with every
day. They require care, and for most things you are far better off
sticking with the containers supplied by the language itself or its
standard libraries. But there are exceptions to this rule, and
depending on the kind of code that you tend to work on, these
exceptions can happen more often than you might realise. And in any
case, someone has to write the standard library containers in the
first place.

-- 
Remove 'wants' and 'nospam' from e-mail.



More information about the Digitalmars-d mailing list