Is all this Invariant **** er... stuff, premature optimisation?

Me Here p9e883002 at sneakemail.com
Mon Apr 28 17:04:06 PDT 2008


Walter Bright wrote:

>p9e883002 at sneakemail.com wrote:

>>Did I suggest this was an optimisation?
>
>You bring up a good point.

Sorry to have provoked you, Walter, but thanks for your reply.

>On a tiny example such as yours, where you can see everything that is 
>going on at a glance, such as where strings come from and where they are 
>going, there isn't any point to immutable strings. You're right about that.

Well, obviously the example was trivial, in order to concentrate attention 
on the issue I was having.

>  It's real easy to lose track of who owns a string, who else has references to the string, who has rights to change the string and who doesn't.

The keyword in there is "who". The problem is that you are pessimising the 
entire language, once rightly famed for its performance, for *all* users, 
for the notional convenience of the few writing threaded applications. 
Now don't go taking that the wrong way. In other circles, I am known as 
"Mr. Threading", at least for my advocacy of threads, if not my expertise. 
I have been using them for a relatively long time, going back to pre-1.0 
OS/2 (then known internally as CP/DOS). I mention this only to show I'm 
not in the "thread is spelt f-o-r-k" camp.

>For example, you're changing the char[][] passed in to main(). What if one 
>of those strings is a literal in the read-only data section?

Okay. So that raises the question: how does runtime external data end up 
in a read-only data section? Of course, it can be done, but that raises a 
further question: why? Let's ignore that for now, though, and concentrate 
on the development of my application, which wants to mutate one or more of 
those strings.

The first time I try to mutate one, I'm going to hit an error, either at 
compile time or at runtime, and immediately know, assuming the error 
message is reasonably understandable, that I need to copy the immutable 
string into something I can mutate. A quick, *single* dup, and I'm away 
and running.
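
To make that concrete, something like this quick, untested sketch (D2, 
where string is an alias for invariant(char)[]; the function and its 
names are illustrative, not from any real program):

    void process(string arg)
    {
        char[] buf = arg.dup;          // one copy up front...
        foreach (ref c; buf)           // ...then mutate in place freely
            if (c >= 'a' && c <= 'z')
                c -= 'a' - 'A';        // e.g. ASCII uppercasing
    }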

Provided, that is, that I have the tools to do what I need. In this case, 
and this was the entire point of the original post, that means a library 
of common string manipulation functions that work on my good old-fashioned 
char[]s without my needing to jump through the hoops of neo-orthodoxy to 
use them.

But, as I tried to point out in the post to which you replied, the whole 
'args' thing is a red herring. It was simply a convenient source of 
non-compile-time data. I couldn't get the std.stream example to compile, 
apparently due to a bug in the v2 libraries--see elsewhere.

In this particular case, I turned to D in order to manipulate 125,000,000 
x 500-to-2000-byte strings: a dump of an inverted index DB. I usually do 
this kind of stuff in a popular scripting language, but that proved rather 
too slow for this volume of data. Each of those records needs to go 
through multiple mutations: uppercasing certain fields; completely 
removing certain characters within substantial subsets of each record; 
recalculating and adjusting an embedded hex digest within each record to 
reflect the preceding changes. All told, each record may go through 
anything from 5 to 300 separate mutations.

Doing this via immutable buffers is going to create scads and scads of 
short-lived, immutable sub-elements that will just tax the GC to hell and 
impose unnecessary and unacceptable time penalties on the process. And I 
almost certainly will have to go through the process many times before I 
get the data in the ultimate form I need.
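
The character-removal passes, for instance, want to look something like 
this sketch (in-place compaction; a single unwanted character stands in 
for the real predicate):

    size_t stripChar(char[] rec, char unwanted)
    {
        size_t j = 0;
        foreach (c; rec)
            if (c != unwanted)
                rec[j++] = c;          // compact survivors leftward
        return j;                      // new logical length: rec[0 .. j]
    }

No allocation at all. With invariant buffers, every one of those 5 to 300 
mutations per record means a fresh, short-lived allocation for the GC to 
chase.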

>So what happens is code starts defensively making copies of the string 
>"just in case." I'll argue that in a complex program, you'll actually wind 
>up making far more copies than you will with invariant strings.
>[from another post] I bet that, though, after a while they'll evolve to 
>eschew it in favor of immutable strings. It's easier than arguing about it

You are so wrong here. I spent two of the worst years of my coding career 
working in Java, and ended up fighting it all the way. While some of that 
was due to the sudden re-invention of major parts of the system libraries 
in completely incompatible ways in the transition from (from memory) 1.2 
to 1.3--and being forced to make the change because of the near-total 
abandonment of support and bug fixing for the 'old' libraries--another big 
part of the problem was the endless complexity involved in switching 
between the String type and the StringBuffer type.

Please learn from history. Talk to (experienced) Java programmers. I mean 
real working stiffs, not OO-purists from academia; preferably some who 
have experience of other languages as well. It took until v1.5 before the 
performance of Java--and the dreaded GC pregnant pause--finally reached a 
point where Java's performance for manipulating large datasets was both 
reasonable and, more importantly, reasonably deterministic. Don't make 
their mistakes over again.

Too many times in the last thirty years I've seen promising, pragmatic 
software technologies tail off into academic obscurity because the primary 
motivators suddenly "got religion". Whether OO dogma or functional purity 
or whatever other brand of neo-orthodoxy became the flavour du jour, the 
assumption that "they'll see the light eventually" has been the downfall 
of many a promising start.

Just as the answer to the occasional hit-and-run death is not banning 
cars, so fixing unintentional aliasing in threaded applications does not 
lie in forcing all character arrays to be immutable.

For one reason, it doesn't stop there. Character arrays are just arrays 
of numbers. Exactly the same problems arise with arrays of integers, 
reals, associative arrays, etc. Imagine the cost of duplicating an entire 
hash every time you add a new key or alter a value. The penalty grows 
linearly with the size of the hash (or of the array of ints, longs, 
reals...), so building one up entry by entry costs O(n^2) overall.
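
In code, the invariant-hash discipline would look something like this 
sketch (nobody is proposing exactly this API; it just shows the cost 
model):

    int[string] withEntry(int[string] m, string k, int v)
    {
        int[string] copy;
        foreach (key, val; m)
            copy[key] = val;           // O(n) copy per single update...
        copy[k] = v;                   // ...just to change one entry
        return copy;
    }

versus today's m[k] = v, which touches one slot in place.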

And before you reject this notion on the basis that "I'd never do that", 
what's the difference? Are strings any more vulnerable to the problems 
invariance is meant to tackle than these other datatypes?

Try manipulating large datasets--images, DNA data, signal processing, 
finite element analysis; any of the types of application for which 
multi-threading isn't just a way to let the program do something useful 
while the user decides which button to click--in any of the "referentially 
transparent" languages that are concurrency capable, and see the hoops you 
have to leap through to achieve anything like decent performance. E.g. 
Haskell's Unsafe* library routines (basically: abandon referential 
transparency for this data so that we can get something done in a 
reasonable time frame!). Look for "If you can match 1-core C speed using 
4-core Haskell parallelism without "unsafe pseudo-C in Haskell" trickery, 
I will be impressed. ..." in the following article:   
http://reddit.com/r/programming/info/61p6f/comments/

The abandonment or deprecation of lvalue slices on string types is the 
thin end of the wedge toward referential transparency. Despite all the 
academic hype and the impressive (small-scale) demos of the 'match made 
in heaven' that is 'referential transparency & concurrency', try to seek 
out real-world examples of the combination running in real-world 
environments, i.e. where someone other than the taxpayer of whatever 
country is paying for the development, and the time pressure to obtain 
results is a little more demanding than a thesis submission date, and 
you'll find them very conspicuous by their absence.
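
For concreteness, the lvalue slice in question (a two-line sketch):

    char[] buf = "hello world".dup;
    buf[0 .. 5] = "HELLO";          // fine on a mutable char[]

    string s = "hello world";
    // s[0 .. 5] = "HELLO";         // rejected: cannot modify invariant data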

Such ideas look great on paper, in the heady world of ideal Turing 
Machines with unlimited-length tapes (unbounded memory). But once you 
bring them back to the real world of finite RAM, fragmentable heaps and 
GC, they become impractical: unworkable for real data sets in real time.

Don't feel the need to argue this on-forum. If this hasn't persuaded you 
that forcing invariance upon one datatype, by providing a string library 
that only works with invariant strings, will do little to address the 
problems it attempts to solve, then I doubt further discussion will. 
Please return to the pragmatism that so stood out in your early visions 
for D, and abandon this folly before, as with so many of the follies of 
the gentleman academic of yore, it becomes a life-long quest ending as a 
memorial or tombstone.

Cheers, b.
-- 


