Const, invariant, strings and "constant data need never be copied"

Thu Nov 1 11:21:50 PDT 2007

In DMD 2.006, the definition of string was changed from const(char)[] to 
invariant(char)[] (and similarly wstring and dstring).  This change has no 
doubt broken a fair amount of D 2.x code.  While at it, I've found that the 
functions in std.string use a mix of char[] and string, but only one uses 
const(char)[].

It's worth thinking about what's actually best, from both coding and runtime 
efficiency POVs.

Let's first look at string manipulation functions such as those in 
std.string.  These merely look at a passed-in string and return something - 
they don't keep the string for later.  They therefore need only a read-only 
view of a string - they don't need to know that the string is never going to 
change.  Declaring these with invariant parameters therefore means that it 
is often necessary to .idup a string just to pass it to one of these 
functions.  Moreover, if a piece of code manipulates strings with a mixture 
of direct modification and calls to std.string functions, it necessitates 
quite a bit of copying of strings.

Consider, for example, a program that reads in a text file, normalises the 
line breaks and then ROT13 encodes the result.  Under D 1.x, this works:

----------
import std.file, std.string, std.cstream;

void main(string[] a) {
    char[] text = cast(char[]) read(a[1]);

    text = text.replace("\r\n", "\n").replace("\r", "\n");

    foreach (ref char c; text) {
        if ((c >= 'A' && c <= 'M') || (c >= 'a' && c <= 'm')) {
            c += 13;
        } else if ((c >= 'N' && c <= 'Z') || (c >= 'n' && c <= 'z')) {
            c -= 13;
        }
    }

    dout.writeString(text);
}
----------

Note that the text is never copied after it is read in, except by 
std.string.replace if it actually makes any change.  In D 2.x, it's 
necessary to change one line, to something like

    text = text.idup.replace("\r\n", "\n").replace("\r", "\n").dup;

which consequently adds two copy operations.

There are a few caveats to this example:
- the .idup is only because std.file.read currently returns a mutable 
void[] - we could actually cast it to an invariant as nothing else is going 
to use it
- there's no significant reason to normalise the line breaks before, rather 
than after, ROT13ing it
- if all we're going to do is output it, we could output on the fly rather 
than trying to modify the text in memory

but these won't be true in the more general case.  There are probably plenty 
of more involved examples in which there's more difference than this between 
the 1.x and 2.x code.

A fairly recent discussion
http://tinyurl.com/2kgpqg
touched on the question of whether functions that generate a string and 
return it should return mutable or immutable references.  And the discussion 
was by no means conclusive.

If a public library function returns a mutable array reference, it can mean 
either:
(a) it is giving up ownership of the memory the array occupies
(b) it is giving the caller direct access to data it holds for some purpose 
(std.mmfile is an example of this)

In case (a), if there's no risk of there being other references to the same 
memory (as is the case if the function always allocates it) then it would 
make sense to give the caller the choice of whether it should be mutable. 
Indeed, the choice is already there, but in no way that protects against 
inadvertently trying it on something of case (b).

Unfortunately, constness doesn't play well with copy-on-write.  Needless to 
reiterate, std.string's functions want at least a read-only view of the 
array.  But the caller might still want what's returned to be mutable.  If 
the function is going to return the passed-in string, it can only (sensibly) 
return a read-only view since that's what it received.  While the caller 
could try the D&D trick of casting away the constness, there is a risk of 
catastrophic failure if some std.string implementation caches strings for 
reuse.

But invariant clearly has some use.  It enables the claim that "constant 
data need never be copied" to work, as long as invariant is used well.  So 
if something receives a string from the caller and wants to save it for 
later use/retrieval, declaring the parameter as invariant will mean that the 
callee won't need to copy the data.  So there's the benefit that, by making 
it the caller's responsibility to .idup the data if necessary, it'll save 
the overhead of unnecessary copying.  Similarly, when something later wants 
to retrieve the data, if it is invariant then there's no need to copy it.  A 
library and an application that uses it can share one copy of the data, and 
believe that the data will never change.

To round things up, the D 2.x const/final/invariant thing is certainly 
useful, but not perfect.  The different storage classes/type modifiers are 
good for different things.  It might be worth a good think about how Phobos 
uses them (or in some cases doesn't), and what is best practically for the 
definitions of string/wstring/dstring.  (Was there a discussion I missed?)

Of course, it would also be worth a good think about whether anything can be 
added or changed in the language to improve matters.  Ideas that come to my 
mind are:

1. A property .invariant, which just returns the reference as an invariant 
if it's already invariant, otherwise does the same as .idup.  For this to 
work, the runtime would have to keep a record of whether each piece of 
allocated memory is invariant, which would interfere with the current 
ability to cast invariance in or out - but at what cost?

2. Some type modifier such as 'unique' that would indicate that only one 
reference to the data exists.  I'm not sure what rules there should be to 
enforce this, or if we should just go on trust.  But it would be implicitly 
convertible to mutable, const or invariant, enabling something like
----------
unique(int)[] rep(int i) {
    unique(int)[] result;
    result.length = i % 10;
    result[] = i;
    return result;
}

int[] twos = rep(2);
const(int) fives = rep(5);
invariant(int)[] twelves = rep(12);
----------

3. Some concept of const-transparent functions.  One approach would be to 
enable a type modifier (or the lack thereof) to be used as a template 
parameter, with something like
----------
T(char)[] doSomethingWith(typemod T)(T(char)[] param) {
    ...
}

char[] str;
const(char)[] cstr;
invariant(char)[] istr;

str = doSomethingWith(str);
cstr = doSomethingWith(cstr);
istr = doSomethingWith(istr);
----------

This would enable copy on write to work well.  As long as nothing _within_ 
the function so templated relies on the distinction, the compiler could 
optimise by generating only one instance, since it affects only compile-time 
type checking and not code generation.

Comments?

Stewart.

-- 
My e-mail address is valid but not my primary mailbox.  Please keep replies
on the 'group where everybody may benefit.