Const, invariant, strings and "constant data need never be copied"
Stewart Gordon
smjg_1998 at yahoo.com
Thu Nov 1 11:21:50 PDT 2007
In DMD 2.006, the definition of string was changed from const(char)[] to
invariant(char)[] (and similarly wstring and dstring). This change has no
doubt broken a fair amount of D 2.x code. While at it, I've found that the
functions in std.string use a mix of char[] and string, but only one uses
const(char)[].
It's worth thinking about what's actually best, from both coding and runtime
efficiency POVs.
Let's first look at string manipulation functions such as those in
std.string. These merely look at a passed-in string and return something -
they don't keep the string for later. They therefore need only a read-only
view of a string - they don't need to know that the string is never going to
change. Declaring these with invariant parameters therefore means that it
is often necessary to .idup a string just to pass it to one of these
functions. Moreover, if a piece of code manipulates strings with a mixture
of direct modification and calls to std.string functions, it necessitates
quite a bit of copying of strings.
Consider, for example, a program that reads in a text file, normalises the
line breaks and then ROT13 encodes the result. Under D 1.x, this works:
----------
import std.file, std.string, std.cstream;
void main(string[] a) {
char[] text = cast(char[]) read(a[1]);
text = text.replace("\r\n", "\n").replace("\r", "\n");
foreach (ref char c; text) {
if ((c >= 'A' && c <= 'M') || (c >= 'a' && c <= 'm')) {
c += 13;
} else if ((c >= 'N' && c <= 'Z') || (c >= 'n' && c <= 'z')) {
c -= 13;
}
}
dout.writeString(text);
}
----------
Note that the text is never copied after it is read in, except by
std.string.replace if it actually makes any change. In D 2.x, it's
necessary to change one line, to something like
text = text.idup.replace("\r\n", "\n").replace("\r", "\n").dup;
which consequently adds two copy operations.
There are a few caveats to this example:
- the .idup is only because std.file.read currently returns a mutable
void[] - we could actually cast it to an invariant as nothing else is going
to use it
- there's no significant reason to normalise the line breaks before, rather
than after, ROT13ing it
- if all we're going to do is output it, we could output on the fly rather
than trying to modify the text in memory
but these won't be true in the more general case. There are probably plenty
of more involved examples in which there's more difference than this between
the 1.x and 2.x code.
A fairly recent discussion
http://tinyurl.com/2kgpqg
touched on the question of whether functions that generate a string and
return it should return mutable or immutable references. And the discussion
was by no means conclusive.
If a public library function returns a mutable array reference, it can mean
either:
(a) it is giving up ownership of the memory the array occupies
(b) it is giving the caller direct access to data it holds for some purpose
(std.mmfile is an example of this)
In case (a), if there's no risk of there being other references to the same
memory (as is the case if the function always allocates it) then it would
make sense to give the caller the choice of whether it should be mutable.
Indeed, the choice is already there, but in no way that protects against
inadvertently trying it on something of case (b).
Unfortunately, constness doesn't play well with copy-on-write. Needless to
reiterate, std.string's functions want at least a read-only view of the
array. But the caller might still want what's returned to be mutable. If
the function is going to return the passed-in string, it can only (sensibly)
return a read-only view since that's what it received. While the caller
could try the D&D trick of casting away the constness, there is a risk of
catastrophic failure if some std.string implementation caches strings for
reuse.
But invariant clearly has some use. It enables the claim that "constant
data need never be copied" to work, as long as invariant is used well. So
if something receives a string from the caller and wants to save it for
later use/retrieval, declaring the parameter as invariant will mean that the
callee won't need to copy the data. So there's the benefit that, by making
it the caller's responsibility to .idup the data if necessary, it'll save
the overhead of unnecessary copying. Similarly, when something later wants
to retrieve the data, if it is invariant then there's no need to copy it. A
library and an application that uses it can share one copy of the data, and
believe that the data will never change.
To round things up, the D 2.x const/final/invariant thing is certainly
useful, but not perfect. The different storage classes/type modifiers are
good for different things. It might be worth a good think about how Phobos
uses them (or in some cases doesn't), and what is best practically for the
definitions of string/wstring/dstring. (Was there a discussion I missed?)
Of course, it would also be worth a good think about whether anything can be
added or changed in the language to improve matters. Ideas that come to my
mind are:
1. A property .invariant, which just returns the reference as an invariant
if it's already invariant, otherwise does the same as .idup. For this to
work, the runtime would have to keep a record of whether each piece of
allocated memory is invariant, which would interfere with the current
ability to cast invariance in or out - but at what cost?
2. Some type modifier such as 'unique' that would indicate that only one
reference to the data exists. I'm not sure what rules there should be to
enforce this, or if we should just go on trust. But it would be implicitly
convertible to mutable, const or invariant, enabling something like
----------
unique(int)[] rep(int i) {
unique(int)[] result;
result.length = i % 10;
result[] = i;
return result;
}
int[] twos = rep(2);
const(int) fives = rep(5);
invariant(int)[] twelves = rep(12);
----------
3. Some concept of const-transparent functions. One approach would be to
enable a type modifier (or the lack thereof) to be used as a template
parameter, with something like
----------
T(char)[] doSomethingWith(typemod T)(T(char)[] param) {
...
}
char[] str;
const(char)[] cstr;
invariant(char)[] istr;
str = doSomethingWith(str);
cstr = doSomethingWith(cstr);
istr = doSomethingWith(istr);
----------
This would enable copy on write to work well. As long as nothing _within_
the function so templated relies on the distinction, the compiler could
optimise by generating only one instance, since it affects only compile-time
type checking and not code generation.
Comments?
Stewart.
--
My e-mail address is valid but not my primary mailbox. Please keep replies
on the 'group where everybody may benefit.
More information about the Digitalmars-d
mailing list