Why Strings as Classes?
Benji Smith
dlanguage at benjismith.net
Mon Aug 25 13:27:26 PDT 2008
In another thread (about array append performance) I mentioned that
Strings ought to be implemented as classes rather than as simple builtin
arrays. Superdan asked why. Here's my response...
I'll start with a few of the softball, easy reasons.
For starters, with strings implemented as character arrays, writing
library code that accepts and operates on strings is a bit of a pain in
the neck, since you always have to write templates, and template code is
slightly less readable than non-template code. You can't distribute your
code as a DLL or a shared object, because the template instantiations
won't be included (unless you create wrapper functions with explicit
template instantiations for char[], wchar[], and dchar[], bloating your
code size, but more importantly tripling the number of functions in
your API).
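
To make the contrast concrete, here is a minimal sketch in Java (chosen only for illustration, since it already has an interface-based string abstraction). CharSequence is a real JDK interface; TextLib and countVowels are hypothetical names invented for this example. The point is that the library exports a single, non-generic function and every string-like type goes through it.

```java
// Hypothetical library class. Only CharSequence is a real JDK type here.
final class TextLib {
    // One non-generic entry point: callers may pass any CharSequence
    // implementation (String, StringBuilder, CharBuffer, ...).
    static int countVowels(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if ("aeiouAEIOU".indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }
}
```

The same compiled function accepts TextLib.countVowels("hello"), TextLib.countVowels(new StringBuilder("hello")), and TextLib.countVowels(java.nio.CharBuffer.wrap("hello")), so nothing needs to be instantiated per string type. The trade-off is a virtual call per character access.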
Another good low-hanging argument is that strings are frequently used as
keys in associative arrays. Every insertion and retrieval in an
associative array requires a hashcode computation. And since D strings
are just dumb arrays, they have no way of memoizing their hashcodes.
We've already observed that D assoc arrays are less performant than even
Python maps, so the extra cost of lookup operations is unwelcome.
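
A rough sketch of what memoizing a hashcode could look like, written in Java for illustration. HashedText is a hypothetical class, and the hashComputations counter exists only to make the caching visible. It assumes the text is immutable after construction, which is what makes caching safe.

```java
import java.util.Arrays;

// Hypothetical immutable string class that computes its hash once and reuses it.
final class HashedText {
    private final char[] chars;
    private int hash;              // cached value, valid only when hashComputed is true
    private boolean hashComputed;
    int hashComputations;          // instrumentation for this example only

    HashedText(String s) {
        this.chars = s.toCharArray();
    }

    @Override
    public int hashCode() {
        if (!hashComputed) {
            int h = 0;
            for (char c : chars) {
                h = 31 * h + c;    // same polynomial scheme as java.lang.String
            }
            hash = h;
            hashComputed = true;
            hashComputations++;
        }
        return hash;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof HashedText
            && Arrays.equals(chars, ((HashedText) other).chars);
    }
}
```

Using the same HashedText instance as a key for one insertion and several lookups in a java.util.HashMap walks the characters only once; every later call to hashCode() returns the stored value. A bare array has nowhere to keep that cached field.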
But much more important than either of those reasons is the lack of
polymorphism on character arrays. Arrays can't have subclasses, and they
can't implement interfaces.
A good example of what I'm talking about can be seen in the Phobos and
Tango regular expression engines. At least the Tango implementation
matches against all string types (the Phobos one only works with char[]
strings).
But what if I want to consume a 100 MB logfile, counting all lines that
match a pattern?
Right now, to use either regex engine, I have to read the entire
logfile into an enormous array before invoking the regex search function.
Instead, what if there was a CharacterStream interface? And what if all
the text-handling code in Phobos & Tango was written to consume and
return instances of that interface?
A regex engine accepting a CharacterStream interface could process text
from string literals, file input streams, socket input streams, database
records, etc, etc, etc... without having to pollute the API with a bunch
of casts, copies, and conversions. And my logfile processing application
would consume only a tiny fraction of the memory needed by the character
array implementation.
Most importantly, the contract between the regex engine and its
consumers would provide a well-defined interface for processing text,
regardless of the source or representation of that text.
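
Here is one way such an interface might look, sketched in Java for illustration. Everything except the JDK types (Reader, StringBuilder, IOException, UncheckedIOException) is hypothetical: the single next() method, the two implementations, and LineCounter are assumptions made for this example. A plain substring test stands in for a real regex engine, which would be much more work to write against a stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical interface: yields one character at a time, or -1 at end of input.
interface CharacterStream {
    int next();
}

// Backed by an in-memory string, e.g. a literal.
final class StringCharacterStream implements CharacterStream {
    private final String text;
    private int pos;

    StringCharacterStream(String text) { this.text = text; }

    @Override
    public int next() {
        return pos < text.length() ? text.charAt(pos++) : -1;
    }
}

// Backed by any java.io.Reader (file, socket, ...).
final class ReaderCharacterStream implements CharacterStream {
    private final Reader reader;

    ReaderCharacterStream(Reader reader) { this.reader = reader; }

    @Override
    public int next() {
        try {
            return reader.read();   // returns -1 at end of stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class LineCounter {
    // Counts lines containing `needle`, holding at most one line in memory.
    static int countLinesContaining(CharacterStream in, String needle) {
        int matches = 0;
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.next()) != -1) {
            if (c == '\n') {
                if (line.indexOf(needle) >= 0) matches++;
                line.setLength(0);  // reuse the buffer for the next line
            } else {
                line.append((char) c);
            }
        }
        // Handle a final line that has no trailing newline.
        if (line.length() > 0 && line.indexOf(needle) >= 0) matches++;
        return matches;
    }
}
```

The same countLinesContaining call works on new StringCharacterStream(text) and on new ReaderCharacterStream(reader). For a large logfile the reader would normally be wrapped in a java.io.BufferedReader, so memory use is bounded by the buffer plus the longest line rather than by the file size.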
In a similar vein, I've worked on a lot of parsers over the past few
years, for domain specific languages and templating engines, and stuff
like that. Sometimes it'd be very handy to define a "Token" class that
behaves exactly like a String, but with some additional behavior.
Ideally, I'd like to implement that Token class as an implementor of the
CharacterStream interface, so that it can be passed directly into other
text-handling functions.
But, in D, with no polymorphic text handling, I can't do that.
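
For comparison, this is roughly what the Token idea looks like in a language with polymorphic text handling. The sketch is in Java, where the real CharSequence interface plays the role I have in mind for CharacterStream. The Token class and its line and column fields are hypothetical.

```java
// Hypothetical parser token: behaves like a string but also carries
// the source position where it was found.
final class Token implements CharSequence {
    private final String text;
    final int line;
    final int column;

    Token(String text, int line, int column) {
        this.text = text;
        this.line = line;
        this.column = column;
    }

    @Override
    public int length() { return text.length(); }

    @Override
    public char charAt(int index) { return text.charAt(index); }

    @Override
    public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }

    @Override
    public String toString() { return text; }
}
```

Because java.util.regex.Pattern.matcher accepts any CharSequence, a Token can go straight into the regex engine, as in Pattern.compile("[A-Za-z_]\\w*").matcher(new Token("count_1", 3, 7)).matches(), with no conversion to a string first and no loss of the position information.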
As one final thought... I suspect that mutable/const/invariant string
handling would be much more conveniently implemented with a
MutableCharacterStream interface (as an extended interface of
CharacterStream).
Any function written to accept a CharacterStream would automatically
accept a MutableCharacterStream, thanks to interface polymorphism,
without any casts, conversions, or copies. And various implementors of
the interface could provide buffered implementations operating on
in-memory strings, file data, or network data.
Coding against the CharacterStream interface, library authors wouldn't
need to worry about const-correctness, since the interface wouldn't
provide any mutator methods.
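
A small sketch of the read-only/mutable split, in Java for illustration. All the names are hypothetical: ReadableText and MutableText stand in for the CharacterStream and MutableCharacterStream pair, and random access is used instead of a stream so that mutation is easy to show.

```java
// Hypothetical read-only view: no mutator methods at all.
interface ReadableText {
    int length();
    char charAt(int index);
}

// Hypothetical mutable extension of the read-only view.
interface MutableText extends ReadableText {
    void setCharAt(int index, char c);
}

// A simple in-memory implementation of the mutable interface.
final class ArrayText implements MutableText {
    private final char[] chars;

    ArrayText(String s) { this.chars = s.toCharArray(); }

    @Override
    public int length() { return chars.length; }

    @Override
    public char charAt(int index) { return chars[index]; }

    @Override
    public void setCharAt(int index, char c) { chars[index] = c; }
}

final class TextOps {
    // Accepts the read-only type, so this function has no way to modify
    // the text through its parameter.
    static int count(ReadableText text, char wanted) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == wanted) n++;
        }
        return n;
    }
}
```

An ArrayText can be passed to TextOps.count without a cast or a copy. Note that this gives a read-only view rather than true immutability: the caller who holds the MutableText reference can still change the text, and a downcast inside the callee would defeat the restriction.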
But then again, I haven't used any of the const functionality in D2, so
I can't actually comment on the relative usability of compiler-enforced
immutability versus interface-enforced immutability.
Anyhow, those are some of my thoughts... I think there are a lot of
compelling reasons for de-coupling the specification of string handling
functionality from the implementation of that functionality, primarily
for enabling polymorphic text-processing.
But memoized hashcodes would be cool too :-)
--benji