Why Strings as Classes?

Mon Aug 25 13:27:26 PDT 2008

In another thread (about array append performance) I mentioned that 
Strings ought to be implemented as classes rather than as simple builtin
arrays. Superdan asked why. Here's my response...

I'll start with a few of the softball, easy reasons.

For starters, with strings implemented as character arrays, writing 
library code that accepts and operates on strings is a bit of a pain in 
the neck. To handle char[], wchar[], and dchar[] alike, you always have 
to write templates, and template code is slightly less readable than 
non-template code. You also can't distribute your code as a DLL or a 
shared object, because the template instantiations won't be included 
(unless you create wrapper functions with explicit template 
instantiations, bloating your code size and, more importantly, tripling 
the number of functions in your API).
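
To make the wrapper problem concrete, here's a rough sketch in D2 
syntax. The function names (countSpaces, countSpacesUtf8, and so on) are 
made up for illustration and don't come from Phobos or Tango.

```d
// Template version: one definition covers char[], wchar[], and dchar[],
// but nothing is compiled into the library until it is instantiated.
size_t countSpaces(Char)(const(Char)[] text)
{
    size_t n = 0;
    foreach (Char c; text)   // walk the code units as-is, no decoding
        if (c == ' ')
            ++n;
    return n;
}

// Non-template wrappers that force the three instantiations so they can
// be exported from a DLL or shared object. One logical function has
// become three API entry points.
size_t countSpacesUtf8(const(char)[] text)   { return countSpaces!char(text); }
size_t countSpacesUtf16(const(wchar)[] text) { return countSpaces!wchar(text); }
size_t countSpacesUtf32(const(dchar)[] text) { return countSpaces!dchar(text); }
```

Every string-handling function in a binary library would need the same 
treatment.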

Another good low-hanging argument is that strings are frequently used as 
keys in associative arrays. Every insertion and retrieval in an 
associative array requires a hashcode computation. And since D strings 
are just dumb arrays, they have no way of memoizing their hashcodes. 
We've already observed that D assoc arrays are less performant than even 
Python dicts, so the extra cost of lookup operations is unwelcome.
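
Here's roughly what I mean by memoizing, as a minimal sketch in D2 
syntax. HashedString is a hypothetical class, not anything in the 
standard library, and the FNV-1a hash is just a stand-in for whatever 
hash function the runtime would actually use.

```d
// Hypothetical string class that computes its hash once and reuses it.
// Caching is only safe because the wrapped text can never change.
final class HashedString
{
    private string text;
    private size_t cachedHash;
    private bool hashComputed;

    this(string text) { this.text = text; }

    override size_t toHash() @trusted nothrow
    {
        if (!hashComputed)
        {
            // FNV-1a over the UTF-8 code units (a stand-in hash function).
            uint h = 2166136261u;
            foreach (char c; text)
            {
                h ^= c;
                h *= 16777619u;
            }
            cachedHash = h;
            hashComputed = true;
        }
        return cachedHash;   // later calls skip the loop entirely
    }

    override bool opEquals(Object other)
    {
        auto rhs = cast(HashedString) other;
        return rhs !is null && rhs.text == text;
    }

    override string toString() { return text; }
}
```

An instance used as an associative array key pays for the hash loop on 
the first lookup only. A plain char[] has nowhere to stash that value.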

But much more important than either of those reasons is the lack of 
polymorphism on character arrays. Arrays can't have subclasses, and they 
can't implement interfaces.

A good example of what I'm talking about can be seen in the Phobos and 
Tango regular expression engines. At least the Tango implementation 
matches against all string types (the Phobos one only works with char[] 
strings).

But what if I want to consume a 100 MB logfile, counting all lines that 
match a pattern?

Right now, to use either regex engine, I have to read the entire 
logfile into an enormous array before invoking the regex search function.

Instead, what if there were a CharacterStream interface? And what if all 
the text-handling code in Phobos & Tango was written to consume and 
return instances of that interface?

A regex engine accepting a CharacterStream interface could process text 
from string literals, file input streams, socket input streams, database 
records, etc, etc, etc... without having to pollute the API with a bunch 
of casts, copies, and conversions. And my logfile processing application 
would consume only a tiny fraction of the memory needed by the character 
array implementation.

Most importantly, the contract between the regex engine and its 
consumers would provide a well-defined interface for processing text, 
regardless of the source or representation of that text.
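
A minimal sketch of the idea, in D2 syntax. The shape of the interface 
(a single read method) is my own assumption, and StringStream and 
countLinesStartingWith are hypothetical names. A simple prefix match 
stands in for a real regex engine to keep the example short.

```d
import std.utf : decode;

// Hypothetical interface: hands out one character at a time.
interface CharacterStream
{
    // Stores the next character in c and returns true,
    // or returns false once the input is exhausted.
    bool read(out dchar c);
}

// In-memory implementation backed by an ordinary UTF-8 string.
final class StringStream : CharacterStream
{
    private string text;
    private size_t pos;

    this(string text) { this.text = text; }

    bool read(out dchar c)
    {
        if (pos >= text.length)
            return false;
        c = decode(text, pos);   // decodes one code point, advances pos
        return true;
    }
}

// Counts the lines that begin with prefix. Only a few counters are kept,
// so memory use does not grow with the size of the input.
size_t countLinesStartingWith(CharacterStream input, dstring prefix)
{
    size_t count = 0;
    size_t matched = 0;        // prefix characters matched on this line
    bool failed = false;       // this line has already mismatched
    bool lineHasChars = false; // used for a last line without '\n'
    dchar c;
    while (input.read(c))
    {
        if (c == '\n')
        {
            if (!failed && matched == prefix.length)
                ++count;
            matched = 0;
            failed = false;
            lineHasChars = false;
            continue;
        }
        lineHasChars = true;
        if (!failed && matched < prefix.length)
        {
            if (c == prefix[matched])
                ++matched;
            else
                failed = true;
        }
    }
    if (lineHasChars && !failed && matched == prefix.length)
        ++count;
    return count;
}
```

A file-backed or socket-backed class implementing the same interface 
could be handed to countLinesStartingWith without touching it, which is 
the whole point.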

In a similar vein, I've worked on a lot of parsers over the past few 
years, for domain-specific languages and templating engines, and stuff 
like that. Sometimes it'd be very handy to define a "Token" class that 
behaves exactly like a String, but with some additional behavior. 
Ideally, I'd like to implement that Token class as an implementor of the 
CharacterStream interface, so that it can be passed directly into other 
text-handling functions.

But, in D, with no polymorphic text handling, I can't do that.
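
For illustration, here's the kind of thing I'd like to be able to write, 
sketched in D2 syntax. Everything here is hypothetical: the 
CharacterStream interface and its read method, the Token fields, and the 
countChars helper standing in for some library text-handling function.

```d
// Hypothetical interface, assumed for this sketch.
interface CharacterStream
{
    // Stores the next character in c and returns true,
    // or returns false once the input is exhausted.
    bool read(out dchar c);
}

// A token is text plus parser-specific extras.
final class Token : CharacterStream
{
    enum Kind { identifier, number, punctuation }

    Kind kind;      // additional behavior a plain string can't carry
    size_t line;    // source line where the token was found

    private dstring text;
    private size_t pos;

    this(dstring text, Kind kind, size_t line)
    {
        this.text = text;
        this.kind = kind;
        this.line = line;
    }

    bool read(out dchar c)
    {
        if (pos >= text.length)
            return false;
        c = text[pos++];
        return true;
    }
}

// Stand-in for a library function that knows nothing about tokens.
size_t countChars(CharacterStream input)
{
    size_t n = 0;
    dchar c;
    while (input.read(c))
        ++n;
    return n;
}
```

The Token goes straight into countChars with no cast or copy, and still 
carries its kind and line number. Note that reading consumes the token 
in this sketch; a real design would want some way to rewind.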

As one final thought... I suspect that mutable/const/invariant string 
handling would be much more conveniently implemented with a 
MutableCharacterStream interface (as an extended interface of 
CharacterStream).

Any function written to accept a CharacterStream would automatically 
accept a MutableCharacterStream, thanks to interface polymorphism, 
without any casts, conversions, or copies. And various implementors of 
the interface could provide buffered implementations operating on 
in-memory strings, file data, or network data.

Coding against the CharacterStream interface, library authors wouldn't 
need to worry about const-correctness, since the interface wouldn't 
provide any mutator methods.
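
A rough sketch of that interface hierarchy, in D2 syntax. The method 
names (read, append) and the CharBuffer class are assumptions of mine, 
picked only to show the shape of the idea.

```d
// Hypothetical read-only view: no mutator methods at all.
interface CharacterStream
{
    // Stores the next character in c and returns true,
    // or returns false once the input is exhausted.
    bool read(out dchar c);
}

// Hypothetical extended interface that adds mutation.
interface MutableCharacterStream : CharacterStream
{
    void append(dchar c);
}

// In-memory buffer implementing the mutable interface.
final class CharBuffer : MutableCharacterStream
{
    private dchar[] data;
    private size_t pos;

    void append(dchar c) { data ~= c; }

    bool read(out dchar c)
    {
        if (pos >= data.length)
            return false;
        c = data[pos++];
        return true;
    }
}

// Written against the read-only interface, so it cannot modify its
// input no matter what concrete class is passed in.
size_t countChars(CharacterStream input)
{
    size_t n = 0;
    dchar c;
    while (input.read(c))
        ++n;
    return n;
}
```

A CharBuffer converts implicitly to CharacterStream, so countChars 
accepts it as-is, and inside countChars there is simply no append 
method to call.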

But then again, I haven't used any of the const functionality in D2, so 
I can't actually comment on the relative usability of compiler-enforced 
immutability versus interface-enforced immutability.

Anyhow, those are some of my thoughts... I think there are a lot of 
compelling reasons for de-coupling the specification of string handling 
functionality from the implementation of that functionality, primarily 
for enabling polymorphic text-processing.

But memoized hashcodes would be cool too :-)

--benji