Why Strings as Classes?

Mon Aug 25 16:35:42 PDT 2008

Benji Smith Wrote:

> In another thread (about array append performance) I mentioned that 
> Strings ought to be implemented as classes rather than as simple builtin
> arrays. Superdan asked why. Here's my response...

well then allow me to retort.

> I'll start with a few of the softball, easy reasons.
> 
> For starters, with strings implemented as character arrays, writing 
> library code that accepts and operates on strings is a bit of a pain in 
> the neck, since you always have to write templates and template code is 
> slightly less readable than non-template code. You can't distribute your 
> code as a DLL or a shared object, because the template instantiations 
> won't be included (unless you create wrapper functions with explicit 
> template instantiations, bloating your code size, but more importantly 
> tripling the number of functions in your API).

so u mean with a class the encoding char/wchar/dchar won't be an issue anymore. that would be hidden behind the wraps. cool.

problem is that means there's an indirection cost for every character access. oops. so then apps that decided to use some particular encoding consistently must pay a price for stuff they don't use.

but if u have strings like today it's a no-brainer to define a class that does all that stuff. u can then use that class whenever you feel. it would be madness to put that class in the language definition. at best it's a candidate for the stdlib.
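
here's roughly what i mean, as a sketch assuming a current d2 compiler. Text and its method names are made up for illustration, they aren't in phobos or tango. the only library calls are toUTF8, toUTF16 and toUTF32 from std.utf.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// hypothetical library class, not part of phobos or tango.
// keeps utf-8 inside and converts only when somebody asks.
final class Text
{
    private string data;

    // one constructor per builtin encoding; callers never see which one is stored
    this(const(char)[] s)  { data = s.idup; }
    this(const(wchar)[] s) { data = toUTF8(s); }
    this(const(dchar)[] s) { data = toUTF8(s); }

    string  utf8()  { return data; }
    wstring utf16() { return toUTF16(data); }
    dstring utf32() { return toUTF32(data); }
}
```

apps that stick to char[] everywhere never touch Text, so they never pay for the extra indirection.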

so that low-hangin' argument of yers ain't that low-hangin' after all. unless u call a hanged deadman low-hangin'.

> Another good low-hanging argument is that strings are frequently used as 
> keys in associative arrays. Every insertion and retrieval in an 
> associative array requires a hashcode computation. And since D strings 
> are just dumb arrays, they have no way of memoizing their hashcodes. 
> We've already observed that D assoc arrays are less performant than even 
> Python maps, so the extra cost of lookup operations is unwelcome.

again you want to build larger components from smaller components. you can build a string with memoized hashcode from a string without memoized hashcode. but you can't build a string without memoized hashcode from a string with memoized hashcode. but wait there's more. the extra field is paid for by every string, whether or not it ever goes near a hashtable. so what numbers do you have to back up your assertion that it's worth making everything pay that cost just to speed up hashtables?
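
a sketch of the build-up direction, assuming a current d2 compiler. HashedString is a made-up name, and hashOf is the builtin from object.d. the hash is computed once at construction because toHash has to be const for assoc array keys, so it can't fill in a cache lazily.

```d
// hypothetical wrapper: a plain string plus its hash, computed once
struct HashedString
{
    string text;
    private size_t hash;

    this(string s)
    {
        text = s;
        hash = hashOf(s);   // the only place the characters get hashed
    }

    // assoc arrays call these two for every insert and lookup
    size_t toHash() const nothrow @safe { return hash; }

    bool opEquals(ref const HashedString rhs) const
    {
        // cheap hash comparison first, full comparison only if hashes agree
        return hash == rhs.hash && text == rhs.text;
    }
}

// the extra word is there whether or not the value ever goes into a hashtable
static assert(HashedString.sizeof == string.sizeof + size_t.sizeof);
```

code that wants it uses int[HashedString]. everything else keeps plain strings and stays one word smaller.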

> But much more important than either of those reasons is the lack of 
> polymorphism on character arrays. Arrays can't have subclasses, and they 
> can't implement interfaces.

that's why you can always define a class that does all those good things.

by the same arg why isn't int a class? the point is you can always create class Int that does what an int does, slower but more flexible. if all you had was class Int you'd be in slowland.

> A good example of what I'm talking about can be seen in the Phobos and 
> Tango regular expression engines. At least the Tango implementation 
> matches against all string types (the Phobos one only works with char[] 
> strings).
> 
> But what if I want to consume a 100 MB logfile, counting all lines that 
> match a pattern?
>
> Right now, to use either regex engine, I have to read the entire 
> logfile into an enormous array before invoking the regex search function.
> 
> Instead, what if there was a CharacterStream interface? And what if all 
> the text-handling code in Phobos & Tango was written to consume and 
> return instances of that interface?

what exactly is the problem there aside from a library issue?

> A regex engine accepting a CharacterStream interface could process text 
> from string literals, file input streams, socket input streams, database 
> records, etc, etc, etc... without having to pollute the API with a bunch 
> of casts, copies, and conversions. And my logfile processing application 
> would consume only a tiny fraction of the memory needed by the character 
> array implementation.

library problem. or maybe you want to build character stream into the language too.
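
for the record, here's how small that library piece can be. a sketch only: CharacterStream, StringStream and countLinesContaining are hypothetical names, and plain substring search stands in for a real regex engine. a file or socket backed stream would implement the same two methods.

```d
import std.algorithm.searching : canFind;
import std.range.primitives : front, popFront;

// hypothetical library interface; nothing here needs language support
interface CharacterStream
{
    bool empty();
    dchar next();   // consume and return one character
}

// in-memory implementation over a builtin string
final class StringStream : CharacterStream
{
    private string rest;
    this(string s) { rest = s; }

    bool empty() { return rest.length == 0; }

    dchar next()
    {
        dchar c = rest.front;   // decodes one utf-8 sequence
        rest.popFront();
        return c;
    }
}

// holds one line at a time, never the whole input
size_t countLinesContaining(CharacterStream input, dstring needle)
{
    size_t hits;
    dchar[] line;
    while (!input.empty())
    {
        dchar c = input.next();
        if (c != '\n') { line ~= c; continue; }
        if (canFind(line, needle)) ++hits;
        line.length = 0;
    }
    // last line may not end with a newline
    if (line.length && canFind(line, needle)) ++hits;
    return hits;
}
```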

> Most importantly, the contract between the regex engine and its 
> consumers would provide a well-defined interface for processing text, 
> regardless of the source or representation of that text.
> 
> In a similar vein, I've worked on a lot of parsers over the past few 
> years, for domain specific languages and templating engines, and stuff 
> like that. Sometimes it'd be very handy to define a "Token" class that 
> behaves exactly like a String, but with some additional behavior. 
> Ideally, I'd like to implement that Token class as an implementor of the 
> CharacterStream interface, so that it can be passed directly into other 
> text-handling functions.
> 
> But, in D, with no polymorphic text handling, I can't do that.

of course you can. you just don't want to for the sake of building a fragile argument.
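
case in point, assuming a d2 compiler that has alias this (check that yours does). Token and isIdentifier are made-up names. the struct carries extra data and still goes straight into a function that only knows about string.

```d
import std.ascii : isAlpha, isAlphaNum;

// hypothetical token type: a string plus some extra baggage
struct Token
{
    string text;      // the matched characters
    size_t line;      // the extra part: where the token came from
    alias text this;  // a Token converts to its text wherever a string is wanted
}

// ordinary text-handling function that knows nothing about Token
bool isIdentifier(string s)
{
    if (s.length == 0 || !(isAlpha(s[0]) || s[0] == '_'))
        return false;
    foreach (char c; s[1 .. $])
        if (!(isAlphaNum(c) || c == '_'))
            return false;
    return true;
}
```

no cast, no copy: isIdentifier(tok) compiles as is, and tok.length forwards to the underlying string.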

> As one final thought... I suspect that mutable/const/invariant string 
> handling would be much more conveniently implemented with a 
> MutableCharacterStream interface (as an extended interface of 
> CharacterStream).
> 
> Any function written to accept a CharacterStream would automatically 
> accept a MutableCharacterStream, thanks to interface polymorphism, 
> without any casts, conversions, or copies. And various implementors of 
> the interface could provide buffered implementations operating on 
> in-memory strings, file data, or network data.
> 
> Coding against the CharacterStream interface, library authors wouldn't 
> need to worry about const-correctness, since the interface wouldn't 
> provide any mutator methods.

sounds great. so then go ahead and make the characterstream thingie. the language gives u everything u need to make it clean and fast.
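
one possible shape for it, as a sketch. the interface names come from your post; Buffer, at, set and countSpaces are made up for illustration and aren't from any real library.

```d
// read-only view: no mutator methods at all
interface CharacterStream
{
    size_t length();
    char at(size_t i);
}

// the mutable flavor just extends the read-only one
interface MutableCharacterStream : CharacterStream
{
    void set(size_t i, char c);
}

// hypothetical in-memory implementation over a builtin char[]
final class Buffer : MutableCharacterStream
{
    private char[] data;
    this(const(char)[] s) { data = s.dup; }

    size_t length() { return data.length; }
    char at(size_t i) { return data[i]; }
    void set(size_t i, char c) { data[i] = c; }
}

// sees only the read-only half, so it has no way to change its argument
size_t countSpaces(CharacterStream s)
{
    size_t n;
    foreach (i; 0 .. s.length())
        if (s.at(i) == ' ') ++n;
    return n;
}
```

a Buffer passes to countSpaces without a cast because every MutableCharacterStream is a CharacterStream.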

> But then again, I haven't used any of the const functionality in D2, so 
> I can't actually comment on relative usability of compiler-enforced 
> immutability versus interface-enforced immutability.
> 
> Anyhow, those are some of my thoughts... I think there are a lot of 
> compelling reasons for de-coupling the specification of string handling 
> functionality from the implementation of that functionality, primarily 
> for enabling polymorphic text-processing.
> 
> But memoized hashcodes would be cool too :-)

sorry dood each and every argument talks straight against your case. if i had any doubts, you just convinced me that a builtin string class would be a mistake.