Why Strings as Classes?

Benji Smith dlanguage at benjismith.net
Tue Aug 26 02:53:39 PDT 2008


superdan wrote:
 >> For starters, with strings implemented as character arrays, writing
 >> library code that accepts and operates on strings is a bit of a pain in
 >> the neck, since you always have to write templates and template code is
 >> slightly less readable than non-template code. You can't distribute your
 >> code as a DLL or a shared object, because the template instantiations
 >> won't be included (unless you create wrapper functions with explicit
 >> template instantiations, bloating your code size, but more importantly
 >> tripling the number of functions in your API).
 >
 > so u mean with a class the encoding char/wchar/dchar won't be an
 > issue anymore. that would be hidden behind the wraps. cool.
 >
 > problem is that means there's an indirection cost for every character
 > access. oops. so then apps that decided to use some particular encoding
 > consistently must pay a price for stuff they don't use.

So, I was thinking about the actual costs involved with the String class 
and CharSequence interface design that I'd like to see (and that exists 
in languages like Java and C#).

There's the cost of the class wrapper itself, the cost of internally 
representing and converting between encodings, and the cost of routing 
all method calls through an interface vtable. Characters, if always 
represented using two bytes, would consume twice the memory. And 
returning characters from method calls has got to be slower than 
accessing them directly from arrays. Right?

So I wrote some tests, in Java and in D/Tango.

The source code files are attached. Both of the tests perform a common 
set of string operations (searching, splitting, concatenating, and 
character-iterating). I tried to make the functionality as identical as 
possible, though I wasn't sure which technique to use for splitting text 
in Tango, so I used both the "Util.split" and "Util.delimit" functions.
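For readers without the attachments, here is a rough Java sketch of the kinds of operations listed above. It is not the attached StringTest.java; the class name, method names, and sample text are made up for illustration, and the real test may do each step differently.

```java
final class StringOps {
    /** Counts non-overlapping occurrences of needle using String.indexOf. */
    static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int at = text.indexOf(needle);
        while (at >= 0) {
            count++;
            at = text.indexOf(needle, at + needle.length());
        }
        return count;
    }

    /** Splits on runs of whitespace; each token is a newly allocated String. */
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    /** Concatenates with a StringBuilder at its default capacity (no pre-allocation). */
    static String concatenate(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    /** Walks the characters one at a time through the CharSequence interface. */
    static int countLetters(CharSequence text) {
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        String[] words = splitOnWhitespace(text);
        System.out.println(countOccurrences(text, "be")); // 2
        System.out.println(words.length);                 // 6
        System.out.println(concatenate(words));           // tobeornottobe
        System.out.println(countLetters(text));           // 13
    }
}
```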

I ran both tests using a 5MB text file, "The Complete Works of William 
Shakespeare", from the Project Gutenberg website:

http://www.gutenberg.org/dirs/etext94/shaks12.txt

You can grab it for yourself, or you can just run the code against your 
favorite large text file.

I compiled and ran the Java code with the 1.6.0_06 JDK, using the 
"-server" flag. The D code was compiled with DMD 1.034 and Tango 0.99.7, 
using the "-O -release -inline" flags.

My test machine is an AMD Turion 64 X2 dual-core laptop, with 2GB of RAM 
and running WinXP SP3.

I ran the tests eight times each, using fine-resolution timers. These 
are the median results:
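A minimal sketch of that kind of measurement in Java, assuming System.nanoTime() as the fine-resolution timer (the attached test may use something else). The class and method names are hypothetical, and the timed task here is only a stand-in.

```java
import java.util.Arrays;

final class MedianTimer {
    /** Runs the task the given number of times and returns the median elapsed seconds. */
    static double medianSeconds(Runnable task, int runs) {
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1e9;
        }
        return median(samples);
    }

    /** Median of a non-empty array; for an even count, the mean of the two middle values. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Stand-in workload: build a string of 100,000 characters.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append('x');
            }
            if (sb.length() != 100_000) {
                throw new IllegalStateException("unexpected length");
            }
        };
        System.out.printf("median: %.5f seconds%n", medianSeconds(task, 8));
    }
}
```

With an even number of runs such as eight, the median is the average of the fourth and fifth fastest times, so one unusually slow run (GC pause, JIT warm-up) does not skew the figure.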

LOADING THE FILE INTO A STRING:   D/Tango wins, by 428%
    D/Tango: 0.02960 seconds
    Java:    0.12675 seconds

ITERATING OVER CHARS IN A STRING:   Java wins, by 280%
    D/Tango:  0.10093 seconds
    Java:     0.03599 seconds

SEARCHING FOR A SUBSTRING:   D/Tango wins, by 218%
    D/Tango:  0.02251 seconds
    Java:     0.04915 seconds

SEARCH & REPLACE INTO A NEW STRING:   D/Tango wins, by 226%
    D/Tango:  0.17685 seconds
    Java:     0.39996 seconds

SPLIT A STRING ON WHITESPACE:
       Java wins, by 681% (against tango.text.Util.delimit())
       Java wins, by 313% (against tango.text.Util.split())
    D/Tango (delimit): 8.28195 seconds
    D/Tango (split):   3.80465 seconds
    Java (split):      1.21477 seconds

CONCATENATING STRINGS:   Java wins, by 884%
    D/Tango (array concat, no pre-alloc):  4.07929 seconds
    Java (StringBuilder, no pre-alloc):    0.46150 seconds

SORT STRINGS (CASE-INSENSITIVE):   D/Tango wins, by 226%
    D/Tango:  1.62227 seconds
    Java:     3.66389 seconds
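A note on reading the percentages: they appear to be the slower median time divided by the faster one, expressed as a percentage, so "wins, by 428%" means roughly 4.28 times as fast. The sketch below recomputes that ratio from the timings listed above; the helper name is made up, and rounding to the nearest whole percent is an assumption (it agrees with the reported figures to within one percentage point).

```java
final class SpeedRatio {
    /** Slower time over faster time, as a percentage rounded to the nearest integer. */
    static long ratioPercent(double slowerSeconds, double fasterSeconds) {
        return Math.round(slowerSeconds / fasterSeconds * 100.0);
    }

    public static void main(String[] args) {
        // Timings copied from the results above: (slower, faster).
        System.out.println("load:      " + ratioPercent(0.12675, 0.02960));
        System.out.println("iterate:   " + ratioPercent(0.10093, 0.03599));
        System.out.println("search:    " + ratioPercent(0.04915, 0.02251));
        System.out.println("replace:   " + ratioPercent(0.39996, 0.17685));
        System.out.println("delimit:   " + ratioPercent(8.28195, 1.21477));
        System.out.println("split:     " + ratioPercent(3.80465, 1.21477));
        System.out.println("concat:    " + ratioPercent(4.07929, 0.46150));
        System.out.println("sort:      " + ratioPercent(3.66389, 1.62227));
    }
}
```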

It looks like D mostly falls down when it has to allocate a lot of 
memory, even if it's just allocating slices. The D performance for 
string splitting really surprised me.

I was interested to see, though, that Java was so much faster at 
iterating through the characters in a string, since I used the charAt(i) 
method of the CharSequence interface, rather than directly iterating 
through a char[] array, or even calling the charAt method on the String 
instance.

And yet, character iteration is almost 3 times as fast as in D.
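To make the comparison concrete, here is a small sketch of the two access styles in Java. It is not taken from the attached StringTest.java; the class and method names are hypothetical. One plausible explanation for the result, not verified here, is that the JIT compiler can inline the charAt call when it only ever sees one concrete CharSequence type at the call site.

```java
final class CharIteration {
    /** Reads every character through the CharSequence interface. */
    static long sumViaInterface(CharSequence text) {
        long sum = 0;
        for (int i = 0; i < text.length(); i++) {
            sum += text.charAt(i);
        }
        return sum;
    }

    /** Reads every character directly from a char[] array. */
    static long sumViaArray(char[] chars) {
        long sum = 0;
        for (char c : chars) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        String text = "To be, or not to be";
        // Both styles visit the same UTF-16 code units, so the sums agree.
        System.out.println(sumViaInterface(text) == sumViaArray(text.toCharArray())); // true
    }
}
```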

Down with premature optimization! Design the best interfaces possible, 
to enable the most pleasant and flexible programming idioms. The 
performance problems can be solved. :-P

--benji
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: StringTest.java
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20080826/54571045/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stringtest.d
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20080826/54571045/attachment-0001.ksh>

