Why Strings as Classes?
Benji Smith
dlanguage at benjismith.net
Tue Aug 26 02:53:39 PDT 2008
superdan wrote:
>> For starters, with strings implemented as character arrays, writing
>> library code that accepts and operates on strings is a bit of a pain in
>> the neck, since you always have to write templates and template code is
>> slightly less readable than non-template code. You can't distribute your
>> code as a DLL or a shared object, because the template instantiations
>> won't be included (unless you create wrapper functions with explicit
>> template instantiations, bloating your code size, but more importantly
>> tripling the number of functions in your API).
>
> so u mean with a class the encoding char/wchar/dchar won't be an
> issue anymore. that would be hidden behind the wraps. cool.
>
> problem is that means there's an indirection cost for every character
> access. oops. so then apps that decided to use some particular encoding
> consistently must pay a price for stuff they don't use.
So, I was thinking about the actual costs involved with the String class
and CharSequence interface design that I'd like to see (and that exists
in languages like Java and C#).
There's the cost of the class wrapper itself, the cost of internally
representing and converting between encodings, and the cost of routing
all method calls through an interface vtable. Characters, if always
represented using two bytes, would consume twice the memory. And
returning characters from method calls has got to be slower than
accessing them directly from arrays. Right?
So I wrote some tests, in Java and in D/Tango.
The source code files are attached. Both of the tests perform a common
set of string operations (searching, splitting, concatenating, and
character-iterating). I tried to make the functionality as identical as
possible, though I wasn't sure which technique to use for splitting text
in Tango, so I used both the "Util.split" and "Util.delimit" functions.
I ran both tests using a 5MB text file, "The Complete Works of William
Shakespeare", from the Project Gutenberg website:
http://www.gutenberg.org/dirs/etext94/shaks12.txt
You can grab it for yourself, or you can just run the code against your
favorite large text file.
I compiled and ran the Java code with the 1.6.0_06 JDK, using the
"-server" flag. The D code was compiled with DMD 1.034 and Tango 0.99.7,
using the "-O -release -inline" flags.
My test machine is an AMD Turion 64 X2 dual-core laptop, with 2GB of RAM
and running WinXP SP3.
I ran the tests eight times each, using fine-resolution timers. These
are the median results:
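For anyone who wants to repeat the measurements, here is a rough sketch of a "run it eight times and take the median" harness. It is not the attached StringTest.java. The class name MedianTimer, both method names, and the stand-in workload are made up for illustration, and averaging the two middle samples for an even run count is an assumption.

```java
import java.util.Arrays;

class MedianTimer {
    // Median of the samples. For an even count (such as eight runs) this
    // averages the two middle values; that choice is an assumption.
    static double median(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Runs the task `runs` times and returns the median elapsed time in seconds.
    static double medianSeconds(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();   // fine-resolution timer
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return median(seconds);
    }

    public static void main(String[] args) {
        // Stand-in workload; the real tests ran against the Shakespeare file.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) sb.append("to be or not to be ");
        final String text = sb.toString();

        double t = medianSeconds(new Runnable() {
            public void run() {
                text.indexOf("that is the question");
            }
        }, 8);
        System.out.printf("median of 8 runs: %.5f seconds%n", t);
    }
}
```

The median is less sensitive than the mean to a slow first run, which matters on the JVM because the early iterations include JIT warm-up.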
LOADING THE FILE INTO A STRING: D/Tango wins, by 428%
D/Tango: 0.02960 seconds
Java: 0.12675 seconds
ITERATING OVER CHARS IN A STRING: Java wins, by 280%
D/Tango: 0.10093 seconds
Java: 0.03599 seconds
SEARCHING FOR A SUBSTRING: D/Tango wins, by 218%
D/Tango: 0.02251 seconds
Java: 0.04915 seconds
SEARCH & REPLACE INTO A NEW STRING: D/Tango wins, by 226%
D/Tango: 0.17685 seconds
Java: 0.39996 seconds
SPLIT A STRING ON WHITESPACE:
Java wins, by 681% (against tango.text.Util.delimit())
Java wins, by 313% (against tango.text.Util.split())
D/Tango (delimit): 8.28195 seconds
D/Tango (split): 3.80465 seconds
Java (split): 1.21477 seconds
CONCATENATING STRINGS: Java wins, by 884%
D/Tango (array concat, no pre-alloc): 4.07929 seconds
Java (StringBuilder, no pre-alloc): 0.46150 seconds
SORT STRINGS (CASE-INSENSITIVE): D/Tango wins, by 226%
D/Tango: 1.62227 seconds
Java: 3.66389 seconds
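A note on reading the percentages: they appear to be the slower time divided by the faster time, so "wins, by 428%" means roughly 4.28 times as fast, not 428% on top of the baseline. The sketch below recomputes a few of them from the median timings listed above. The class and method names are made up for illustration, and the results agree with the posted figures to within a percentage point.

```java
class WinMargin {
    // Slower time divided by faster time, expressed as a percentage.
    static double ratioPercent(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds * 100.0;
    }

    public static void main(String[] args) {
        // Median timings copied from the results above.
        System.out.printf("loading:   %.1f%%%n", ratioPercent(0.12675, 0.02960));
        System.out.printf("iterating: %.1f%%%n", ratioPercent(0.10093, 0.03599));
        System.out.printf("concat:    %.1f%%%n", ratioPercent(4.07929, 0.46150));
    }
}
```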
It looks like D mostly falls down when it has to allocate a lot of
memory, even if it's just allocating slices. The D performance for
string splitting really surprised me.
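To make the "no pre-alloc" label concrete, the Java side of the concatenation test presumably does something like the first method below, letting StringBuilder grow its buffer as needed. This is only a sketch, not the attached code. ConcatSketch and both method names are hypothetical, and the pre-sized variant is included just for contrast.

```java
class ConcatSketch {
    // Appends every piece to a StringBuilder created with the default
    // capacity, so the internal buffer is regrown as the text gets longer.
    static String joinNoPrealloc(String[] pieces) {
        StringBuilder sb = new StringBuilder();
        for (String p : pieces) sb.append(p);
        return sb.toString();
    }

    // Same result, but the buffer is sized once up front.
    static String joinPrealloc(String[] pieces) {
        int total = 0;
        for (String p : pieces) total += p.length();
        StringBuilder sb = new StringBuilder(total);
        for (String p : pieces) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] pieces = { "Down ", "with ", "premature ", "optimization!" };
        System.out.println(joinNoPrealloc(pieces));
        System.out.println(joinPrealloc(pieces));
    }
}
```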
I was interested to see, though, that Java was so much faster at
iterating through the characters in a string, since I used the charAt(i)
method of the CharSequence interface, rather than directly iterating
through a char[] array, or even calling the charAt method on the String
instance.
And yet, character iteration is almost 3 times as fast as in D.
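For clarity, these are the two styles of loop being compared. This is a minimal sketch and not the attached StringTest.java. The class name, the method names, and the space-counting body are made up so the loops have something to do.

```java
class CharIteration {
    // Iterates through the CharSequence interface, one charAt(i) call per character.
    static int countSpacesViaInterface(CharSequence text) {
        int count = 0;
        for (int i = 0, n = text.length(); i < n; i++) {
            if (text.charAt(i) == ' ') count++;
        }
        return count;
    }

    // Iterates directly over a raw char[] with no method call per character.
    static int countSpacesViaArray(char[] chars) {
        int count = 0;
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] == ' ') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Now is the winter of our discontent";
        System.out.println(countSpacesViaInterface(text));
        System.out.println(countSpacesViaArray(text.toCharArray()));
    }
}
```

With the "-server" JIT, the interface call in the first loop can often be inlined, which is one plausible reason the method-call version does not pay the cost you would expect.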
Down with premature optimization! Design the best interfaces possible,
to enable the most pleasant and flexible programming idioms. The
performance problems can be solved. :-P
--benji
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: StringTest.java
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20080826/54571045/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stringtest.d
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20080826/54571045/attachment-0001.ksh>