Top 5
Denis Koroskin
2korden at gmail.com
Sat Oct 11 07:17:23 PDT 2008
On Sat, 11 Oct 2008 18:00:38 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
> Sergey Gromov wrote:
>> Sat, 11 Oct 2008 12:16:43 +0200,
>> Sascha Katzner wrote:
>>> Benji Smith wrote:
>>>> Actually, when it comes to string processing, D is decidedly *not* a
>>>> "performance language".
>>>>
>>>> Compared to...say...Java (which gets a bum rap around here for being
>>>> slow), D is nothing special when it comes to string processing speed.
>>>>
>>>> I've attached a couple of benchmarks, implemented in both Java and D
>>>> (the "shakespeare.txt" file I'm benchmarking against is from the
>>>> Gutenburg project. It's about 5 MB, and you can grab it from here:
>>>> http://www.gutenberg.org/dirs/etext94/shaks12.txt )
>>>>
>>>> In some of those benchmarks, D is slightly faster. In some of them,
>>>> Java is a lot faster. Overall, on my machine, the D code runs in
>>>> about 12.5 seconds, and the Java code runs in about 2.5 seconds.
>>>>
>>>> Keep in mind, all java characters are two-bytes wide. And you can't
>>>> access a character directly. You have to retrieve it from the String
>>>> object, using the charAt() method. And splitting a string creates a
>>>> new object for every fragment.
>>>>
>>>> I admire the goal in D to be a performance language, but it drives me
>>>> crazy when people use performance as justification for an inferior
>>>> design, when other languages that use the superior design also
>>>> accomplish superior performance.
>>> I think your benchmark is not very meaningful. Without going into
>>> implementation details of Tango (because I don't use Tango) here are
>>> some notes:
>>>
>>> - The D version uses UTF8 strings whereas the Java version uses
>>> "wanna-be" UTF16 (Java has a lot of problems with surrogates). This
>>> means you are comparing apples with pears (D has to *parse* an UTF8
>>> string and Java simply uses an wchar array without proper surrogate
>>> handling in *many* cases).
>> This is the whole point. The benchmark is valid because it performs
>> the same *task*, and the task is somewhat close to real world. It
>> measures *time*, which is universal. The compared languages use
>> different approaches and techniques to achieve the goal, that's why
>> benchmark is useful. It allows to justify usefulness of these
>> languages for a particular class of tasks.
>>
>>> - At least in runCharIterateTest() you also convert the D UTF8 string
>>> also additionally into an UTF32 string, in the Java version you did
>>> not do this.
>> Same as above. If they were using the same approach there wouldn't be
>> much to benchmark. Why don't you mention, for instance, that Java is a
>> virtual machine?
>>
>>> - The StringBuilder in the Java version is *much* faster because it
>>> doesn't have to allocate a new memory block in each step. You can use
>>> a similar class in D too, without the need of a special string
>>> class/object.
>> I agree here. Both word tango.text.Util.split and runConcatenateTest
>> use default array appending which is currently dead slow. Benji, to
>> actually compare the speed of string operations you better use one of
>> array builders discussed in this group.
>
> If anyone wants to try it, I'm pasting the draft version from std.array
> below.
>
> Andrei
>
> //------------------------------------------------------------------------------
>
> struct Appender(A : T[], T)
> {
> private T[] * pArray;
> private size_t _capacity;
>
> this(T[] * p)
> {
> pArray = p;
> if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
> _capacity = .capacity(pArray.ptr) / T.sizeof;
> }
>
> this(this)
> {
> enforce(pArray);
> }
>
> T[] data()
> {
> return pArray ? *pArray : null;
> }
>
> size_t capacity() const { return pArray ? pArray.length : 0; }
>
> void write(T item)
> {
> if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
> if (pArray.length < _capacity)
> {
> // Should do in-place construction here
> pArray.ptr[pArray.length] = item;
> *pArray = pArray.ptr[0 .. pArray.length + 1];
> }
> else
> {
> // Time to reallocate, do it and cache capacity
> *pArray ~= item;
> _capacity = .capacity(pArray.ptr) / T.sizeof;
> }
> }
>
> static if (is(const(T) : T))
> {
> alias const(T) AcceptedElementType;
> }
> else
> {
> alias T AcceptedElementType;
> }
>
> void write(AcceptedElementType[] items)
> {
> for (; !items.empty(); items.next()) {
> write(items.head());
> }
> }
>
> static if (is(const(T) == const(char))) {
> void write(in wchar wc) { assert(false); }
> void write(in wchar[] wcs)
> {
> encode!(T)(wcs, *this);
> }
> void write(in dchar dc) { assert(false); }
> void write(in dchar[] dcs)
> {
> encode!(T)(dcs, *this);
> }
> }
>
> void clear()
> {
> if (!pArray) return;
> pArray.length = 0;
> _capacity = .capacity(pArray.ptr) / T.sizeof;
> }
> }
>
> auto appender(T)(T[] * t)
> {
> Appender!(T[]) r = Appender!(T[])(t);
> return r;
> }
>
> unittest
> {
> auto arr = new char[0];
> auto app = appender(&arr);
> string b = "abcdefg";
> foreach (char c; b) app.write(c);
> assert(app.data == "abcdefg");
> }
Two notes:
1) I thought Appender would have an 'append' method as well as opCatAssign.
2) Shouldn't the following member be called 'size'/'length' instead?
size_t capacity() const { return pArray ? pArray.length : 0; }
'capacity' would look like size_t capacity() const { return _capacity; }
More information about the Digitalmars-d
mailing list