Sat Oct 11 07:05:26 PDT 2008

Sergey Gromov wrote:
> Sat, 11 Oct 2008 12:16:43 +0200,
> Sascha Katzner wrote:
>> Benji Smith wrote:
>>> Actually, when it comes to string processing, D is decidedly *not* a 
>>> "performance language".
>>>
>>> Compared to...say...Java (which gets a bum rap around here for being 
>>> slow), D is nothing special when it comes to string processing speed.
>>>
>>> I've attached a couple of benchmarks, implemented in both Java and D 
>>> (the "shakespeare.txt" file I'm benchmarking against is from the 
>>> Gutenberg project. It's about 5 MB, and you can grab it from here: 
>>> http://www.gutenberg.org/dirs/etext94/shaks12.txt )
>>>
>>> In some of those benchmarks, D is slightly faster. In some of them, Java 
>>> is a lot faster. Overall, on my machine, the D code runs in about 12.5 
>>> seconds, and the Java code runs in about 2.5 seconds.
>>>
>>> Keep in mind, all Java characters are two bytes wide. And you can't 
>>> access a character directly. You have to retrieve it from the String 
>>> object, using the charAt() method. And splitting a string creates a new 
>>> object for every fragment.
>>>
>>> I admire the goal in D to be a performance language, but it drives me 
>>> crazy when people use performance as justification for an inferior 
>>> design, when other languages that use the superior design also 
>>> accomplish superior performance.
>> I think your benchmark is not very meaningful. Without going into 
>> implementation details of Tango (because I don't use Tango) here are 
>> some notes:
>>
>> - The D version uses UTF8 strings whereas the Java version uses 
>> "wanna-be" UTF16 (Java has a lot of problems with surrogates). This 
>> means you are comparing apples with pears (D has to *parse* a UTF8 
>> string and Java simply uses a wchar array without proper surrogate 
>> handling in *many* cases).
> 
> This is the whole point.  The benchmark is valid because it performs the 
> same *task*, and the task is somewhat close to real world.  It measures 
> *time*, which is universal.  The compared languages use different 
> approaches and techniques to achieve the goal, which is why the benchmark 
> is useful.  It lets one judge how useful these languages are for a 
> particular class of tasks.
> 
>> - At least in runCharIterateTest() you additionally convert the D UTF8 
>> string into a UTF32 string; in the Java version you did not do this.
> 
> Same as above.  If they were using the same approach there wouldn't be 
> much to benchmark.  Why don't you mention, for instance, that Java is a 
> virtual machine?
> 
>> - The StringBuilder in the Java version is *much* faster because it 
>> doesn't have to allocate a new memory block in each step. You can use a 
>> similar class in D too, without the need of a special string class/object.
> 
> I agree here.  Both tango.text.Util.split and runConcatenateTest use 
> default array appending, which is currently dead slow.  Benji, to 
> actually compare the speed of string operations you'd better use one of 
> the array builders discussed in this group.

If anyone wants to try it, I'm pasting the draft version of Appender 
from std.array below.

Andrei

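// Draft: appends to an existing array through a pointer, caching the
// capacity so that write() queries it only when it reallocates.
struct Appender(A : T[], T)
{
     private T[] * pArray;      // the array being appended to
     private size_t _capacity;  // cached capacity, in elements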

     this(T[] * p)
     {
         pArray = p;
         if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
         _capacity = .capacity(pArray.ptr) / T.sizeof;
     }

     T[] data()
     {
         return pArray ? *pArray : null;
     }

     size_t capacity() const { return _capacity; }

     void write(T item)
     {
         if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
         if (pArray.length < _capacity)
         {
             // Should do in-place construction here
             pArray.ptr[pArray.length] = item;
             *pArray = pArray.ptr[0 .. pArray.length + 1];
         }
         else
         {
             // Time to reallocate, do it and cache capacity
             *pArray ~= item;
             _capacity = .capacity(pArray.ptr) / T.sizeof;
         }
     }

     static if (is(const(T) : T))
     {
         alias const(T) AcceptedElementType;
     }
     else
     {
         alias T AcceptedElementType;
     }

     void write(AcceptedElementType[] items)
     {
         for (; !items.empty(); items.next()) {
             write(items.head());
         }
     }

     static if (is(const(T) == const(char))) {
         void write(in wchar wc) { assert(false); }
         void write(in wchar[] wcs)
         {
             encode!(T)(wcs, *this);
         }
         void write(in dchar dc) { assert(false); }
         void write(in dchar[] dcs)
         {
             encode!(T)(dcs, *this);
         }
     }

     void clear()
     {
         if (!pArray) return;
         pArray.length = 0;
         _capacity = .capacity(pArray.ptr) / T.sizeof;
     }
}

auto appender(T)(T[] * t)
{
     Appender!(T[]) r = Appender!(T[])(t);
     return r;
}
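
For anyone trying this today, here is a minimal sketch that uses the Appender shipped in current Phobos (std.array.appender) rather than the draft above. The released version differs from the draft: it owns its buffer instead of taking a T[]*, and it appends with put rather than write. The helper names joinWords and joinWordsConcat are made up for illustration; the second one uses plain ~= so the two approaches can be compared on the same input.

```d
import std.array : appender;

// Hypothetical helper: joins words with single spaces through an Appender.
string joinWords(const(string)[] words)
{
    auto app = appender!string();
    foreach (i, w; words)
    {
        if (i > 0) app.put(' ');   // put accepts a single element...
        app.put(w);                // ...or a whole slice of elements
    }
    return app.data;
}

// Same result using the built-in array append operator.
string joinWordsConcat(const(string)[] words)
{
    string result;
    foreach (i, w; words)
    {
        if (i > 0) result ~= ' ';
        result ~= w;
    }
    return result;
}

void main()
{
    auto words = ["to", "be", "or", "not"];
    assert(joinWords(words) == "to be or not");
    assert(joinWords(words) == joinWordsConcat(words));
}
```

Timing the two helpers on a large word list (for example the words of shaks12.txt) would be one way to check whether the appender helps on a given compiler and runtime.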