Top 5

Sat Oct 11 07:17:23 PDT 2008

On Sat, 11 Oct 2008 18:00:38 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> Sergey Gromov wrote:
>> Sat, 11 Oct 2008 12:16:43 +0200,
>> Sascha Katzner wrote:
>>> Benji Smith wrote:
>>>> Actually, when it comes to string processing, D is decidedly *not* a  
>>>> "performance language".
>>>>
>>>> Compared to...say...Java (which gets a bum rap around here for being  
>>>> slow), D is nothing special when it comes to string processing speed.
>>>>
>>>> I've attached a couple of benchmarks, implemented in both Java and D  
>>>> (the "shakespeare.txt" file I'm benchmarking against is from the  
>>>> Gutenburg project. It's about 5 MB, and you can grab it from here:  
>>>> http://www.gutenberg.org/dirs/etext94/shaks12.txt )
>>>>
>>>> In some of those benchmarks, D is slightly faster. In some of them,  
>>>> Java is a lot faster. Overall, on my machine, the D code runs in  
>>>> about 12.5 seconds, and the Java code runs in about 2.5 seconds.
>>>>
>>>> Keep in mind, all java characters are two-bytes wide. And you can't  
>>>> access a character directly. You have to retrieve it from the String  
>>>> object, using the charAt() method. And splitting a string creates a  
>>>> new object for every fragment.
>>>>
>>>> I admire the goal in D to be a performance language, but it drives me  
>>>> crazy when people use performance as justification for an inferior  
>>>> design, when other languages that use the superior design also  
>>>> accomplish superior performance.
>>> I think your benchmark is not very meaningful. Without going into  
>>> implementation details of Tango (because I don't use Tango) here are  
>>> some notes:
>>>
>>> - The D version uses UTF8 strings whereas the Java version uses  
>>> "wanna-be" UTF16 (Java has a lot of problems with surrogates). This  
>>> means you are comparing apples with pears (D has to *parse* an UTF8  
>>> string and Java simply uses an wchar array without proper surrogate  
>>> handling in *many* cases).
>>  This is the whole point.  The benchmark is valid because it performs  
>> the same *task*, and the task is somewhat close to real world.  It  
>> measures *time*, which is universal.  The compared languages use  
>> different approaches and techniques to achieve the goal, that's why  
>> benchmark is useful.  It allows to justify usefulness of these  
>> languages for a particular class of tasks.
>>
>>> - At least in runCharIterateTest() you also convert the D UTF8 string  
>>> also additionally into an UTF32 string, in the Java version you did  
>>> not do this.
>>  Same as above.  If they were using the same approach there wouldn't be  
>> much to benchmark.  Why don't you mention, for instance, that Java is a  
>> virtual machine?
>>
>>> - The StringBuilder in the Java version is *much* faster because it  
>>> doesn't have to allocate a new memory block in each step. You can use  
>>> a similar class in D too, without the need of a special string  
>>> class/object.
>>  I agree here.  Both word tango.text.Util.split and runConcatenateTest  
>> use default array appending which is currently dead slow.  Benji, to  
>> actually compare the speed of string operations you better use one of  
>> array builders discussed in this group.
>
> If anyone wants to try it, I'm pasting the draft version from std.array  
> below.
>
> Andrei
>
> //------------------------------------------------------------------------------
>
> struct Appender(A : T[], T)
> {
>      private T[] * pArray;
>      private size_t _capacity;
>
>      this(T[] * p)
>      {
>          pArray = p;
>          if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
>          _capacity = .capacity(pArray.ptr) / T.sizeof;
>      }
>
>      this(this)
>      {
>          enforce(pArray);
>      }
>
>      T[] data()
>      {
>          return pArray ? *pArray : null;
>      }
>
>      size_t capacity() const { return pArray ? pArray.length : 0; }
>
>      void write(T item)
>      {
>          if (!pArray) pArray = (new typeof(*pArray)[1]).ptr;
>          if (pArray.length < _capacity)
>          {
>              // Should do in-place construction here
>              pArray.ptr[pArray.length] = item;
>              *pArray = pArray.ptr[0 .. pArray.length + 1];
>          }
>          else
>          {
>              // Time to reallocate, do it and cache capacity
>              *pArray ~= item;
>              _capacity = .capacity(pArray.ptr) / T.sizeof;
>          }
>      }
>
>      static if (is(const(T) : T))
>      {
>          alias const(T) AcceptedElementType;
>      }
>      else
>      {
>          alias T AcceptedElementType;
>      }
>
>      void write(AcceptedElementType[] items)
>      {
>          for (; !items.empty(); items.next()) {
>              write(items.head());
>          }
>      }
>
>      static if (is(const(T) == const(char))) {
>          void write(in wchar wc) { assert(false); }
>          void write(in wchar[] wcs)
>          {
>              encode!(T)(wcs, *this);
>          }
>          void write(in dchar dc) { assert(false); }
>          void write(in dchar[] dcs)
>          {
>              encode!(T)(dcs, *this);
>          }
>      }
>
>      void clear()
>      {
>          if (!pArray) return;
>          pArray.length = 0;
>          _capacity = .capacity(pArray.ptr) / T.sizeof;
>      }
> }
>
> auto appender(T)(T[] * t)
> {
>      Appender!(T[]) r = Appender!(T[])(t);
>      return r;
> }
>
> unittest
> {
>      auto arr = new char[0];
>      auto app = appender(&arr);
>      string b = "abcdefg";
>      foreach (char c; b) app.write(c);
>      assert(app.data == "abcdefg");
> }

Two notes:
1) I thought Appender would have an 'append' method as well as opCatAssign.
2) Shouldn't the following member be called 'size'/'length' instead?
size_t capacity() const { return pArray ? pArray.length : 0; }

'capacity' would look like size_t capacity() const { return _capacity; }