Marketing of D - article topic ideas?
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Jun 9 07:37:51 PDT 2010
On 06/09/2010 01:28 AM, retard wrote:
> Wed, 09 Jun 2010 01:13:43 -0400, Nick Sabalausky wrote:
>
>> "retard"<re at tard.com.invalid> wrote in message
>> news:hun6ok$13s5$1 at digitalmars.com...
>>> Tue, 08 Jun 2010 16:14:51 -0500, Andrei Alexandrescu wrote:
>>>
>>>> On 06/08/2010 04:05 PM, Walter Bright wrote:
>>>>> Andrei Alexandrescu wrote:
>>>>>> On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
>>>>>>> Please define "reasonable performance"...
>>>>>>
>>>>>> Within 15% of hand-optimized code specialized for the types at hand.
>>>>>
>>>>> I would have said O(n) or O(log n), as opposed to, say, O(n*n).
>>>>>
>>>>> General rules for performance improvements:
>>>>>
>>>>> 1. nobody notices a 10% improvement
>>>>>
>>>>> 2. users will start noticing speedups when they exceed 2x
>>>>>
>>>>> 3. a 10x speedup is a game changer
>>>>
>>>> max of n elements is O(n).
>>>
>>> This probably means that D 2 won't be very efficient on multicore until
>>> the authors learn some basic parallel programming skills. Now where did
>>> you get your PhD - I'm collecting a list of toy universities people
>>> should avoid.
>>
>> You used to have meaningful things to say. Now you're just trolling.
>
> Max of n unordered elements can be solved in O(log log n) time assuming
> you have enough cores and constant time memory access. Happy now?
When calculating the complexity of an operation you don't consider cores
in unlimited supply; their number is a constant. Since complexity is
expressed in terms of the number of inputs, max of n elements is O(n)
because you need to look at each element at least once.
With the number of cores being a constant k, max can be distributed so
that each core looks at n/k elements. That leaves k intermediate results,
which can be further combined in O(log k) time (see below); since we
consider k a constant, this approach can claim a net k-fold speedup.
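
To make the n/k argument concrete, here is a minimal sketch in today's D
(not from this thread; chunkedMax is a hypothetical name) that splits the
input into k chunks, takes a partial max per chunk, and then combines the
k partial results:

import std.algorithm : max;
import std.range : chunks;

// Hypothetical helper: split n elements into k chunks, one per "core",
// compute a partial max per chunk, then combine the k partials. Each chunk
// costs O(n/k) comparisons; combining the k partial results costs O(k) here,
// or O(log k) with the pairwise scheme sketched below. The chunks are
// processed sequentially, so only the division of labor is illustrated.
int chunkedMax(const(int)[] a, size_t k)
{
    assert(a.length > 0 && k > 0);
    immutable chunkSize = (a.length + k - 1) / k;   // ceil(n / k), at least 1

    int result = a[0];
    foreach (c; a.chunks(chunkSize))     // one chunk per "core"
    {
        int partial = c[0];
        foreach (x; c[1 .. $])           // O(n/k) work within a chunk
            partial = max(partial, x);
        result = max(result, partial);   // combine the k partial results
    }
    return result;
}
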
If the number of available cores is proportional to n (an unusual
assumption), you first compare pairs of the n elements using n/2 cores,
then perform n/4 comparisons on pairs of the remaining candidates, and so
on. Each step halves the number of candidates, so the complexity is
O(log n). I'd be curious to find out how that can be reduced to
O(log log n).
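
For illustration, here is a sequential D sketch of that pairwise scheme
(pairwiseMax is a hypothetical name, not code from this thread); each pass
below would be one parallel step if n/2 cores were available:

import std.algorithm : max;

// Each pass compares disjoint pairs, halving the number of candidates.
// The comparisons within a pass are independent of one another, so with
// n/2 cores a pass is one parallel step and the whole reduction takes
// O(log n) steps. Here the passes run sequentially, so only the shape of
// the computation is shown.
int pairwiseMax(const(int)[] a)
{
    assert(a.length > 0);
    auto cur = a.dup;                        // work on a mutable copy
    while (cur.length > 1)
    {
        auto next = new int[(cur.length + 1) / 2];
        foreach (i; 0 .. cur.length / 2)     // independent pair comparisons
            next[i] = max(cur[2 * i], cur[2 * i + 1]);
        if (cur.length % 2 == 1)             // odd element carries over
            next[$ - 1] = cur[$ - 1];
        cur = next;
    }
    return cur[0];
}
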
Finally, for small values of n you could consider the number of cores
sufficiently large, but, as was pointed out, using threads for max is
actually impractical, so superscalar execution may be a better avenue.
In practice this means exploiting instruction-level parallelism (ILP) by
reducing data dependencies between intermediate results. I suggest you
take a look at a thread entitled "challenge: implement the max function"
that I started on 01/21/2007. That thread discusses ILP issues and
continues with the thread "challenge #2: implement the varargs_reduce
metafunction". Allow me to quote from my own post on 01/23/2007:
==================
That's a good point, and goes right to the core of my solution, which
(similarly to Bruno Medeiros') arranges operations in an optimal way for
superscalar evaluation, e.g. max(a, b, c, d) is expanded not to:
max2(max2(max2(a, b), c), d)
but instead to:
max2(max2(a, b), max2(c, d))
The number of operations is the same, but in the latter case there is
logarithmically less dependency between partial results. When max2
expands into a primitive comparison, the code is easy prey for a
superscalar machine to execute in parallel.
This won't matter in a great deal of real code, but the fact that this
will go in the standard library warrants the thought. Besides, the fact
that the language makes it so easy that we can actually think of such
subtlety is very encouraging.
==================
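
For concreteness, here is a minimal sketch in today's D of the balanced
expansion the quote describes; maxOf and max2 are hypothetical names, not
the code from the 2007 thread, and the arguments are assumed to share one
comparable type:

// maxOf(a, b, c, d) expands to max2(max2(a, b), max2(c, d)) rather than to
// a left-leaning chain, so the two inner comparisons have no data dependency
// between them and can issue in parallel on a superscalar core.
T max2(T)(T a, T b) { return b > a ? b : a; }

auto maxOf(Args...)(Args args)
if (Args.length >= 1)
{
    static if (Args.length == 1)
        return args[0];
    else static if (Args.length == 2)
        return max2(args[0], args[1]);
    else
    {
        enum mid = Args.length / 2;   // balanced split of the argument list
        return max2(maxOf(args[0 .. mid]), maxOf(args[mid .. $]));
    }
}

unittest
{
    assert(maxOf(3) == 3);
    assert(maxOf(3, 7, 2, 9) == 9);   // evaluated as max2(max2(3,7), max2(2,9))
}

The point is the shape of the expansion rather than the code itself: the
two halves are reduced independently, which is what gives the superscalar
core independent work to overlap.
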
The messages that follow further discuss how associative reduce should
be ordered to take advantage of superscalar execution.
Andrei