Marketing of D - article topic ideas?
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Jun 9 07:37:51 PDT 2010
On 06/09/2010 01:28 AM, retard wrote:
> Wed, 09 Jun 2010 01:13:43 -0400, Nick Sabalausky wrote:
>
>> "retard"<re at tard.com.invalid> wrote in message
>> news:hun6ok$13s5$1 at digitalmars.com...
>>> Tue, 08 Jun 2010 16:14:51 -0500, Andrei Alexandrescu wrote:
>>>
>>>> On 06/08/2010 04:05 PM, Walter Bright wrote:
>>>>> Andrei Alexandrescu wrote:
>>>>>> On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
>>>>>>> Please define "reasonable performance"...
>>>>>>
>>>>>> Within 15% of hand-optimized code specialized for the types at hand.
>>>>>
>>>>> I would have said O(n) or O(log n), as opposed to, say, O(n*n).
>>>>>
>>>>> General rules for performance improvements:
>>>>>
>>>>> 1. nobody notices a 10% improvement
>>>>>
>>>>> 2. users will start noticing speedups when they exceed 2x
>>>>>
>>>>> 3. a 10x speedup is a game changer
>>>>
>>>> max of n elements is O(n).
>>>
>>> This probably means that D 2 won't be very efficient on multicore until
>>> the authors learn some basic parallel programming skills. Now where did
>>> you get your PhD - I'm collecting a list of toy universities people
>>> should avoid.
>>
>> You used to have meaningful things to say. Now you're just trolling.
>
> Max of n unordered elements can be solved in O(log log n) time assuming
> you have enough cores and constant time memory access. Happy now?
When calculating the complexity of an operation you don't consider cores
in unlimited supply; their number is a constant. Since complexity is
expressed in terms of the number of inputs, max of n elements is O(n)
because you need to look at each element at least once.
With the number of cores being a constant k, max can be distributed so
that each core looks at n/k elements. That leaves k intermediate results,
which can be further combined in O(log k) time (see below); since we
consider k a constant, this approach can claim a net k-fold speedup.
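
To make the n/k argument concrete, here is a minimal sketch in today's D
(not from this thread; chunkedMax is a hypothetical name) that splits the
input into k chunks, takes a partial max per chunk, and then combines the
k partial results:

import std.algorithm : max;
import std.range : chunks;

// Hypothetical helper: split n elements into k chunks, one per "core",
// compute a partial max per chunk, then combine the k partials. Each chunk
// costs O(n/k) comparisons; combining the k partial results costs O(k) here,
// or O(log k) with the pairwise scheme sketched below. The chunks are
// processed sequentially, so only the division of labor is illustrated.
int chunkedMax(const(int)[] a, size_t k)
{
    assert(a.length > 0 && k > 0);
    immutable chunkSize = (a.length + k - 1) / k;   // ceil(n / k), at least 1

    int result = a[0];
    foreach (c; a.chunks(chunkSize))     // one chunk per "core"
    {
        int partial = c[0];
        foreach (x; c[1 .. $])           // O(n/k) work within a chunk
            partial = max(partial, x);
        result = max(result, partial);   // combine the k partial results
    }
    return result;
}
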
If the number of available cores is proportional to n (an unusual
assumption), you first compare pairs of the n elements using n/2 cores,
then perform n/4 comparisons on pairs of the remaining candidates, and so
on. Each step halves the number of candidates, so the complexity is
O(log n). I'd be curious to find out how that can be reduced to
O(log log n).
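
For illustration, here is a sequential D sketch of that pairwise scheme
(pairwiseMax is a hypothetical name, not code from this thread); each pass
below would be one parallel step if n/2 cores were available:

import std.algorithm : max;

// Each pass compares disjoint pairs, halving the number of candidates.
// The comparisons within a pass are independent of one another, so with
// n/2 cores a pass is one parallel step and the whole reduction takes
// O(log n) steps. Here the passes run sequentially, so only the shape of
// the computation is shown.
int pairwiseMax(const(int)[] a)
{
    assert(a.length > 0);
    auto cur = a.dup;                        // work on a mutable copy
    while (cur.length > 1)
    {
        auto next = new int[(cur.length + 1) / 2];
        foreach (i; 0 .. cur.length / 2)     // independent pair comparisons
            next[i] = max(cur[2 * i], cur[2 * i + 1]);
        if (cur.length % 2 == 1)             // odd element carries over
            next[$ - 1] = cur[$ - 1];
        cur = next;
    }
    return cur[0];
}
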
Finally, for small values of n you could consider the number of cores
sufficiently large, but, as was pointed out, using threads for max is
actually impractical, so superscalar execution may be a better avenue.
In practice this means exploiting instruction-level parallelism (ILP) by
reducing data dependencies between intermediate results. I suggest you
take a look at a thread entitled "challenge: implement the max function"
that I started on 01/21/2007. That thread discusses ILP issues and
continues with the thread "challenge #2: implement the varargs_reduce
metafunction". Allow me to quote from my own post on 01/23/2007:
==================
That's a good point, and goes right to the core of my solution, which
(similarly to Bruno Medeiros') arranges operations in an optimal way for
superscalar evaluation, e.g. max(a, b, c, d) is expanded not to:
max2(max2(max2(a, b), c), d)
but instead to:
max2(max2(a, b), max2(c, d))
The number of operations is the same, but in the latter case there is
logarithmically less dependency between partial results. When max2
expands into a primitive comparison, the code is easy prey for a
superscalar machine to execute in parallel.
This won't matter in a great deal of real code, but the fact that this
will go in the standard library warrants the thought. Besides, the fact
that the language makes it so easy that we can actually think of such
subtlety is very encouraging.
==================
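
For concreteness, here is a minimal sketch in today's D of the balanced
expansion the quote describes; maxOf and max2 are hypothetical names, not
the code from the 2007 thread, and the arguments are assumed to share one
comparable type:

// maxOf(a, b, c, d) expands to max2(max2(a, b), max2(c, d)) rather than to
// a left-leaning chain, so the two inner comparisons have no data dependency
// between them and can issue in parallel on a superscalar core.
T max2(T)(T a, T b) { return b > a ? b : a; }

auto maxOf(Args...)(Args args)
if (Args.length >= 1)
{
    static if (Args.length == 1)
        return args[0];
    else static if (Args.length == 2)
        return max2(args[0], args[1]);
    else
    {
        enum mid = Args.length / 2;   // balanced split of the argument list
        return max2(maxOf(args[0 .. mid]), maxOf(args[mid .. $]));
    }
}

unittest
{
    assert(maxOf(3) == 3);
    assert(maxOf(3, 7, 2, 9) == 9);   // evaluated as max2(max2(3,7), max2(2,9))
}

The point is the shape of the expansion rather than the code itself: the
two halves are reduced independently, which is what gives the superscalar
core independent work to overlap.
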
The messages that follow further discuss how associative reduce should
be ordered to take advantage of superscalar execution.
Andrei