Fun project - faster associative array algorithm

Marco Leise via Digitalmars-d digitalmars-d at puremagic.com
Wed Apr 8 10:46:34 PDT 2015


On Tue, 07 Apr 2015 15:26:45 -0700,
Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> On 4/7/15 3:18 PM, bearophile wrote:
> > Andrei Alexandrescu:
> >
> >>> A possible cache-friendly replacement would be an array of buckets
> >>> with local probing to resolve collisions.
> >>
> >> Arrays would need to move data. Current hashtables rely on values
> >> staying put. -- Andrei
> >
> > The performance behaviour of modern CPUs plus the memory hierarchy is
> > rather nonlinear and unintuitive. As Walter said at the start of this
> > thread, arrays come out as more efficient in a large number of cases...
> 
> You must have replied to the wrong post. -- Andrei
 
No, that's really what Walter said:
 "A possible cache-friendly replacement would be an array
[...]."

Especially when iterating over arrays, several elements often share a
cache line, and the CPU's prefetcher kicks in and loads the next
elements while the current ones are being processed.
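
To make that concrete, below is a minimal sketch (mine, not druntime's
actual AA implementation) of the kind of flat table Walter's suggestion
points at: all buckets live in one array and a collision just probes the
neighbouring slots, so a lookup usually stays within one or two cache
lines. The names (Bucket, slotOf, insert) are made up for illustration,
and resizing and deletion are left out.

import std.stdio;

// One flat array of buckets; a collision probes the next slots instead
// of chasing a pointer, so a lookup touches adjacent memory.
struct Bucket
{
    size_t hash;   // 0 marks an empty slot in this toy layout
    string key;
    int value;
}

// Find the slot holding `key`, or the empty slot where it would go.
size_t slotOf(const Bucket[] table, string key)
{
    immutable h = hashOf(key) | 1;      // keep 0 reserved for "empty"
    immutable mask = table.length - 1;  // table length is a power of two
    size_t i = h & mask;
    while (table[i].hash != 0 && !(table[i].hash == h && table[i].key == key))
        i = (i + 1) & mask;             // local (linear) probing
    return i;
}

void insert(Bucket[] table, string key, int value)
{
    table[slotOf(table, key)] = Bucket(hashOf(key) | 1, key, value);
}

void main()
{
    auto table = new Bucket[16];        // no resizing in this sketch
    insert(table, "answer", 42);
    immutable i = slotOf(table, "answer");
    writeln(table[i].hash != 0 ? table[i].value : -1);  // prints 42
}

A failed probe here reads the next Bucket in the same or the next cache
line, which is exactly the access pattern the prefetcher handles well.
Andrei's objection still applies, of course: growing such a table moves
the entries, while the current AA keeps them at stable addresses.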

Unexpected memory fetches from RAM can easily be >100 times
slower.
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (pg. 22)

https://software.intel.com/en-us/forums/topic/287236
"
Core i7 Xeon 5500 Series

Data Source                                Latency (approximate)
L1 CACHE hit                               ~4 cycles
L2 CACHE hit                               ~10 cycles
L3 CACHE hit, line unshared                ~40 cycles
L3 CACHE hit, shared line in another core  ~65 cycles
L3 CACHE hit, modified in another core     ~75 cycles
Remote L3 CACHE                            ~100-300 cycles
Local DRAM                                 ~60 ns
Remote DRAM                                ~100 ns

[...]

a) Structure data such that computationally related information resides within the same cache line.
b) Write your algorithms such that the higher frequency of access occurs on (near) adjacent memory. This reduces TLB pressure.
c) Reduce the number of writes to RAM through use of temporary variables that can be registerized (optimized into registers).
d) Structure data such that you can manipulate it using SSE (when possible).
e) For parallel programming, coordinate activities amongst threads sharing cache levels (HT siblings for L1 and L2, same die or half die for L3, same NUMA node for multiple nodes).

-- Jim Dempsey"
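
Point (a) above can be pushed one step further in the same toy table:
keep the small, frequently scanned hashes in their own dense array and
only touch the larger key/value payload once a hash matches. Again just
a sketch with made-up names, not a description of any existing
implementation:

// Hot data (the hashes) packed separately from the cold payload, so
// the probe loop scans densely packed memory.
struct SplitTable
{
    size_t[] hashes;   // scanned on every probe
    string[] keys;     // read only after a hash match
    int[]    values;

    this(size_t slots)  // slots should be a power of two
    {
        hashes = new size_t[slots];
        keys   = new string[slots];
        values = new int[slots];
    }

    void insert(string key, int value)
    {
        immutable h = hashOf(key) | 1;      // 0 is reserved for "empty"
        immutable mask = hashes.length - 1;
        size_t i = h & mask;
        while (hashes[i] != 0 && !(hashes[i] == h && keys[i] == key))
            i = (i + 1) & mask;             // local probing, no resizing here
        hashes[i] = h;
        keys[i]   = key;
        values[i] = value;
    }

    bool lookup(string key, out int value) const
    {
        immutable h = hashOf(key) | 1;
        immutable mask = hashes.length - 1;
        for (size_t i = h & mask; hashes[i] != 0; i = (i + 1) & mask)
        {
            if (hashes[i] == h && keys[i] == key)   // payload touched only here
            {
                value = values[i];
                return true;
            }
        }
        return false;
    }
}

A probe sequence that misses a few slots in a row then scans adjacent
words of the hashes array instead of pulling whole key/value entries
into the cache, in line with points (a) and (b).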

-- 
Marco


