Need for (C++20) Contiguous Range

Fri Oct 9 13:55:19 UTC 2020

On Friday, 9 October 2020 at 13:31:18 UTC, Steven Schveighoffer 
wrote:
> We are talking about a range that is also contiguous in memory. 
> The only logical reason to require this is to get a ptr and a 
> length and use the CPU specialized instructions for these 
> things. We have that type pattern, it's a slice. C++ didn't 
> have that formally defined, so they needed to add something.

Yes, C++ has data(), but that does not imply that you provide an 
array interface. Actually, I'm not even sure it implies that the 
elements are in order... but it probably does now that they 
require random-access properties.

Either way, you want a SIMD aligned slice. I guess if one had 
some kind of interface where you fetchUnaligned first and from 
there fetchAligned, and the finish with fetchUnaligned could work.

A better solution is to have a bitmask that tells you whether the 
element is active or not. Then you just provide an aligned 
pointer to the range-algorithm. But this isn't going to work for 
safe code.

Modern SIMD on x86 has this built in.

This also means that you can continue to use SIMD operations 
after filter.

So when you "pop" slices you might get a bit mask sequence like 
this

00001111 11111111 11100000

(I show 8 elements per slice here, probably should be configured 
to 2, 4, 8 16, 32, 64, 128, 256)

Ok, so then you have a filter than will return every other 
element, then you get the same slices, but a different mask 
pattern:

00001010 10101010 10100000

Pretty neat actually.

This would also work with randomly populated buffers (like a 
hash).

> And when I got to thinking about the performance of such 
> buffers (like, let's say a ring buffer that has 2 slices over 
> the same underlying buffer, so you never have to copy), the 
> cost of *indexing* is going to overwhelm the benefit of not 
> having to copy. I even implemented a ringbuffer using memory 
> map tricks, and it's not any better than a straight array.

I've literally spent weeks implementing various (some threadsafe) 
ringbuffers. It can be surprisingly challenging when you start to 
deal with slices and "transactional-like" features. Simple 
concept, but it can be a challenge to get up to speed and also 
being sure that it is correct.

There are some neat tricks one can use when implementing 
ringbuffers that are 2^n depending on what operations you desire. 
One important question is whether you always require one element 
to be unused or not, that tiny detail can have a major impact on 
the implementation :-D.

Anyway, when implementing a ring buffer, only include the 
operations you actually need. The more generic it is, the more 
overhead you'll get. Sadly. It is mind-blowing how many different 
implementation strategies there are. (for such a trivial concept)

It is nearly impossible to create a best-of-breed library 
implementation for that reason. Very sensitive to the use context.

But this is getting off-topic...

> Contiguous memory has the properties that allow the CPU to 
> shine, and the way to say "I have contiguous memory" in D is: 
> use a slice.

But that kinda implies that the buffer is indexable and in order.