Making generalized Trie type in D

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Jun 4 13:40:01 PDT 2012


[snip]
>> Aye, the good thing about them - the amount of data affected by any
>> change is localized to one "page". So you just copy it over and remap
>> indices/pointers.
> Actually, Microsoft went this way (immutable hash tables) in their
> Roslyn preview implementation. However, I still believe that tries will
> work better here. Will check...
>
> Would bulk pre-allocation of memory to be used by trie improve locality?
> With some heuristics for copying when it goes out of control.
>

Sorry, my Trie implementation is focused on the "construct once - read 
everywhere" case. Like I noted, mutation is not hard, but it's easy to 
blow up the size if used unknowingly.
I'm thinking along the lines of:

auto trie = ...; // create it
for(...)
{
	trie[x] = ...; // modify repeatedly
}
trie.compact(); // redo the same algorithm as during construction, O(N^2)

>>> It is difficult to create a good API for fundamental data structures,
>>> because various use cases would motivate different trade-offs. The same
>>> is true for implementation. This is why I like your decision to
>>> introduce policies for configuration. Rationale and use cases should
>>> help to analyze design of your API and implementation, thus you will get
>>> better community feedback :)
>>
>> Well I guess I'll talk in depth about them in the article, as the
>> material exceeds the sane limits of a single NG post.
>>
>> In brief:
>> - multiple levels are stored in one memory chunk, one after another,
>> thus helping a bit with cache locality (the first level goes first)
>> - the constructors minimize the number of "pages" on each level by
>> constructing it outwards from the last level and checking for
>> duplicates (costs ~ O(N^2) though, IIRC)
> So this price is paid only on construction, right? Are there
> alternatives to choose from (when needed)? If yes, which?

See above. The price is paid every time you want to squeeze it into as 
little memory as possible by removing duplicate pages.
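For illustration, roughly what that duplicate-page elimination could look 
like (a minimal sketch with hypothetical names, not my actual code):

// Collapse duplicate pages on one level of the trie. 'storage' is the
// raw level storage cut into fixed-size pages; 'parent' holds page
// indices into it. The linear scan over the unique pages is what makes
// the whole thing ~O(N^2) in the number of pages.
void compactLevel(T)(ref T[] storage, size_t pageSize, size_t[] parent)
{
    T[][] uniquePages;            // distinct pages found so far
    size_t[] remap;               // old page index -> new page index
    for (size_t i = 0; i + pageSize <= storage.length; i += pageSize)
    {
        auto page = storage[i .. i + pageSize];
        size_t found = size_t.max;
        foreach (j, u; uniquePages)
            if (u == page) { found = j; break; }
        if (found == size_t.max)
        {
            found = uniquePages.length;
            uniquePages ~= page.dup;
        }
        remap ~= found;
    }
    storage = null;               // rebuild the level from unique pages
    foreach (u; uniquePages)
        storage ~= u;
    foreach (ref p; parent)       // fix up indices in the level above
        p = remap[p];
}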
>
>> - I learned the hard way not to introduce extra conditionals anywhere,
>> so there is no "out of range, max index, not existent" crap; in all
>> cases it's a clean-cut memory access. The extra bits lost on having at
>> least one "default" page per level can be saved by going an extra level
> Could you please elaborate? How do you handle situations when not
> existent, etc., is needed?
>
Easy: say you have a 2-level trie with 4 entries per level (for the sake 
of simplicity), and say the final value is an int.

Then, in the highly redundant case, or just when there is a small number 
of values in the initial data set (the two cases are equivalent, thus 
tries are easily invertible, btw):

LVL 0: [0, 1, 0, 0]
This first level always occupies its full size (it's asserted that there 
is no index >= 4).
LVL 1: [0, 0, 0, 0] [1, 0, 2, 3]
Note again - only full pages, no cheating with half-pages, but we can 
save on the number of them (note the almost obligatory zero page).

So it contains only 1, 2 and 3, at indexes 4, 6 and 7 respectively; 
T.init is our way of saying NOT EXISTS ... yes, I think that should be 
user-definable.
This way there are no checks anywhere, only shift, add, dereference, 
shift, add, dereference...
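In code, the lookup over that exact layout is nothing but index 
arithmetic (a minimal sketch, not the actual implementation):

// Two-level trie from the example above: 4 entries per page,
// so 2 bits of the key per level.
immutable uint[4] level0 = [0, 1, 0, 0];  // page numbers into level 1
immutable int[8]  level1 = [0, 0, 0, 0,   // page 0: the "zero" page
                            1, 0, 2, 3];  // page 1

int lookup(uint key)  // key in 0 .. 16
{
    // shift, add, dereference - no branches anywhere
    uint page = level0[key >> 2];
    return level1[page * 4 + (key & 3)];
}

unittest
{
    assert(lookup(4) == 1 && lookup(6) == 2 && lookup(7) == 3);
    assert(lookup(5) == 0);  // int.init, i.e. NOT EXISTS
}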

>> I would say that one by one won't help you much since the speed is
>> almost the same if not worse.
> I guess, in general your statement is true, especially because known
> length could improve speed significantly. Not sure (but can easily
> believe) that in my particular situation it is true.

That, and some simple-minded hashing of, say, the first char and the 
length ;)
In any case you need to munch the whole symbol, I think.
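Something along these lines, for instance (a toy sketch; the keyword list 
is truncated and the bucketing scheme is only an illustration):

// Bucket keywords by (length, first char); one or two full compares
// then settle it.
immutable string[][26][11] kwBuckets = buildBuckets();

auto buildBuckets()
{
    string[][26][11] t;
    foreach (kw; ["in", "if", "out", "int", "inout", "import"])
        t[kw.length][kw[0] - 'a'] ~= kw;
    return t;
}

bool isKeyword(const(char)[] id)
{
    if (id.length == 0 || id.length >= 11) return false;
    immutable c = id[0];
    if (c < 'a' || c > 'z') return false;
    foreach (kw; kwBuckets[id.length][c - 'a'])
        if (kw == id) return true;
    return false;
}

unittest
{
    assert(isKeyword("inout") && isKeyword("in"));
    assert(!isKeyword("input"));
}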

> For example, one by
> one would allow ignoring key encoding (and thus using multiple encodings
> simultaneously just as easily as single).

It's just as easy with the whole thing. Treat it as bytes ;)

>
>> The problematic thing with one by one - say you want to stop early,
>> right?
> Why? I'd like to lex inout as TokenKind.InOut, not TokenKind.In followed
> by TokenKind.Out. Did I misunderstand your question?
>

No, I meant stopping at, say, level 2 of a 5-level Trie because the 
element was not found. Generally a stupid idea.

>> Branching and testing are the things that kill the speed advantage of
>> Tries, the ones I overlooked in my previous attempt, see
>> std/internal/uni.d. The other is keeping data and index in separate
>> locations; pointer-happy disjoint node (page) locality is another form
>> of the same fault.
> This concern has been disturbing me for some time already, and slightly
> demotivates me, because implementing something this way will likely lead
> to a failure. I don't have enough experience with the alternatives to
> know their advantages and trade-offs. I'll check your code. I did plan
> to try table lookup instead of branching. I guess making my own mistakes
> is necessary anyway.

It could be enlightening; just don't give up too quickly and don't jump 
to conclusions. In fact, try to be sympathetic with the "losing party", 
as in ... "hm, this way is much slower, too bad - I have to optimize it 
somehow". In other words, make sure you have squeezed all you can out of 
the "slow" method.

>> Nope, the days of linear lookup for switch are over (were there even
>> such days?); compilers always do binary search nowadays, unless linear
>> is more efficient, e.g. for a small number of values (they even weigh
>> which is better and use a combination of the two).
> I thought this *might* be the case, but didn't know nor check anywhere.
> I also wanted to do linear search for some empirically chosen small
> number of items.

Compare with C's memchr and such just in case; it was ~4x faster than 
Phobos' find last time I checked.
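If you want to reproduce the comparison, a quick harness like this will 
do (it uses today's Phobos module names; the exact numbers will of course 
vary by compiler and CPU):

import core.stdc.string : memchr;
import std.algorithm.searching : find;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    auto hay = new ubyte[](1 << 20);
    hay[$ - 1] = 1;                      // needle at the very end

    size_t sink;                         // keep results alive
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. 100)
        sink += cast(size_t) memchr(hay.ptr, 1, hay.length);
    immutable tMemchr = sw.peek;

    sw.reset();
    foreach (_; 0 .. 100)
        sink += find(hay[], ubyte(1)).length;
    immutable tFind = sw.peek;

    writefln("memchr: %s, find: %s (%s)", tMemchr, tFind, sink);
}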

>>> A trivial
>>> improvement might be using if statements and binary lookup. (E.g., if
>>> there are 26 possible characters used at some position, use only 5
>>> comparisons, not 26).
>>>
>>
>> Moreover, you'd be surprised, but such a leap-frog binary search loses
>> by a big margin to a _multilayer_ lookup table. (I for one was quite
>> shocked back then)
> Thanks again :) Are any numbers or perf. tests available?

Well, I might recompile regex to use binary search vs the Trie again; off 
the record, it was ~2 times slower minimum. That was a sorted array vs an 
oldish Trie. Note, however, that the lookup is responsible for only ~40% 
of the actual matching time... so the lookup itself comes out at around 
2.5 * 2 = 5 times slower.

The compiler's switch, I *believe*, could be at least ~2x better, as it's 
a fully unrolled binary search, but it's hard to guess how much exactly, 
and it depends on the CPU model at the end of the day.
The older the CPU, the better lookup tables generally do, except for 
these Celeron mules with half the cache ;)
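By "fully unrolled binary search" I mean roughly this shape: the sparse 
switch becomes a fixed tree of compares with no loop at all (a 
hand-written sketch for the values {3, 17, 42, 99}; the case results are 
made up):

// log2(4) = 2 range compares plus one equality test per lookup,
// no loop counter, no bounds checks
int classify(uint v)
{
    if (v < 42)
    {
        if (v < 17)
            return v == 3  ? 0 : -1;
        return v == 17 ? 1 : -1;
    }
    if (v < 99)
        return v == 42 ? 2 : -1;
    return v == 99 ? 3 : -1;
}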

-- 
Dmitry Olshansky

