Making generalized Trie type in D

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Jun 4 12:18:31 PDT 2012


On 04.06.2012 22:42, Roman D. Boiko wrote:
> On Monday, 4 June 2012 at 15:35:31 UTC, Dmitry Olshansky wrote:
>> On 04.06.2012 18:19, Andrei Alexandrescu wrote:
>>> On 6/4/12 4:46 AM, Dmitry Olshansky wrote:
>>>> Cross-posting from my GSOC list as I think we had a lack of D rox posts
>>>> lately :)
>>> [snip]
>>>
>>> I think it would be great if you converted this post into an article.
>>>
>>> Andrei
>>>
>>
>> Sounds good, will do once I fix a few issues that were mentioned
>> (bit-packing, GC types etc.)
>
> Would be interesting to see some more examples, along with
> rationale/motivation for various aspects of your API, and possible usage
> scenarios.
>
> Tries are fundamental (both data structures and respective algorithms)
> for lookup problems, in the same way as arrays are fundamental for
> indexed access.
>
> For example, they have several advantages over hash tables. Hash
> calculation requires const * key.length operations, which is equivalent
> to the number of comparisons needed for trie lookup. But hash tables may
> be less space efficient, or lead to hash collisions which increase
> lookup time. Also, it is possible to create persistent (immutable) tries
> with efficient (log N) inserting / deleting (this scenario is very
> important for my DCT project). Immutable hash tables would require O(N)
> copying for each insert / delete.

Aye, the good thing about them is that the amount of data affected by any 
change is localized to one "page". So you just copy that page over and 
remap the indices/pointers.
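To make the point concrete, here is a minimal sketch (the names and the two-level layout are my illustration, not the actual gsoc code) of a persistent update: only the touched second-level "page" and the page table are copied, every other page stays shared with the old trie.

```d
// Hypothetical two-level trie over ushort keys; high byte picks a page,
// low byte picks the slot inside it.
struct Trie2
{
    immutable(ushort)[] level1;    // high byte of the key -> page number
    immutable(ubyte)[][] pages;    // second-level pages, 256 entries each

    ubyte opIndex(ushort key) const
    {
        return pages[level1[key >> 8]][key & 0xFF];
    }
}

// Persistent "insert": returns a new trie, leaves the old one untouched.
Trie2 withValue(Trie2 t, ushort key, ubyte value)
{
    size_t pageNo = t.level1[key >> 8];
    auto page = t.pages[pageNo].dup;       // copy just the one affected page
    page[key & 0xFF] = value;
    auto pages = t.pages.dup;              // remap the index to the copy
    pages[pageNo] = cast(immutable) page;
    return Trie2(t.level1, pages);         // level-1 index is shared as-is
}
```

Everything not on the path to the changed slot is shared, which is where the log-ish update cost in the quoted scenario comes from.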

>
> It is difficult to create a good API for fundamental data structures,
> because various use cases would motivate different trade-offs. The same
> is true for implementation. This is why I like your decision to
> introduce policies for configuration. Rationale and use cases should
> help to analyze design of your API and implementation, thus you will get
> better community feedback :)

Well I guess I'll talk in depth about them in the article, as the 
material exceeds the sane limits of a single NG post.

In brief:
	- multiple levels are stored in one memory chunk, one after another, 
which helps a bit with cache locality (the first level goes first)
	- constructors minimize the number of "pages" on each level by 
constructing it outwards from the last level and checking for duplicates 
(costs ~ O(N^2) though, IIRC)
	- I learned the hard way not to introduce extra conditionals anywhere, 
so there is no "out of range, max index, non-existent" crap; in all 
cases it's a clean-cut memory access. The extra bits lost on having at 
least one "default" page per level can be saved by going an extra level
	- extra levels aren't as bad as they look, since the memory is close anyway.
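The "no conditionals" point can be sketched like so, under an assumed layout (my illustration, not the actual gsoc code): three levels packed back to back in one chunk, with page 0 of each level acting as the "default" page, so every lookup is a clean chain of indexed loads with no branching.

```d
// Hypothetical 3-level trie over dchar (max 0x10FFFF, so level 1 needs
// 0x10FFFF >> 14 + 1 = 68 entries; levels 2 and 3 use 128-entry pages).
struct Trie3
{
    ushort[] table;    // all three levels in one chunk, first level first
    size_t l2, l3;     // offsets where levels 2 and 3 start

    ushort opIndex(dchar ch) const
    {
        // clean-cut chain of memory accesses, no if(...) anywhere;
        // unmapped characters simply fall through the default page 0
        size_t p1 = table[ch >> 14];                           // level 1
        size_t p2 = table[l2 + p1 * 128 + ((ch >> 7) & 0x7F)]; // level 2
        return table[l3 + p2 * 128 + (ch & 0x7F)];             // level 3
    }
}
```

A freshly zeroed table already behaves as "everything maps to 0" via the default pages, which is exactly the trick that avoids the "not found" branch.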
	
>
> Below are some notes related to my DCT use cases.
>
> Your examples deal with lookup by the whole word (first/last characters
> and length are needed). Are your API and implementation adaptable for
> character-by-character trie lookup?

I would say that going one character at a time won't help you much, since 
the speed is almost the same, if not worse.
The problematic thing with one-by-one lookup: say you want to stop early, 
right? Now you have to check the *NOT FOUND* case, and that implies extra 
branching (if(...)) on _each level_, and maybe reusing certain valid 
values as a "not found" marker (extra configuration?).

Branching and testing are the things that kill the speed advantage of 
tries; they are what I overlooked in my previous attempt, see 
std/internal/uni.d. The other fault is keeping data and index in separate 
locations: pointer-happy, disjoint node (page) placement is another form 
of the same mistake.
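For contrast, this is the pointer-happy layout being warned against (a textbook node-per-allocation trie, my illustration): each level of lookup is a dependent pointer chase to a possibly distant heap block, plus the null test on every level.

```d
// Classic pointer-chasing trie node: one heap allocation per node,
// one pointer per letter, mostly null. Assumes lowercase a-z keys.
class Node
{
    Node[26] next;     // child pointers, scattered across the heap
    int value = -1;    // -1 means "no entry ends here"
}

int lookup(Node root, string word)
{
    auto n = root;
    foreach (ch; word)
    {
        n = n.next[ch - 'a'];       // dependent load, cache miss likely
        if (n is null) return -1;   // the per-level branch
    }
    return n.value;
}
```

Every step here is both a potential cache miss and a conditional branch, which is exactly the combination the flat paged layout avoids.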

>
> Will compile-time generation of lookup code based on tries be supported?
> Example which is currently in DCT (first implemented by Brian Schott in
> his Dscanner project) uses switch statements (which means lookup linear
> in number of possible characters at each position).

Nope, the days of linear lookup for switch are over (were there ever such 
days?). Compilers always do a binary search nowadays unless linear is 
more efficient, e.g. for a small number of values. (The compiler even 
weighs which is better and may use a combination of the two.)

However, you'd better check the asm code afterwards. The compiler is like 
a nasty stepchild: it will give up on generating good old jump tables 
given any reason it finds justifiable. (But it may use a few small jump 
tables plus binary search, which could be fine... if not in a tight loop!)

> A trivial
> improvement might be using if statements and binary lookup. (E.g., if
> there are 26 possible characters used at some position, use only 5
> comparisons, not 26).
>

Moreover, you'd be surprised, but such a leap-frog binary search loses by 
a big margin to a _multilayer_ lookup table. (I for one was quite shocked 
back then.)
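For reference, the leap-frog dispatch the quoted text proposes looks roughly like this (my sketch): a binary search over the sorted set of possible characters at a position, ~log2(n) comparisons (5 for 26 letters) instead of up to 26. Each of those comparisons is still a data-dependent branch, though, which is why the flat table wins.

```d
// Binary dispatch over the sorted characters admissible at a position.
// Returns the character's index in `sorted`, or -1 if it's not there.
int classify(dchar ch, const(dchar)[] sorted)
{
    size_t lo = 0, hi = sorted.length;
    while (lo < hi)                     // ~log2(n) iterations
    {
        size_t mid = (lo + hi) / 2;
        if (sorted[mid] < ch)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < sorted.length && sorted[lo] == ch) ? cast(int) lo : -1;
}
```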

> I wanted to analyse your regex implementation, but that's not an easy
> task and requires a lot of effort...

Yeah, sorry for some encrypted Klingon here and there ;)

>It looks like the most promising
> alternative to binary trie lookup which I described in previous
> paragraph. Similarities and differences with your regex design might
> also help us understand tries better.

Well if you are in no rush you might as well just take my latest 
development in the ways of Trie from my gsoc fork :)
Skipping some gory days of try and trial (fail) in the process ;)

-- 
Dmitry Olshansky
