Making generalized Trie type in D

Roman D. Boiko rb at d-coding.com
Mon Jun 4 13:09:00 PDT 2012


On Monday, 4 June 2012 at 19:18:32 UTC, Dmitry Olshansky wrote:
> On 04.06.2012 22:42, Roman D. Boiko wrote:
>> ... it is possible to create persistent (immutable) tries
>> with efficient O(log N) inserting / deleting (this scenario is
>> very important for my DCT project). Immutable hash tables
>> would require O(N) copying for each insert / delete.
>
> Aye, the good thing about them is that the amount of data
> affected by any change is localized to one "page". So you just
> copy it over and remap indices/pointers.
Actually, Microsoft went this way (immutable hash tables) in 
their Roslyn preview implementation. However, I still believe 
that tries will work better here. Will check...
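
For illustration, the path-copying idea above in a toy persistent 
trie over 'a'..'z' (hypothetical layout; neither Roslyn's nor 
DCT's actual code): insert allocates fresh copies of just the 
nodes along the key's path, one per character, and shares 
everything else with the old version.

final class Node
{
    bool terminal;
    Node[26] next; // children for 'a' .. 'z'
}

// Never mutates: copies the nodes on the key's path, shares the rest.
Node insert(Node n, string key)
{
    auto copy = new Node;
    if (n !is null)
    {
        copy.terminal = n.terminal;
        copy.next = n.next; // share all untouched subtrees
    }
    if (key.length == 0)
        copy.terminal = true;
    else
    {
        immutable i = key[0] - 'a';
        copy.next[i] = insert(n is null ? null : n.next[i], key[1 .. $]);
    }
    return copy;
}

bool contains(Node n, string key)
{
    foreach (c; key)
    {
        if (n is null) return false;
        n = n.next[c - 'a'];
    }
    return n !is null && n.terminal;
}

void main()
{
    Node v1 = insert(null, "in");
    Node v2 = insert(v1, "inout"); // v1 stays valid and unchanged
    assert(contains(v2, "in") && contains(v2, "inout"));
    assert(!contains(v1, "inout")); // the old version is unaffected
}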

Would bulk pre-allocation of memory to be used by the trie 
improve locality, with some heuristic for copying when it grows 
out of control?
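
Roughly what I mean, sketched (hypothetical, not DCT code): nodes 
drawn from one pre-allocated array and linked by index rather 
than pointer, so the whole structure sits in one contiguous 
block. The copying heuristic would then be compacting live nodes 
into a fresh array when the pool grows out of control.

struct PooledTrie
{
    static struct Node { bool terminal; uint[26] next; } // 0 = no child
    Node[] nodes;

    this(size_t expectedNodes)
    {
        nodes.reserve(expectedNodes); // one bulk allocation up front
        nodes.length = 2; // slot 0 = "no child" sentinel, slot 1 = root
    }

    void insert(string key) // assumes keys over 'a'..'z'
    {
        uint cur = 1;
        foreach (c; key)
        {
            immutable i = c - 'a';
            if (nodes[cur].next[i] == 0)
            {
                nodes[cur].next[i] = cast(uint) nodes.length;
                nodes ~= Node.init; // no reallocation while reserve lasts
            }
            cur = nodes[cur].next[i];
        }
        nodes[cur].terminal = true;
    }

    bool contains(string key) const
    {
        uint cur = 1;
        foreach (c; key)
        {
            cur = nodes[cur].next[c - 'a'];
            if (cur == 0) return false;
        }
        return nodes[cur].terminal;
    }
}

// usage: auto t = PooledTrie(1024); t.insert("inout");
// assert(t.contains("inout") && !t.contains("ino"));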

>> It is difficult to create a good API for fundamental data 
>> structures,
>> because various use cases would motivate different trade-offs. 
>> The same
>> is true for implementation. This is why I like your decision to
>> introduce policies for configuration. Rationale and use cases 
>> should
>> help to analyze design of your API and implementation, thus 
>> you will get
>> better community feedback :)
>
> Well I guess I'll talk in depth about them in the article, as
> the material exceeds the sane limits of a single NG post.
>
> In brief:
> 	- multiple levels are stored in one memory chunk one after 
> another, thus helping a bit with cache locality (the first
> level goes first)
> 	- constructors do minimize the number of "pages" on each level
> by constructing them outwards from the last level and checking
> for duplicates (costs ~ O(N^2) though, IIRC)
So this price is paid only on construction, right? Are there 
alternatives to choose from (when needed)? If yes, which?
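
To check that I understand the construction you describe, here is 
my toy reading of it (two levels, byte keys, hypothetical names): 
each 16-entry block of the flat table becomes a "page", and a 
page is only emitted if no identical page exists yet, which takes 
a linear scan per block and gives the quadratic construction cost.

import std.algorithm.comparison : equal;

struct TwoLevel
{
    ubyte[16] index; // first level: page number per 16-entry block
    ubyte[] pages;   // second level: deduplicated pages, back to back

    ubyte opIndex(ubyte key) const
    {
        return pages[index[key >> 4] * 16 + (key & 15)];
    }
}

TwoLevel build(ref const ubyte[256] flat)
{
    TwoLevel t;
    foreach (block; 0 .. 16)
    {
        const page = flat[block * 16 .. (block + 1) * 16];
        size_t found = size_t.max;
        // the duplicate scan: linear in pages emitted so far,
        // executed once per block; this is the ~O(N^2) part
        for (size_t p = 0; p * 16 < t.pages.length; ++p)
            if (equal(page, t.pages[p * 16 .. (p + 1) * 16]))
            { found = p; break; }
        if (found == size_t.max)
        {
            found = t.pages.length / 16;
            t.pages ~= page;
        }
        t.index[block] = cast(ubyte) found;
    }
    return t;
}

unittest
{
    ubyte[256] flat;
    flat['d'] = 42;
    auto t = build(flat);
    assert(t['d'] == 42 && t['x'] == 0);
    assert(t.pages.length == 32); // 15 all-zero blocks share one page
}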

> 	- I learned the hard way not to introduce extra conditionals 
> anywhere, so there is no "out of range, max index, not
> existent" crap; in all cases it's a clean-cut memory access.
> The extra bits lost on having at least one "default" page per
> level can be saved by adding an extra level.
Could you please elaborate? How do you handle situations where a 
"not found" result is actually needed?
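
My current guess at what you mean, so you can correct me: every 
slot of the first level points somewhere, and unmapped blocks all 
point at a shared "default" page filled with a sentinel, so the 
lookup itself never branches and the caller tests the result once 
(a toy byte-keyed sketch, hypothetical names):

enum ubyte NotFound = 255;

struct BranchFree
{
    ubyte[16] index; // block -> page; unmapped blocks keep 0
    ubyte[] pages;   // page 0 is the all-NotFound default page

    ubyte opIndex(ubyte key) const
    {
        // clean-cut memory access: valid for every input, no if(...)
        return pages[index[key >> 4] * 16 + (key & 15)];
    }
}

BranchFree withEntry(ubyte key, ubyte value)
{
    BranchFree t;
    t.pages = new ubyte[](32); // page 0 (default) + page 1 (real)
    t.pages[] = NotFound;
    t.index[key >> 4] = 1;
    t.pages[16 + (key & 15)] = value;
    return t;
}

// usage: auto t = withEntry('d', 7);
// t['d'] == 7; any other key reads NotFound from the default page.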

>> Your examples deal with lookup by the whole word (first/last 
>> characters and length are needed). Are your API and 
>> implementation adaptable for character-by-character trie 
>> lookup?
>
> I would say that one by one won't help you much since the speed 
> is almost the same if not worse.
I guess your statement is true in general, especially because a 
known length could improve speed significantly. I'm not sure (but 
can easily believe) that it holds in my particular situation. For 
example, going one by one would allow ignoring key encoding (and 
thus using multiple encodings simultaneously just as easily as a 
single one).

> The problematic thing with one by one - say you want to stop 
> early, right?
Why? I'd like to lex inout as TokenKind.InOut, not TokenKind.In 
followed by TokenKind.Out. Did I misunderstand your question?
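
For context, the maximal-munch behavior I need, as a toy sketch 
(hypothetical TokenKind and trie; a real lexer would also check 
that text[len] does not continue an identifier): walk as far as 
the trie allows and remember the last terminal node seen, so 
"inout" wins over its prefix "in".

enum TokenKind { Identifier, In, Out, InOut }

struct Node
{
    TokenKind kind;
    bool terminal;
    Node*[26] next;
}

Node* add(Node* root, string word, TokenKind kind)
{
    if (root is null) root = new Node;
    auto n = root;
    foreach (c; word)
    {
        if (n.next[c - 'a'] is null) n.next[c - 'a'] = new Node;
        n = n.next[c - 'a'];
    }
    n.terminal = true;
    n.kind = kind;
    return root;
}

TokenKind lexKeyword(const(Node)* n, string text, out size_t len)
{
    TokenKind best = TokenKind.Identifier;
    foreach (i, c; text)
    {
        if (c < 'a' || c > 'z' || (n = n.next[c - 'a']) is null)
            break; // the early-stop branch you mention
        if (n.terminal) { best = n.kind; len = i + 1; }
    }
    return best;
}

void main()
{
    Node* root = add(null, "in", TokenKind.In);
    root = add(root, "out", TokenKind.Out);
    root = add(root, "inout", TokenKind.InOut);

    size_t len;
    assert(lexKeyword(root, "inout+x", len) == TokenKind.InOut && len == 5);
    assert(lexKeyword(root, "in+out", len) == TokenKind.In && len == 2);
}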

> Now you have to check the *NOT FOUND* case, and that implies
> extra branching (if(...)) on _each level_, and maybe reusing
> certain valid values as a "not found" marker (extra
> configuration?).
This should be easy: if something is not a keyword, it is likely 
an identifier. But I agree in general, and probably even in my 
case.

> Branching and testing are the things that kill the speed
> advantage of tries; they are what I overlooked in my previous
> attempt, see std/internal/uni.d.
> The other fault is keeping data and index in separate locations:
> pointer-happy, disjoint node (page) locality is another form of
> the same mistake.
This concern has been bothering me for some time already, and it 
is slightly demotivating, because implementing something the 
wrong way will likely lead to failure. I don't have enough 
experience with the alternatives to know their advantages and 
trade-offs. I'll check your code. I did plan to try table lookup 
instead of branching (sketched below). I guess making my own 
mistakes is necessary anyway.
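
For the record, the kind of thing I mean by table lookup instead 
of branching: a toy ASCII character classifier where one 
unconditional load replaces an if/else chain (the table is built 
once, at compile time).

enum CharClass : ubyte { Other, Letter, Digit, Space }

immutable CharClass[256] charClass = ()
{
    CharClass[256] t; // every entry defaults to Other
    foreach (c; 'a' .. 'z' + 1) t[c] = CharClass.Letter;
    foreach (c; 'A' .. 'Z' + 1) t[c] = CharClass.Letter;
    foreach (c; '0' .. '9' + 1) t[c] = CharClass.Digit;
    foreach (c; " \t\r\n") t[c] = CharClass.Space;
    return t;
}();

// replaces: if (c >= 'a' && c <= 'z' || c >= 'A' && ...) else if ...
CharClass classify(char c)
{
    return charClass[c]; // one load, no branches
}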

>> Will compile-time generation of lookup code based on tries be 
>> supported?
>> Example which is currently in DCT (first implemented by Brian 
>> Schott in
>> his Dscanner project) uses switch statements (which means 
>> lookup linear
>> in number of possible characters at each position).
>
> Nope, the days of linear lookup for switch are over (were there
> even such days?). The compiler always does binary search
> nowadays unless linear is more efficient, e.g. for a small
> number of values (it even weighs which is better and uses a
> combination of the two).
I thought this *might* be the case, but didn't know and hadn't 
checked anywhere. I also wanted to fall back to linear search 
below some empirically chosen small number of items.
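
For reference, the switch style under discussion, boiled down 
(TokenKind is hypothetical): how this compiles is up to the 
implementation, which may pick jump tables, binary search, linear 
tests, or a combination.

enum TokenKind { Identifier, In, InOut, Out }

TokenKind keywordKind(string s)
{
    switch (s)
    {
        case "in":    return TokenKind.In;
        case "inout": return TokenKind.InOut;
        case "out":   return TokenKind.Out;
        default:      return TokenKind.Identifier;
    }
}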

> However, you'd better check the asm code afterwards. The
> compiler is like a nasty stepchild: it will give up on
> generating good old jump tables for any reason it finds
> justifiable. (But it may use a few small jump tables + binary
> search, which could be fine... if not in a tight loop!)
Thanks.

>> A trivial
>> improvement might be using if statements and binary lookup. 
>> (E.g., if
>> there are 26 possible characters used at some position, use 
>> only 5
>> comparisons, not 26).
>>
>
> Moreover, you'd be surprised, but such leap-frog binary search
> loses by a big margin to a _multilayer_ lookup table. (I for one
> was quite shocked back then.)
Thanks again :) Are any numbers or perf. tests available?
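
If not, I may measure it myself with something along these lines 
(std.datetime.stopwatch assumed as the timing API; a single-level 
table stands in for the multilayer one, which only adds index 
loads, not branches; trivial loops invite over-optimization, so 
any numbers would be indicative at best):

import std.datetime.stopwatch : benchmark;
import std.range : assumeSorted;
import std.stdio : writefln;

immutable ushort[] keys = [10, 42, 100, 200, 500, 999]; // sorted

bool viaBinarySearch(ushort k)
{
    return keys.assumeSorted.contains(k); // the branchy contender
}

immutable bool[1024] table = ()
{
    bool[1024] t;
    foreach (k; keys) t[k] = true;
    return t;
}();

bool viaTable(ushort k) { return table[k]; } // one unconditional load

size_t sink; // module scope: keeps the results observable

void main()
{
    auto r = benchmark!(
        { foreach (ushort k; 0 .. 1024) sink += viaBinarySearch(k); },
        { foreach (ushort k; 0 .. 1024) sink += viaTable(k); })(10_000);
    writefln("binary search: %s\ntable lookup:  %s (sink: %s)",
             r[0], r[1], sink);
}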

>> I wanted to analyse your regex implementation, but that's not 
>> an easy
>> task and requires a lot of effort...
>
> Yeah, sorry for some encrypted Klingon here and there ;)
>
>>It looks like the most promising
>> alternative to binary trie lookup which I described in previous
>> paragraph. Similarities and differences with your regex design 
>> might
>> also help us understand tries better.
>
> Well if you are in no rush you might as well just take my 
> latest development in the ways of Trie from my gsoc fork :)
> Skipping some gory days of try and trial (fail) in the process 
> ;)
OK

