Making generalized Trie type in D

Mon Jun 11 09:43:47 PDT 2012

On Monday, 11 June 2012 at 14:48:27 UTC, Dmitry Olshansky wrote:
> On 11.06.2012 16:11, Roman D. Boiko wrote:
>> On Tuesday, 5 June 2012 at 13:27:12 UTC, Roman D. Boiko wrote:
>>> maybe it is better to avoid immutability... or do bulk ins 
>>> /del before
>>> copy.
>> Would it be difficult to introduce two optimizations:
>> * have strings of all lengths above configurable threshold go 
>> to the same bucket (is it reasonable?)
>
> The sample I listed did just that: for length >= 64 it goes to 
> bucket of 0-length strings. After all it's your prefix 
> function, with any distribution you like ;)
Great

>> * have ability to store a checkpoint, then do a series of 
>> modifications,
>> and store another checkpoint but reuse data which have not 
>> been affected.
>>
> yea, that's a rough sketch of what I want to support.
That's wonderful :)

>> For the second optimization a possible design would be to 
>> store internal state as snapshot + diffs, and apply diffs when 
>> creating another snapshot. Diff format should allow 
>> efficiently performing trie interface operations.
>
> It may be. Diff could be a bunch of pages that are XOR-ed on 
> top of snapshot. Dunno if it's worthwhile trick yet.
It should be, provided that a single insert / delete doesn't 
affect many pages and size of page is reasonably small. Only need 
to create pages corresponding to those which are affected; and 
there is no need to track separate diffs between snapshots, so 
they can be combined.

Another option might be creating a separate trie for insertions 
and tracking deletions somehow, provided that tries can be merged 
efficiently.

Actually, in my case, deletions could be deferred and performed 
in bulk. OTOH, I will need to count how many times a string is 
inserted minus number of times it has been deleted. 
Alternatively, I could just check from time to time which strings 
are not needed any more (micro-GC :) ).

There are many possible data structures, but the one you 
mentioned seems to be the most reasonable.