Making generalized Trie type in D

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Jun 4 02:46:04 PDT 2012


Cross-posting from my GSOC list as I think we had a lack of D rox posts 
lately :)

I rarely am surprised by generic code flexibility and power these days.
But once I fleshed out last bits of generalized Trie I was amazed.

In short: it allows something like "make a multistage integer lookup 
table using these x upper bits, these y bits in the middle and those 
last z bits as final offset" as one-liner. Better yet strings and other 
keys are just as easy, and even better it's up to user to allow custom 
ways of getting X bits for any key.
And while I writing this bit-packing is on it's way to get integrated 
thus obtaining ultra-small tables (we can pack even index entries since 
we know how much bits they take exactly).

And last but no least - it's easy to strap Trie on top of some other 
container (say arrays, sets, hashtable etc.).  (currently it doesn't 
play nice with GC-ed objects, I'll fix it later)
Say book catalog that goes by first latter (or two) as Trie stage then 
as final slot uses sorted array. The advantages  of such schemes is that 
re-sorting huge arrays  can get costly, moreover insertion can 
reallocate and that might even run out of memory while duping it. On the 
other hand smaller array keep sorting bounded at acceptable level.

(In fact hashtable is a degenerate case of 1-level Trie on top of e.g. 
List, just not space-optimized and with a weird  index transform).

To give you a taste of this power, consider a simple example - keyword 
recognizer. I'm showing two version first just checks if it's a keyword. 
Second one does catalogs it to specific kind.
(sample material taken from DCT project, thanks goes to Roman Boiko)

enum keywords = [
             "abstract",
             "alias",
             "align",
             //...  all of them, except @ ones
             "__traits"
     ];

//here is bare minumum set type usable for Trie
  struct SmallMap(size_t N, V, K)
     {
         void insert(Tuple!(V, K) t){ _set.insert(t); }

         V opBinaryRight(string op, T)(T key)
             if(op == "in")
         {
             auto idx = map!"a[1]"(_set.items[]).countUntil(key);
             return idx < 0 ? V.init : _set.items[idx][0];
         }
         private:
             SmallSet!(N, Tuple!(V, K)) _set;
     }
//define how we get bits for our index
    size_t useLength(T)(T[] arr)
     {
         return arr.length > 63 ? 0 : arr.length; //need max length, 64 
- 6bits
     }

     template useLastItem(T)
     {
         size_t entity(in T[] arr){ return arr[$-1]; }
         enum bitSize = 8*T.sizeof;
     }
     //... useItemAt is similar


//now the usage:
auto keyTrie = Trie!(SetAsSlot!(SmallSet!(2,string))
                          , string
                          , assumeSize!(6, useLength)
                          , useItemAt!(0, char)
                          , useLastItem!(char)
                         )(keywords);

foreach(key; keywords)
//opIndex gives us a slot, here slot is set so test it with 'in'
         assert( key in keyTrie[key]);


And that's it. Importantly there is not a single bit of hardcoding here:
     - we choose to make Trie of sets by passing SetAsSlot for value 
type (simple Trie may just use say bool)
     - key type is string oviously
     - assumeSize takes a user define functor and "trusts" user that it 
is result indeed fits into X bit wide integer
     - the the table as constructed goes by levels: by length -> by 
first char -> by last char -> <small set>

As you can observe "alias" and "align" do collide after 3 levels passed, 
thus we go for fixed 2-slot sets at the end to resolve it and a few others.
Of course there are better ways to handle level that would prevent this 
"collision" and reduce size, it's just to show how Trie on top of Set 
works. (not to mention all keywords are ASCII alphas, hinting on better 
compression in say 6bits)

And here is how it can work with Maps as slot (the above was mapping to 
booleans in fact):

auto keywordsMap = [
             "abstract" : TokenKind.Abstract,
             "alias" : TokenKind.Alias,
             "align" : TokenKind.Align,
             //... all others, cataloged per their kind
             "__thread" : TokenKind._Thread,
             "__traits" : TokenKind._Traits,
     ];

//now the usage:
auto keyTrie = Trie!(SetAsMap!(SmallMap!(2, TokenKind, string), 
TokenKind, string)
//duplication is because not every Map type Key-Value can be derived (we 
have no standard protocol for AAs)
                          , string
                          , assumeSize!(6, useLength)
                          , useItemAt!(0, char)
                          , useLastItem!(char)
                         )(keywordsMap);

  foreach(k,v; keywordsMap)
         assert((k in keyTrie2[k]) == v);

And the only big change is the data type :

struct SmallMap(size_t N, V, K)
     {
         void insert(Tuple!(V, K) t){ _set.insert(t); }

         V opBinaryRight(string op, T)(T key)
             if(op == "in")
         {
             auto idx = map!"a[1]"(_set.items[]).countUntil(key);
             return idx < 0 ? V.init : _set.items[idx][0];
         }
         private:
             SmallSet!(N, Tuple!(V, K)) _set;
     }


P.S. Sorry that the code for Trie itself isn't presented, but you can 
find the the whole thing in my Phobos fork:
https://github.com/blackwhale/phobos/blob/gsoc-uni/std/uni.d

-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list