Making generalized Trie type in D
Dmitry Olshansky
dmitry.olsh at gmail.com
Mon Jun 4 02:46:04 PDT 2012
Cross-posting from my GSOC list as I think we had a lack of D rox posts
lately :)
I rarely am surprised by generic code flexibility and power these days.
But once I fleshed out last bits of generalized Trie I was amazed.
In short: it allows something like "make a multistage integer lookup
table using these x upper bits, these y bits in the middle and those
last z bits as final offset" as one-liner. Better yet strings and other
keys are just as easy, and even better it's up to user to allow custom
ways of getting X bits for any key.
And while I writing this bit-packing is on it's way to get integrated
thus obtaining ultra-small tables (we can pack even index entries since
we know how much bits they take exactly).
And last but no least - it's easy to strap Trie on top of some other
container (say arrays, sets, hashtable etc.). (currently it doesn't
play nice with GC-ed objects, I'll fix it later)
Say book catalog that goes by first latter (or two) as Trie stage then
as final slot uses sorted array. The advantages of such schemes is that
re-sorting huge arrays can get costly, moreover insertion can
reallocate and that might even run out of memory while duping it. On the
other hand smaller array keep sorting bounded at acceptable level.
(In fact hashtable is a degenerate case of 1-level Trie on top of e.g.
List, just not space-optimized and with a weird index transform).
To give you a taste of this power, consider a simple example - keyword
recognizer. I'm showing two version first just checks if it's a keyword.
Second one does catalogs it to specific kind.
(sample material taken from DCT project, thanks goes to Roman Boiko)
enum keywords = [
"abstract",
"alias",
"align",
//... all of them, except @ ones
"__traits"
];
//here is bare minumum set type usable for Trie
struct SmallMap(size_t N, V, K)
{
void insert(Tuple!(V, K) t){ _set.insert(t); }
V opBinaryRight(string op, T)(T key)
if(op == "in")
{
auto idx = map!"a[1]"(_set.items[]).countUntil(key);
return idx < 0 ? V.init : _set.items[idx][0];
}
private:
SmallSet!(N, Tuple!(V, K)) _set;
}
//define how we get bits for our index
size_t useLength(T)(T[] arr)
{
return arr.length > 63 ? 0 : arr.length; //need max length, 64
- 6bits
}
template useLastItem(T)
{
size_t entity(in T[] arr){ return arr[$-1]; }
enum bitSize = 8*T.sizeof;
}
//... useItemAt is similar
//now the usage:
auto keyTrie = Trie!(SetAsSlot!(SmallSet!(2,string))
, string
, assumeSize!(6, useLength)
, useItemAt!(0, char)
, useLastItem!(char)
)(keywords);
foreach(key; keywords)
//opIndex gives us a slot, here slot is set so test it with 'in'
assert( key in keyTrie[key]);
And that's it. Importantly there is not a single bit of hardcoding here:
- we choose to make Trie of sets by passing SetAsSlot for value
type (simple Trie may just use say bool)
- key type is string oviously
- assumeSize takes a user define functor and "trusts" user that it
is result indeed fits into X bit wide integer
- the the table as constructed goes by levels: by length -> by
first char -> by last char -> <small set>
As you can observe "alias" and "align" do collide after 3 levels passed,
thus we go for fixed 2-slot sets at the end to resolve it and a few others.
Of course there are better ways to handle level that would prevent this
"collision" and reduce size, it's just to show how Trie on top of Set
works. (not to mention all keywords are ASCII alphas, hinting on better
compression in say 6bits)
And here is how it can work with Maps as slot (the above was mapping to
booleans in fact):
auto keywordsMap = [
"abstract" : TokenKind.Abstract,
"alias" : TokenKind.Alias,
"align" : TokenKind.Align,
//... all others, cataloged per their kind
"__thread" : TokenKind._Thread,
"__traits" : TokenKind._Traits,
];
//now the usage:
auto keyTrie = Trie!(SetAsMap!(SmallMap!(2, TokenKind, string),
TokenKind, string)
//duplication is because not every Map type Key-Value can be derived (we
have no standard protocol for AAs)
, string
, assumeSize!(6, useLength)
, useItemAt!(0, char)
, useLastItem!(char)
)(keywordsMap);
foreach(k,v; keywordsMap)
assert((k in keyTrie2[k]) == v);
And the only big change is the data type :
struct SmallMap(size_t N, V, K)
{
void insert(Tuple!(V, K) t){ _set.insert(t); }
V opBinaryRight(string op, T)(T key)
if(op == "in")
{
auto idx = map!"a[1]"(_set.items[]).countUntil(key);
return idx < 0 ? V.init : _set.items[idx][0];
}
private:
SmallSet!(N, Tuple!(V, K)) _set;
}
P.S. Sorry that the code for Trie itself isn't presented, but you can
find the the whole thing in my Phobos fork:
https://github.com/blackwhale/phobos/blob/gsoc-uni/std/uni.d
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list