[Greylist-users] Hashed indexes

William Blunn bill--greylist at blunn.org
Thu Jul 22 07:21:42 PDT 2004


Was looking at my MySQL database files for my greylisting data.

With 90000 odd records in the table, the MyISAM data file was 13.8 odd
megabytes, and the index file was 13.6 odd megabytes.

I had triple fields like this:

  net24     CHAR(11)
  sender    VARCHAR(255)
  recipient VARCHAR(255)

I had an index which would allow straight-to-single-record-most-of-the-
time hits on the database:

  KEY `triple` (`exact`,`net24`,`sender`(20),`recipient`(16))
  
(I had another field "exact" which was a boolean to identify those
records where all parts of the triple were specified exactly, i.e. no
NULLs.)

This index was coming in with a key object size of 51 bytes.

I thought: The key object is a bit big.  Wouldn't it be nice if, for the
the sender and recipient, we could do a hash of the value, and index on
that?  If the index is smaller, it ought to speed up queries, and/or
make the system more scalable.

Then: wouldn't it be nice if the database supported computed indexes?

Had a sniff round.  Turns out that MySQL (at least up to version 3.23)
does not support computed indexes, nor does it support hash indexes (at
least not on any persistent table formats).

However it turns out that PostgreSQL supports both computed indexes
*and* hash indexes.  Interesting.  Seems that PostgreSQL is more
complete and correct, whereas MySQL is more rough-and-ready but
generally has the edge on performance and supports multiple table
engines.

But I am going with MySQL for the time being, so I figured I could
emulate hashed indexes by adding additional fields for the hashes.  The
hashes will be CRC-32.

I had converted the caller's network from an ASCII dotted-triple string
(11 bytes) down to an int (4 bytes), but I then added support for IPv6
which took the size back up to 8 bytes (biginit).  So it looked like a
good idea to hash the caller's network as well as the sender and
recipient.

(I also rearranged my queries so that I no longer needed the "exact"
field.  This also means that all online queries now use a single index,
and also that index always works over all three parts of the triple,
therefore it always selects zero or one records (except for the rare
occasions when there is a hash collision).)

So the triple becomes:

  net           BIGINT                 # 64 bits (copes with IPv6)
  net_crc       INT          NOT NULL  # 32 bits
  sender        VARCHAR(255)
  sender_crc    INT          NOT NULL  # 32 bits
  recipient     VARCHAR(255)
  recipient_crc INT          NOT NULL  # 32 bits

It doesn't make any odds if we hash the NULL value as zero rather than
NULL. This means we can make all the hashes be "NOT NULL".  This also
makes the key object smaller by 3 bytes.  The key becomes:

  KEY `triple` (`net_crc`,`sender_crc`,`recipient_crc`)

So the key object size is now down to 12 bytes (confirmed with EXPLAIN
SELECT).

The MyISAM data file is now 14 odd megabytes, but the index file is now
only 3.8 megabytes, i.e. 30% of the original size, or 3.3 times smaller.

It seems to work as well as it did before.  Which is nice.

Bill



More information about the Greylist-users mailing list