[Greylist-users] Hashed indexes

William Blunn bill at tao-group.com
Thu Jul 22 07:18:03 PDT 2004


Was looking at my MySQL database files for my greylisting data.

With 90000 odd records in the table, the MyISAM data file was 13.8 odd
megabytes, and the index file was 13.6 odd megabytes.

I had triple fields like this:

  net24     CHAR(11)
  sender    VARCHAR(255)
  recipient VARCHAR(255)

I had an index which would allow straight-to-single-record-most-of-the-
time hits on the database:

  KEY `triple` (`exact`,`net24`,`sender`(20),`recipient`(16))
  
(I had another field "exact" which was a boolean to identify those
records where all parts of the triple were specified exactly, i.e. no
NULLs.)

This index was coming in with a key object size of 51 bytes.

I thought: The key object is a bit big.  Wouldn't it be nice if, for the
the sender and recipient, we could do a hash of the value, and index on
that?  If the index is smaller, it ought to speed up queries, and/or
make the system more scalable.

Then: wouldn't it be nice if the database supported computed indexes?

Had a sniff round.  Turns out that MySQL (at least up to version 3.23)
does not support computed indexes, nor does it support hash indexes (at
least not on any persistent table formats).

However it turns out that PostgreSQL supports both computed indexes
*and* hash indexes.  Interesting.  Seems that PostgreSQL is more
complete and correct, whereas MySQL is more rough-and-ready but
generally has the edge on performance and supports multiple table
engines.

But I am going with MySQL for the time being, so I figured I could
emulate hashed indexes by adding additional fields for the hashes.  The
hashes will be CRC-32.

I had converted the caller's network from an ASCII dotted-triple string
(11 bytes) down to an int (4 bytes), but I then added support for IPv6
which took the size back up to 8 bytes (biginit).  So it looked like a
good idea to hash the caller's network as well as the sender and
recipient.

(I also rearranged my queries so that I no longer needed the "exact"
field.  This also means that all online queries now use a single index,
and also that index always works over all three parts of the triple,
therefore it always selects zero or one records (except for the rare
occasions when there is a hash collision).)

So the triple becomes:

  net           BIGINT                 # 64 bits (copes with IPv6)
  net_crc       INT          NOT NULL  # 32 bits
  sender        VARCHAR(255)
  sender_crc    INT          NOT NULL  # 32 bits
  recipient     VARCHAR(255)
  recipient_crc INT          NOT NULL  # 32 bits

It doesn't make any odds if we hash the NULL value as zero rather than
NULL. This means we can make all the hashes be "NOT NULL".  This also
makes the key object smaller by 3 bytes.  The key becomes:

  KEY `triple` (`net_crc`,`sender_crc`,`recipient_crc`)

So the key object size is now down to 12 bytes (confirmed with EXPLAIN
SELECT).

The MyISAM data file is now 14 odd megabytes, but the index file is now
only 3.8 megabytes, i.e. 30% of the original size, or 3.3 times smaller.

It seems to work as well as it did before.  Which is nice.

Bill

The contents of this e-mail and any attachments are confidential and may
be legally privileged. If you have received this e-mail and you are not
a named addressee, please inform us as soon as possible on
+44 118 901 2999 and then delete the e-mail from your system. If you are
not a named addressee you must not copy, use, disclose, distribute,
print or rely on this e-mail. Any views expressed in this e-mail or any
attachments may not necessarily reflect those of Tao's management.
Although we routinely screen for viruses, addressees should scan this
e-mail and any attachments for viruses. Tao makes no representation or
warranty as to the absence of viruses in this e-mail or any attachments.
Please note that for the protection of our business, we may monitor and
read e-mails sent to and from our server(s).


More information about the Greylist-users mailing list