Interesting performance data point

Tue Dec 31 15:36:31 UTC 2024

As I've mentioned in previous messages, I've ported my personal 
finance package from C to D, having first ported some of it to 
Rust until I just couldn't stand it anymore.

One of the utilities that exists in both the D and Rust versions 
reads .csv files downloaded from American Express and loads the 
transactions into the Sqlite database that contains my financial 
data, trying to assign an expense account to each incoming 
transaction by fuzzy-comparing the transaction's description to 
existing transactions, using an algorithm based on Levenshtein 
distance. The Levenshtein calculation is done using a 
user-defined Sqlite function that is loaded as an extension.
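
For anyone unfamiliar with the metric, here is a minimal sketch of 
plain Levenshtein distance in C. It is the textbook 
dynamic-programming version, not the function from my extension 
(which is only "based on" it); the name levenshtein and the 
byte-wise comparison (no Unicode awareness) are assumptions made 
for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Byte-wise Levenshtein distance between two NUL-terminated strings:
 * the minimum number of single-byte insertions, deletions and
 * substitutions needed to turn a into b.
 * Keeps only one row of the DP table, so memory use is O(strlen(b)).
 * Returns (size_t)-1 if the row cannot be allocated.
 */
size_t levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    /* Row 0: turning "" into the first j bytes of b takes j inserts. */
    for (size_t j = 0; j <= lb; j++)
        row[j] = j;

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];          /* d[i-1][j-1] for j == 1 */
        row[0] = i;                    /* first i bytes of a -> "" */
        for (size_t j = 1; j <= lb; j++) {
            size_t above = row[j];     /* d[i-1][j], about to be overwritten */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = min3(above + 1,       /* delete a[i-1] */
                          row[j - 1] + 1,  /* insert b[j-1] */
                          diag + cost);    /* match or substitute */
            diag = above;
        }
    }

    size_t distance = row[lb];
    free(row);
    return distance;
}
```

The cost is proportional to the product of the two string lengths, 
so running it against every candidate description adds up quickly.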

What I've found is that the D version of this utility, compiled 
with DMD, is about twice as fast as the Rust version while 
producing identical results. While I haven't done detailed enough 
measurements to explain the performance disparity with certainty, 
I've done enough to know that both versions spend most of their 
time in the Levenshtein distance function.

But I have a theory that I think is the likely explanation. And 
if I'm correct, it highlights one of D's strongest points -- the 
ability to call C libraries directly, without the need for an 
elaborate interface layer.

What I think is going on is that rusqlite, the crate that is 
Rust's primary Sqlite interface package, does not provide a way 
to step through the results of a select query, as the Sqlite 
library itself does, stopping when you are happy. Instead, you 
run the 'query' method (or one of its variants) on a prepared 
statement, which either returns an iterator for you to access all 
the returned rows or calls a closure to process each row. This 
difference matters when each row involves an expensive 
calculation.

In my case, I want the most recent transaction that meets the 
Levenshtein distance criterion, which will be the first row in 
the result set, since I order them by post-date descending. In D, 
I am able to step the match query and either I get a row or I 
don't. If I do, I stop, use that transaction's expense account 
and I'm done. The entire result set is not computed. In Rust, 
rusqlite computes the entire result set, which is expensive due 
to the Levenshtein calculation, and then hands it to me row by 
row.

It is not a simple matter to convince Sqlite to restrict the 
result set to the most recent row. 'limit 1' makes no difference 
in the Rust application's performance (I tried it). Apparently 
Sqlite applies 'limit' after computing the result set. There 
*may* be a way to do this using Sqlite's windowing capability, 
but that's a bit of a research project that I have no inclination 
to take on.

I have also not found a Rust crate that provides step-level 
control over Sqlite *and* lets you load extensions.

I think this illustrates a strength of D that not enough people 
understand -- the ability to talk directly and 
easily to the C world. People complain that D doesn't have a rich 
set of libraries. It doesn't need one; all the C libraries are 
almost as easily accessible from D as they are from C or C++. And 
this has gotten even easier with the advent of ImportC, which I 
think is a very important addition to D and worth continued 
development to hide the craziness in C header files.

In my case, in D, I can use a straightforward query and have the 
same simple interaction with Sqlite that I would have in C. There 
may be a way to match D's performance in this case with Rust, but 
it would require effort, perhaps a lot. This is typical of the 
Rust experience compared to D. Things are just more difficult, 
mainly because the user plays a bigger role in memory management 
in Rust than in languages, like D, that provide a GC (I simply do 
not understand the anti-GC religious fanatics, especially when we 
are talking about ordinary applications on today's multi-GHz 
hardware with huge amounts of memory). D's performance is 
comparable (except in the case of the AMEX utility, where it is a 
lot better) and the code is more readable. Unfortunately, people 
jump on bandwagons mindlessly.