Error running concurrent process and storing results in array
data pulverizer
data.pulverizer at gmail.com
Fri May 8 13:36:22 UTC 2020
On Thursday, 7 May 2020 at 14:49:43 UTC, data pulverizer wrote:
> After running the Julia code by the Julia community they made
> some changes (using views rather than passing copies of the
> array) and their time has come down to ~ 2.5 seconds. The plot
> thickens.
I've run the Chapel code past the Chapel programming language
people and they've brought the time down to ~ 6.5 seconds. I've
disallowed calling BLAS because I'm looking at the performance of
the programming language implementations rather than their ability
to call other libraries.
So far the times are looking like this:
D: ~ 1.5 seconds
Julia: ~ 2.5 seconds
Chapel: ~ 6.5 seconds
I've been working on the Nim benchmark and have written a small
set of byte-order functions for big- to little-endian conversion
(https://gist.github.com/dataPulverizer/744fadf8924ae96135fc600ac86c7060), which was fun and provides ntoh, hton, and related functions that can be applied to any basic type. Next I'm writing a small matrix type in the same vein as the D matrix type I wrote, and then comes the easy bit: writing the kernel matrix algorithm itself.
In the end I'll run the benchmark on data of various sizes.
Currently I'm just running it on the (10,000 x 784) data set
which outputs a (10,000 x 10,000) matrix. I'll end up running
(5,000 x 784), (10,000 x 784), (20,000 x 784), (30,000 x 784),
(40,000 x 784), (50,000 x 784), and (60,000 x 784). Ideally I'd
measure each one 100 times and plot confidence intervals, but I'll
have to settle for measuring each one 3 times and taking an
average, otherwise it would take too much time. I don't think that
D will have it all its own way across the data sizes; from what I
can see, Julia may do better at the largest data set, and maybe
SIMD will be a factor there.
The data set sizes are not randomly chosen. In most common data
science tasks (maybe > 90% of what data scientists currently work
on), people work with data sets in this range or even smaller; the
big data stuff is much less common unless you're working for
Google (or another FANG) or a specialist startup. I remember
running a kernel clustering job in one of the commonly used "data
science" languages (none of which I'm benchmarking here): it still
wasn't done after an hour, and then it hung and crashed. I
implemented the same thing in Julia and it was done in a minute.
Calculating kernel matrices is the cornerstone of many
kernel-based machine learning methods: kernel PCA, kernel
clustering, SVMs, and so on. It's a pretty important thing to
calculate and shows the potential of these languages in the data
science field. I think an article like this is useful for people
who implement numerical libraries. I'm also hoping to throw in C++
by way of comparison.
More information about the Digitalmars-d-learn
mailing list