Weka.IO in the news... but not mentioning Dlang... why?

Sat Sep 23 16:09:30 UTC 2017

On 23/09/17 11:57, Suliman wrote:
>> One is a linear database and the other is a filesystem?
>>
>> If that doesn't satisfy you, please describe to me the difference 
>> between D and Microsoft Word, so I know what kind of answer you're 
>> expecting.
>>
> 
> But Hadoop is more look like file system that DataBase...

Hadoop Distributed File System is, sort of, a file system. I don't know 
much about it (just read the Wikipedia page), so I'll try to answer as 
best I understand. Corrections welcome:

Performance:
I have no idea what HDFS's per-node performance numbers are, but there 
are several indications that make me suspect they are not as good as Weka's.

First of all, I don't think a tool written in Java, designed to run over 
another file system and the kernel's networking, has any chance of 
out-performing a tool written in D that directly uses the NVMe and the 
network interface.

Second, the file system seems oriented toward large read-only blobs. As 
a file system, I don't think it has any chance against any dedicated 
POSIX-compliant file system, but I'm guessing you're mostly interested 
in using HDFS as a basis for running Hadoop itself, so that might not 
matter.

Cost:
Here I don't think there is any way for HDFS to compete. That might 
sound strange to some, as HDFS is open source while Weka charges 
licensing fees. The reason I'm saying this is that HDFS uses 
mirroring in order to achieve fault tolerance, while Weka uses RAID (I 
should know - I wrote it).

In short, to get 1PB of usable capacity while tolerating 2 faults you'll 
need 3PB of raw capacity with Hadoop (200% overhead). With 16+2 (16 data 
blocks plus 2 parity blocks per stripe), you'll only need around 1.3PB 
with Weka. Whatever you're paying for the licenses is, in all 
likelihood, going to be less than the cost of the extra hardware.
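
The capacity arithmetic above can be sketched roughly as follows. This 
is only an illustration: the function names are made up for this 
example, and the model counts nothing but mirror copies and parity 
blocks. Parity alone works out to 1.125PB for 16+2, so the "around 
1.3PB" figure presumably includes further overhead (spare capacity, 
metadata and the like) that is not modeled here; that is an assumption, 
not something stated in the post.

```python
def raw_capacity_mirroring(usable_pb: float, faults: int) -> float:
    """Raw capacity needed when every block is stored (faults + 1) times."""
    return usable_pb * (faults + 1)


def raw_capacity_parity(usable_pb: float, data_blocks: int, parity_blocks: int) -> float:
    """Raw capacity needed for a data+parity stripe layout (parity overhead only)."""
    return usable_pb * (data_blocks + parity_blocks) / data_blocks


if __name__ == "__main__":
    usable = 1.0  # PB of usable capacity, as in the example above

    mirrored = raw_capacity_mirroring(usable, faults=2)
    striped = raw_capacity_parity(usable, data_blocks=16, parity_blocks=2)

    # Overhead is the raw capacity beyond the usable capacity, as a percentage.
    print(f"mirroring, 2 faults: {mirrored:.3f} PB raw, "
          f"{(mirrored / usable - 1) * 100:.1f}% overhead")
    print(f"16+2 parity:         {striped:.3f} PB raw, "
          f"{(striped / usable - 1) * 100:.1f}% overhead")
```

Both layouts tolerate two faults; the difference is how much raw 
capacity has to be bought to get there.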

Like I said, corrections are welcome, as I'm not familiar with HDFS or 
Hadoop.

Shachar