Updates to the tsv-utils toolkit

Jon Degenhardt via Digitalmars-d-announce digitalmars-d-announce at puremagic.com
Sat Mar 4 11:48:21 PST 2017


On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
wrote:
> It's not quite a year since the open-sourcing of eBay's tsv 
> utilities. Since then there have been a number of additions and 
> updates, and the tools form a more complete package. The tools 
> assist with manipulation of tabular data files common in 
> machine learning and data mining environments. They work 
> alongside traditional Unix command line tools like 'cut' and 
> 'sort'. They also fit well with data mining and stats packages 
> like R and Pandas.
>
> The tools include filtering, slicing, joins and other 
> manipulation, sampling, and statistical calculations. If you 
> find yourself working with large data files from a Unix shell, 
> you may like these tools.
>
> Speed matters when processing large data files, and these tools 
> are fast. I've published new benchmarks comparing the tools to 
> similar tools written in several native compiled programming 
> languages. The tools are the fastest on five of the six 
> benchmarks run, generally by significant margins. It's a good 
> result for the D programming language. The benchmarks may be of 
> interest regardless of your interest in the tools themselves.
>
> Repository: https://github.com/eBay/tsv-utils-dlang
> Performance benchmarks: 
> https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md
>
> --Jon

One more update: Schveiguy helped identify the performance 
bottleneck in the csv2tsv tool, and the tools are now the fastest 
on all six benchmarks. The benchmarks have been updated (and 
reformatted a bit). Summary table here: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md#top-four-in-each-benchmark

