eBay's TSV Utilities status update

Jon Degenhardt jond at noreply.com
Mon Apr 29 15:23:47 UTC 2019


An update on changes to this tool-set over the last year.

For those not familiar, tsv-utils are a set of command-line tools 
for manipulating large tabular data files: files of numeric and 
text data common in machine learning and data mining environments. 
They cover filtering, statistics, sampling, joins, and more. The tools are 
intended for large files, larger than ideal for loading in-memory 
in tools like R or Pandas, but not so big as to necessitate 
moving to distributed compute environments. The tools are quite 
fast, the fastest of their kind available.

Besides being real tools, tsv-utils have also provided an 
environment for exploring the D programming language and the D 
ecosystem.

In the past year there have been two main areas of work.

One area is the sampling and shuffling facilities provided by the 
tsv-sample program. New sampling methods are available and 
performance has been improved. tsv-sample is very similar to the 
excellent GNU shuf tool, but supports sampling methods not 
available in shuf. Sampling is a rich and diverse area, and the 
tsv-sample code is perhaps the most algorithmically interesting 
in the tool-set.
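
As an illustration of a sampling method that goes beyond plain 
shuffling, here is a minimal Python sketch of weighted reservoir 
sampling using the Efraimidis-Spirakis keying scheme. It is not a 
transcription of the tsv-sample code. The function name, the 0-based 
weight_field parameter, and the choice to skip non-positive weights 
are all assumptions made for this example.

```python
import heapq
import random


def weighted_reservoir_sample(lines, weight_field, k, delim="\t", rng=None):
    """Pick up to k lines without replacement, favoring larger weights.

    Each line gets the key u ** (1 / w), with u uniform in [0, 1) and w
    the line's weight; the k lines with the largest keys are kept.
    weight_field is a 0-based column index (an assumption of this sketch).
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, sequence number, line), at most k entries
    for seq, line in enumerate(lines):
        weight = float(line.rstrip("\n").split(delim)[weight_field])
        if weight <= 0:
            continue  # assumption: non-positive weights are never selected
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, seq, line))
        elif key > heap[0][0]:
            # New key beats the smallest key kept so far; swap it in.
            heapq.heapreplace(heap, (key, seq, line))
    # Largest key first, which yields a weighted random ordering.
    return [line for _, _, line in sorted(heap, reverse=True)]
```

Only k lines are held in memory at any time, so the input can be 
streamed once from a file of any size.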

The other main update is improved I/O read performance in many of 
the tools. The gain comes from a buffered version of byLine 
developed for the tools. It is especially effective for skinny 
files (short lines). Most of the tools saw performance gains of 
10-40%.
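
The idea behind buffered line reading can be sketched in a few lines 
of Python. This is only an illustration of the technique, not the D 
implementation; the function name and the buffer_size default are 
assumptions. Input is read in large blocks and split into lines in 
memory, so the number of read calls depends on file size rather than 
on line count, which is why short lines benefit most.

```python
import io


def buffered_by_line(stream, buffer_size=128 * 1024):
    """Yield lines (without the trailing newline) from a binary stream.

    Reads fixed-size blocks and splits them in memory instead of
    issuing one read per line.
    """
    leftover = b""
    while True:
        block = stream.read(buffer_size)
        if not block:
            break
        pieces = (leftover + block).split(b"\n")
        leftover = pieces.pop()  # possibly incomplete final line
        yield from pieces
    if leftover:
        yield leftover  # last line had no terminating newline
```

A typical call would be `for line in buffered_by_line(open(path, "rb")): ...`, 
with each line then split on the tab character.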

One of the earlier performance improvements came from buffering 
output lines. Combined, the line-by-line read-write performance 
is quite a bit faster than what is available in Phobos. The 
iopipe / std.io packages (Steven Schveighoffer, Martin Nowak) are 
faster still; these are the place to go for really high 
performance. (See the links below for a benchmark report.)
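
Output buffering follows the same pattern in the other direction. 
The Python sketch below collects lines in memory and passes them to 
the underlying stream in large chunks. The class name and the 
flush_size default are assumptions for illustration; this is not the 
tsv-utils code.

```python
class BufferedLineWriter:
    """Collect output lines and write them to the stream in large chunks."""

    def __init__(self, stream, flush_size=64 * 1024):
        self._stream = stream
        self._flush_size = flush_size
        self._buf = bytearray()

    def write_line(self, line):
        """Append one line (bytes, without newline) to the buffer."""
        self._buf += line
        self._buf += b"\n"
        if len(self._buf) >= self._flush_size:
            self.flush()

    def flush(self):
        """Write any buffered bytes to the underlying stream."""
        if self._buf:
            self._stream.write(bytes(self._buf))
            self._buf.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # make sure the tail of the output is not lost
        return False
```

Used as a context manager, the writer flushes whatever remains when 
the block exits, so the final partial buffer is always written.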

Links:
* tsv-utils repo: https://github.com/eBay/tsv-utils
* tsv-sample user docs: 
https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
* tsv-sample code docs: 
https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
* Performance benchmarks on line-oriented I/O facilities: 
https://github.com/jondegenhardt/dcat-perf/issues/1