eBay's TSV Utilities status update
Jon Degenhardt
jond at noreply.com
Mon Apr 29 15:23:47 UTC 2019
An update on changes to this tool-set over the last year.
For those not familiar, tsv-utils are a set of command line tools
for manipulating large tabular data files: files of numeric and
text data common in machine learning and data mining environments.
The tools cover filtering, statistics, sampling, joins, and more.
They are intended for large files, larger than ideal for loading in-memory
in tools like R or Pandas, but not so big as to necessitate
moving to distributed compute environments. The tools are quite
fast, the fastest of their kind available.
Besides being real tools, tsv-utils have also provided an
environment for exploring the D programming language and the D
ecosystem.
In the past year there have been two main areas of work.
One area is the sampling and shuffling facilities provided by the
tsv-sample program. New sampling methods are available and
performance has been improved. tsv-sample is very similar to the
excellent GNU shuf tool, but supports sampling methods not
available in shuf. Sampling is a rich and diverse area, and the
tsv-sample code is perhaps the most algorithmically interesting
in the tool-set.
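
To give a flavor of the kind of algorithm involved, here is a minimal Python sketch of classic reservoir sampling (Algorithm R), which picks k lines uniformly at random from a stream without holding the whole input in memory. This is an illustration only: tsv-sample is written in D, and the function name `reservoir_sample` and its parameters are hypothetical, not taken from the tsv-sample code.

```python
import random

def reservoir_sample(lines, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration; not the tsv-sample implementation.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

Only k lines are kept in memory at any time, which is why this family of methods suits inputs too large to load whole. If the input has fewer than k lines, all of them are returned in their original order.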
The other main update is improved I/O read performance in many of
the tools. This comes from developing a buffered version of byLine.
It is especially effective for skinny files (short lines). Most
of the tools saw performance gains of 10-40%.
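
The general idea behind buffered line reading can be sketched in Python as follows: read the input in large blocks, split each block on newlines, and carry any incomplete trailing line over to the next block. This is a sketch under stated assumptions, not the D code used in tsv-utils; the name `buffered_lines` and the 128 KB default block size are assumptions made for illustration.

```python
import io

def buffered_lines(stream, block_size=128 * 1024):
    """Yield lines (without the trailing newline) from a binary stream.

    Hypothetical helper for illustration; reads block_size bytes at a time
    instead of issuing one read per line.
    """
    leftover = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        parts = (leftover + block).split(b"\n")
        # The last piece may be an incomplete line; hold it for the next block.
        leftover = parts.pop()
        yield from parts
    if leftover:
        # Final line with no terminating newline.
        yield leftover

# Usage: a tiny block size forces lines to span several reads.
data = io.BytesIO(b"a\tb\nc\td\nlast")
rows = list(buffered_lines(data, block_size=3))
```

With short lines, a single block read covers many lines, so the per-line overhead shrinks. That is consistent with the larger gains reported above for skinny files.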
One of the earlier performance improvements came from buffering
output lines. Combined, the line-by-line read-write performance
is quite a bit faster than what is available in Phobos. The
iopipe / std.io packages (Steven Schveighoffer, Martin Nowak) are
faster still; these are the place to go for really high
performance. (See the links below for a benchmark report.)
Links:
* tsv-utils repo: https://github.com/eBay/tsv-utils
* tsv-sample user docs:
https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
* tsv-sample code docs:
https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
* Performance benchmarks on line-oriented I/O facilities:
https://github.com/jondegenhardt/dcat-perf/issues/1