More fun with autodecoding
Jon Degenhardt
jond at noreply.com
Thu Aug 9 17:33:07 UTC 2018
On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer
wrote:
> Not trying to give too much away about the library I'm writing,
> but the problem I'm trying to solve is parsing out tokens from
> a buffer. I want to delineate the whole, as well as the parts,
> but it's difficult to get back to the original buffer once you
> split and slice up the buffer using phobos functions.
I wonder if there are some parallels in the tsv utilities I
wrote. The tsv parser is extremely simple, byLine and splitter on
a char buffer. Most of the tools just iterate the split result in
order, but a couple do things like operate on a subset of fields,
potentially reordered. For these a separate structure is created
that maps back to the original buffer to avoid copying. Likely
quite simple compared to what you are doing.
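To make the "maps back to the original buffer" idea concrete, here is a
minimal sketch. It is not the actual tsv utilities code; selectFields and
its parameters are hypothetical names made up for illustration. It relies
on splitter returning slices of the input line, so the selected fields
can be stored in caller-chosen order without copying any field data.

```d
import std.algorithm : splitter;
import std.stdio : writeln;

/// Hypothetical helper: fills `slots` with slices of `line` for the
/// zero-based field indices in `wanted`, in the order requested.
/// No field data is copied. Returns false if a wanted field is missing.
bool selectFields(const(char)[] line, const size_t[] wanted,
                  const(char)[][] slots)
{
    assert(slots.length == wanted.length);
    size_t found = 0;
    size_t fieldIndex = 0;
    foreach (field; line.splitter('\t'))
    {
        foreach (slot, w; wanted)
        {
            if (w == fieldIndex)
            {
                slots[slot] = field;   // a slice into `line`, not a copy
                ++found;
            }
        }
        ++fieldIndex;
    }
    return found == wanted.length;
}

void main()
{
    string line = "red\tgreen\tblue\tyellow";
    size_t[] wanted = [3, 1];              // subset of fields, reordered
    auto slots = new const(char)[][](2);

    assert(selectFields(line, wanted, slots));
    assert(slots[0] == "yellow" && slots[1] == "green");
    // The slot points into the original buffer ("yellow" starts at byte 15).
    assert(slots[0].ptr == line.ptr + 15);
    writeln(slots);
}
```

The slices are only valid while the line buffer is, which matters with
byLine since it reuses its buffer between iterations.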
The csv2tsv tool may be more interesting. Parsing is relatively
simple, mostly identifying field values in the context of CSV
escape syntax. It's modeled as reading an infinite stream of
utf-8 characters, byte-by-byte. Occasionally the bytes forming
the value need to be modified due to the escape syntax, but most
of the time the characters in the original buffer remain
untouched and parsing is identifying the start and end positions.
The infinite stream is constructed by reading fixed size blocks
from the input stream and concatenating them with joiner. This
eliminates the need to worry about utf-8 characters spanning
block boundaries, but it comes at a cost: either write
byte-at-a-time, or make an extra copy (also byte-at-a-time).
Making an extra copy is faster, so that's what the code does. But, as
a practical matter, large blocks could usually be written directly
from the original input buffer.
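A small sketch of the block-joining idea, under stated assumptions: the
blocks are simulated in memory with chunks rather than read from a file
(with a real File the equivalent would be something like
byChunk(blockSize).joiner), and the 3-byte block size is chosen only so
that a two-byte UTF-8 character straddles a block boundary.

```d
import std.algorithm : equal, joiner;
import std.range : chunks, walkLength;
import std.string : representation;

void main()
{
    // "a,é,b\n" as raw UTF-8: 7 bytes, with é occupying bytes 2 and 3.
    immutable(ubyte)[] input = "a,\u00E9,b\n".representation;
    assert(input.length == 7);

    // Stand-in for fixed-size blocks read from an input stream.
    // With 3-byte blocks, é is split across the first two blocks.
    auto blocks = input.chunks(3);
    assert(blocks.walkLength == 3);

    // joiner presents the blocks as one continuous range of bytes, so
    // the consumer never sees the block boundaries.
    auto stream = blocks.joiner;
    assert(stream.equal(input));
}
```

The flip side is visible here too: the joined range hands out one byte
at a time, and nothing in it exposes the underlying block for bulk
writes.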
If I wanted to make it faster than it is now, I'd do this. But I
don't see an easy way to do this with phobos ranges. At minimum
I'd have to be able to run code when the joiner operation hits
block boundaries. And it'd also be necessary to create a mapping
back to the original input buffer.
Autodecoding comes into play of course. Basically, splitter on
char arrays is fine, but in a number of cases it's necessary to
work using ubyte to avoid the performance penalty.
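As a rough illustration of the difference (a sketch, not code from the
tools; countFields is a hypothetical helper), std.string.representation
reinterprets a string as immutable(ubyte)[] without copying, after which
range algorithms see bytes rather than decoded code points:

```d
import std.algorithm : splitter;
import std.range : walkLength;
import std.string : representation;

/// Hypothetical helper: counts tab-separated fields by scanning raw bytes.
size_t countFields(string line)
{
    return line.representation.splitter(cast(ubyte) '\t').walkLength;
}

void main()
{
    string line = "abc\td\u00E9f\tghi";   // é is two bytes in UTF-8

    // Same memory, different element type. No copy is made.
    immutable(ubyte)[] bytes = line.representation;
    assert(cast(immutable(char)*) bytes.ptr == line.ptr);

    // Iterating the string as a range decodes to code points;
    // iterating the bytes does not.
    assert(line.walkLength == 11);
    assert(bytes.walkLength == 12);

    assert(countFields(line) == 3);

    // Fields found at the byte level can be viewed as text again with a
    // cast, since they are still slices of the original UTF-8 buffer.
    foreach (field; bytes.splitter(cast(ubyte) '\t'))
    {
        string text = cast(string) field;
        assert(text.length == field.length);
    }
}
```

Splitting on an ASCII delimiter at the byte level is safe for UTF-8
input because bytes below 0x80 never occur inside a multi-byte sequence.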
--Jon