More fun with autodecoding

Thu Aug 9 17:33:07 UTC 2018

On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer 
wrote:
> Not trying to give too much away about the library I'm writing, 
> but the problem I'm trying to solve is parsing out tokens from 
> a buffer. I want to delineate the whole, as well as the parts, 
> but it's difficult to get back to the original buffer once you 
> split and slice up the buffer using phobos functions.

I wonder if there are some parallels in the tsv utilities I 
wrote. The tsv parser is extremely simple, byLine and splitter on 
a char buffer. Most of the tools just iterate the split result in 
order, but a couple do things like operate on a subset of fields, 
potentially reordered. For these a separate structure is created 
that maps back to the original buffer to avoid copying. Likely 
quite simple compared to what you are doing.
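
To make the "map back to the original buffer" part concrete, here 
is a rough sketch of the idea. It is not the actual tsv utilities 
code; FieldSpan and fieldSpans are names made up for this example. 
It relies on splitter returning slices of its input, so offsets 
can be recovered from the slice pointers.

```d
import std.algorithm : splitter;

/// Start and end offsets of one field within the line buffer.
/// (Hypothetical helper type, for illustration only.)
struct FieldSpan
{
    size_t start;
    size_t end;
}

/// Records where each delimited field sits in `line`, without copying.
FieldSpan[] fieldSpans(const(char)[] line, char delim)
{
    FieldSpan[] spans;
    foreach (field; line.splitter(delim))
    {
        // Each field is a slice of `line`, so pointer arithmetic
        // recovers its offset in the original buffer.
        immutable size_t start = field.ptr - line.ptr;
        spans ~= FieldSpan(start, start + field.length);
    }
    return spans;
}

unittest
{
    auto line = "red\tgreen\tblue";
    auto spans = fieldSpans(line, '\t');

    // Pick the third and first fields, reordered, as slices of the
    // original line.
    assert(line[spans[2].start .. spans[2].end] == "blue");
    assert(line[spans[0].start .. spans[0].end] == "red");
}
```

This should run with something like `dmd -unittest -main -run 
sketch.d`. Once the spans are recorded, any subset of fields can 
be read in any order straight from the original buffer.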

The csv2tsv tool may be more interesting. Parsing is relatively 
simple, mostly identifying field values in the context of CSV 
escape syntax. It's modeled as reading an infinite stream of 
utf-8 characters, byte-by-byte. Occasionally the bytes forming 
the value need to be modified due to the escape syntax, but most 
of the time the characters in the original buffer remain 
untouched and parsing is identifying the start and end positions.

The infinite stream is constructed by reading fixed size blocks 
from the input stream and concatenating them with joiner. This 
eliminates the need to worry about utf-8 characters spanning 
block boundaries, but it comes at a cost: either write 
byte-at-a-time, or make an extra copy (also byte-at-a-time). 
Making an extra copy is faster, so that's what the code does. 
But, as a practical matter, large blocks could usually be written 
directly from the original input buffer.
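
A small sketch of the joiner approach, using in-memory blocks as 
a stand-in for the input stream (with real input the blocks would 
come from something like File.byChunk). This is an illustration 
rather than the csv2tsv code; flatten is a made-up name.

```d
import std.algorithm : joiner;
import std.array : array;
import std.utf : validate;

/// Flattens fixed-size blocks into one byte stream and collects it.
/// (Hypothetical helper, for illustration only.)
ubyte[] flatten(ubyte[][] blocks)
{
    // joiner presents the blocks as one continuous range of ubyte,
    // so the consumer never sees a block boundary. The price is
    // that elements are copied out one byte at a time.
    return blocks.joiner.array;
}

unittest
{
    // "é" is the two bytes 0xC3 0xA9 in UTF-8; here they land in
    // different blocks.
    ubyte[][] blocks = [
        [0x63, 0x61, 0x66, 0xC3],  // "caf" + first byte of "é"
        [0xA9, 0x21],              // second byte of "é" + "!"
    ];

    auto bytes = flatten(blocks);
    auto text = cast(const(char)[]) bytes;
    validate(text);                // throws if the UTF-8 is malformed
    assert(text == "café!");
}
```

The split character comes out intact because the consumer only 
ever sees the joined stream. What is lost is any knowledge of 
where the original blocks started and ended.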

If I wanted to make it faster than it is now, I'd do this. But I 
don't see an easy way to do this with phobos ranges. At minimum 
I'd have to be able to run code when the joiner operation hits 
block boundaries. And it'd also be necessary to create a mapping 
back to the original input buffer.

Autodecoding comes into play of course. Basically, splitter on 
char arrays is fine, but in a number of cases it's necessary to 
work with ubyte to avoid the performance penalty.
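
For illustration, a minimal sketch of the ubyte route, assuming 
an ASCII delimiter (UTF-8 never uses ASCII byte values inside a 
multi-byte sequence, so a byte-level split is safe in that case). 
splitBytes is a made-up name, not something from the tools.

```d
import std.algorithm : splitter;
import std.range : ElementType;
import std.string : representation;

/// Splits on `delim` at the byte level, returning slices of `line`.
/// (Hypothetical helper; assumes `delim` is an ASCII character.)
const(char)[][] splitBytes(const(char)[] line, char delim)
{
    const(char)[][] fields;
    // representation views the same memory as ubyte, with no copy,
    // so the range primitives do no UTF-8 decoding.
    foreach (field; line.representation.splitter(cast(ubyte) delim))
        fields ~= cast(const(char)[]) field;
    return fields;
}

unittest
{
    const(char)[] line = "naïve\tcafé";

    // As a range, a char array has dchar elements (autodecoding);
    // its ubyte representation does not.
    static assert(is(ElementType!(typeof(line)) == dchar));
    static assert(is(ElementType!(typeof(line.representation))
                     == const(ubyte)));

    auto fields = splitBytes(line, '\t');
    assert(fields.length == 2);
    assert(fields[0] == "naïve");
    assert(fields[1] == "café");
    // The fields are still slices of the original buffer.
    assert(fields[0].ptr == line.ptr);
}
```

The casts back to char happen only where text is actually needed, 
and the fields still point into the original line.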

--Jon