Splitting up large dirty file

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri May 18 01:48:10 UTC 2018


On Thursday, May 17, 2018 21:10:35 Dennis via Digitalmars-d-learn wrote:
> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > For various reasons, that doesn't always hold true like it
> > should, but pretty much all of Phobos is written with that
> > assumption and will generally throw an exception if it isn't.
>
> It's unfortunate that Phobos tells you 'there's problems with the
> encoding' without providing any means to fix it or even diagnose
> it. The UTFException doesn't contain what the character in
> question was. You just have to abort whatever you were trying to
> do.

UTFException has a sequence member and a len member (which appear to be
public but undocumented) which should contain the invalid sequence of code
units. In general though, exceptions aren't a great way to deal with this
problem. I think that you either want to call decode manually (in which
case, you have direct access to where the invalid Unicode is and have the
freedom to deal with it however is appropriate) or use the Unicode
replacement character (which std.utf.decode supports, but it's not what's
used by default). Really, what's biting you here is the auto-decoding. With
Phobos, you have to fight to have it not happen by doing stuff like
special-casing your code for strings or using std.string.representation or
std.utf.byCodeUnit.
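
As a rough sketch of what calling decode manually could look like (the
helper name invalidOffsets is made up for illustration, and skipping exactly
one code unit after a failure is just one possible recovery policy):

```d
import std.range : walkLength;
import std.utf : byCodeUnit, decode, UTFException;

/// Hypothetical helper: returns the offsets of the code units at which
/// std.utf.decode refuses to decode.
size_t[] invalidOffsets(const(char)[] text)
{
    size_t[] bad;
    size_t i = 0;
    while (i < text.length)
    {
        immutable start = i;
        try
        {
            decode(text, i); // on success, i is advanced past one code point
        }
        catch (UTFException)
        {
            bad ~= start;  // we know exactly where the problem is
            i = start + 1; // assumption: skip one code unit and try again
        }
    }
    return bad;
}

unittest
{
    immutable ubyte[] raw = [0x61, 0xFF, 0x62]; // 'a', an invalid byte, 'b'
    auto text = cast(const(char)[]) raw;

    auto bad = invalidOffsets(text);
    assert(bad.length == 1 && bad[0] == 1);

    // byCodeUnit iterates the code units as-is, so nothing is decoded
    // and nothing throws.
    assert(text.byCodeUnit.walkLength == 3);
}
```

At the point where the exception is caught, you can log the offending
offset, replace the byte, or drop it, whichever fits what the program is
doing.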

In principle, the ideal way to handle Unicode would be to validate all
character data when it enters the program (doing whatever is appropriate
with invalid Unicode at that point), so that the rest of the program is
either always dealing with valid Unicode or dealing with integral values
that it doesn't treat as Unicode (e.g. ubyte[]). But the way that Phobos is
written, it ends up decoding and validating all over the place.
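
A minimal sketch of validating at the boundary, assuming the data arrives as
raw bytes (tryAsText is a hypothetical name; std.utf.validate is the actual
Phobos function doing the check):

```d
import std.utf : validate, UTFException;

/// Hypothetical boundary check: if raw is valid UTF-8, exposes it as text
/// and returns true. Otherwise returns false, and the caller keeps
/// treating the data as plain bytes.
bool tryAsText(const(ubyte)[] raw, out const(char)[] text)
{
    auto candidate = cast(const(char)[]) raw;
    try
    {
        validate(candidate); // throws UTFException on invalid UTF-8
    }
    catch (UTFException)
    {
        return false;
    }
    text = candidate;
    return true;
}

unittest
{
    const(char)[] text;
    const(ubyte)[] good = [0x68, 0x69]; // "hi"
    const(ubyte)[] bad = [0x68, 0xFF];  // 'h' followed by an invalid byte

    assert(tryAsText(good, text) && text == "hi");
    assert(!tryAsText(bad, text));
}
```

This only covers the entry point. Range-based Phobos functions will still
auto-decode the resulting string unless you go through byCodeUnit or
representation.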

> On Wednesday, 16 May 2018 at 10:30:34 UTC, Jonathan M Davis wrote:
> > If you're ever dealing with a different encoding (or with
> > invalid Unicode), you really need to use integral types like
> > ubyte
>
> I tried something like byChunk(4096).joiner.splitter(cast(ubyte)
> '\n') but it turns out splitter wants at least a forward range,
> even when the separator is a single element.

Actually, I'm pretty sure that splitter currently requires a random-access
range (even though it should theoretically work with a forward range). I
don't think that it can be made to work with an input range though given how
the range API works - or at least, if it were made to work with it, you'd
have to deal with the fact that popping front on the splitter range would
invalidate anything that had been returned from front. And it would be
difficult to implement it @safely if what gets returned by front is not
completely independent of the splitter range (which means that it needs
save). Basic input ranges in general tend to be extremely limited in what
they can do, which can get really annoying when you deal with stuff like
files or sockets where making it a forward range likely means either reading
it all into memory or having buffers that potentially have to be dup-ed by
each call to save.
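
If you just need to split raw bytes on '\n' without a forward range, one
option is to do the buffering by hand. The following is only a sketch under
my own assumptions (eachLine is a made-up helper, not something in Phobos):
it takes any input range of ubyte chunks, such as what File.byChunk gives
you, and copies only the partial line that crosses a chunk boundary.

```d
import std.algorithm.searching : countUntil;
import std.range.primitives : ElementType, isInputRange;

/// Hypothetical helper: calls sink once per '\n'-separated line. The slice
/// passed to sink may point into the current chunk, so sink must copy it
/// if it wants to keep it.
void eachLine(R)(R chunks, scope void delegate(const(ubyte)[]) sink)
if (isInputRange!R && is(ElementType!R : const(ubyte)[]))
{
    ubyte[] pending; // start of a line that continues in the next chunk
    foreach (chunk; chunks)
    {
        const(ubyte)[] rest = chunk;
        while (true)
        {
            immutable pos = rest.countUntil(cast(ubyte) '\n');
            if (pos < 0)
                break;
            if (pending.length)
            {
                pending ~= rest[0 .. pos];
                sink(pending);
                pending = null;
            }
            else
                sink(rest[0 .. pos]);
            rest = rest[pos + 1 .. $];
        }
        pending ~= rest; // copy, since the chunk buffer may be reused
    }
    if (pending.length)
        sink(pending); // last line without a trailing '\n'
}

unittest
{
    ubyte[][] chunks = [cast(ubyte[]) "ab\nc".dup, cast(ubyte[]) "d\ne".dup];
    const(ubyte)[][] lines;
    eachLine(chunks, (const(ubyte)[] line) { lines ~= line.dup; });

    assert(lines.length == 3);
    assert(cast(const(char)[]) lines[0] == "ab");
    assert(cast(const(char)[]) lines[1] == "cd");
    assert(cast(const(char)[]) lines[2] == "e");
}
```

Note that this runs into exactly the invalidation problem described above:
what sink receives is only good until the next chunk is read, which is why
the test dups each line.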

- Jonathan M Davis