Splitting up large dirty file

Mon May 21 17:42:19 UTC 2018

On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
> On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
> > It's unfortunate that Phobos tells you 'there's problems with
> > the encoding' without providing any means to fix it or even
> > diagnose it.
>
> I have to take that back since I found out about std.encoding
> which has functions like `sanitize`, but also `transcode`. (My
> file turned out to actually be encoded with ANSI / Windows-1252,
> not UTF-8)
> Documentation is scarce however, and it requires strings instead
> of forward ranges.
>
> @Jon Degenhardt
>
> > Instead of:
> >      auto outputFile = new File("output.txt");
> >
> > try:
> >     auto outputFile = File("output.txt", "w");
>
> Wow I really butchered that code. So it is the `drop(4)` that
> triggers the UTFException?

drop is range-based, so if you give it a string, it's going to decode
because of the whole auto-decoding mess with std.range.primitives.front and
popFront. If you can't have auto-decoding, you either have to be dealing
with functions that you know avoid it, or you need to do use something like
std.string.representation or std.utf.byCodeUnit to get around the
auto-decoding. If you're dealing with invalid Unicode, you basically have to
either convert it all up front or do something like treat it like binary
data, or Phobos is going to try to decode it as Unicode and give you a
UTFExceptions.

> I find Exceptions in range code hard to interpret.

Well, if you just look at the stack trace, it should tell you. I don't see
why ranges would be any worse than any other code except for maybe the fact
that it's typical to chain a lot of calls, and you frequently end up with
wrapper types in the stack trace that you're not necessarily familiar with.
The big problem here really is that all you're really being told is that
your string has invalid Unicode in it somewhere and the chain of function
calls that resulted in std.utf.decode being called on your invalid Unicode.
But even if you weren't dealing with ranges, if you passed invalid Unicode
to something completely string-based which did decoding, you'd run into
pretty much the same problem. The data is being used outside of its original
context where you could easily figure out what it relates to, so it's going
to be a problem by its very nature. The only real solution there is to be
controlling the decoding yourself, and even then, it's easy to be in a
position where it's hard to figure out where in the data the bad data is
unless you've done something like keep track of exactly what index your at,
which really doesn't work well once you're dealing with slicing data.

> @Kagamin
>
> > Do it old school?
>
> I want to be convinved that Range programming works like a charm,
> but the procedural approaches remain more flexible (and faster
> too) it seems. Thanks for the example.

The whole auto-decoding mess makes things worse than they should be, but if
you find procedural examples more flexible, then I would guess that that
would be simply a matter of getting more experience with ranges. Ranges are
far more composable in terms of how they're used, which tends to inherently
make them more flexible. However, it does result in code that's a mixture of
functional and procedural programming, which can be quite a shift for some
folks. So, there's no question that it takes some getting used to, but D
does allow for the more classic approaches, and ranges are not always the
best approach.

As for performance, that depends on the code and the compiler. It wouldn't
surprise me if dmd didn't optimize out the range stuff as much as it really
should, but it's my understanding that ldc typically manages to generate
code where the range abstraction didn't cost you anything. If there's an
issue, I think that it's frequently an algorithmic one or the fact that some
range-processing has a tendency to process the same data multiple times,
because that's the easiest, most abstract way to go about it and works in
general but isn't always the best solution.

For instance, because of how the range API works, when using splitter, if
you iterate through the entire range, you pretty much have to iterate
through it twice, because it does look-ahead to find the delimiter and then
returns you a slice up to that point, after which, you process that chunk of
the data to do whatever it is you want to do with each split piece. At a
conceptual level, what you're doing with your code with splitter is then
really clean and easy to write, and often, it should be plenty efficient,
but it does require going over the data twice, whereas if you looped over
the data yourself, looking for each delimiter, you'd only need to iterate
over it once. So, in cases like that, I'd fully expect the abstraction to
cost you, though whether it costs enough to matter depends on what you're
doing.

As is the case when dealing with most abstractions, I think that it's mostly
a matter of using it where it makes sense to write cleaner code more quickly
and then later figuring out the hot spots where you need to optimize better.
In many cases, ranges will be pretty much the same as writing loops, and in
others, the abstraction is worth the cost. Where it isn't, you don't use
them or implement something yourself rather than using the standard function
for it, because you can write something faster for your use case.

Just the other day, I refactored some code to not use splitter, because in
that particular case, it was costing too much, but there are still tons of
cases where I'd use splitter without thinking twice about it, because it's
the simplest, fastest way to get the job done, and it's going to be fast
enough in most cases.

- Jonathan M Davis