coreutils with D trials, wc, binary vs well formed utf
Bastiaan Veelo
Bastiaan at Veelo.net
Tue May 25 00:06:03 UTC 2021
On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
[...]
> Just bumped into
> https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/
[...]
> Is there a(n easy-ish) way to fix up that wc.d source in the
> blog to fallback to byte stream mode when a utf-8 reader fails
> an encoding?
Welcome, Brian.
I have allowed myself to use exception handling and `filter`,
so I no longer regard it as branch free. But it does (almost)
produce the same output as GNU wc:
```d
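Line toLine(char[] l) pure {
    import std.utf : UTFException, byChar;
    import std.ascii : isWhite;
    import std.algorithm : filter;
    try {
        // Well-formed UTF-8: count code points and whitespace-separated words.
        return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
    }
    catch (UTFException) {
        // Malformed input: count bytes, split on ASCII white space, and
        // drop the empty "words" left between adjacent separators.
        return Line(l.length,
                    l.byChar.splitter!(isWhite)
                     .filter!(w => w.length > 0)
                     .walkLength);
    }
}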
```
In the fallback, the number of chars is the byte count, which the
`.length` property returns in O(1). Use of `byChar.splitter!(isWhite)`
considers the ASCII values of the chars, but without the `filter` it
counts too many words. The reason is that a run of adjacent white space
characters yields empty elements (https://run.dlang.io/is/QzjTN0):
```d
writeln("Hello \t D".splitter!isWhite); // ["Hello", "", "", "D"]
writeln("Hello \t D".splitter); // ["Hello", "D"]
```
This surprises me; it could be a bug.
So `filter!(w => w.length > 0)` filters out the "words" with zero
length...
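To make the effect of the `filter` concrete, here is a minimal self-contained sketch. The helper name `countWords` is made up for illustration and is not part of the blog's wc.d; it simply wraps the same `byChar.splitter!(isWhite).filter` chain used in the fallback above.

```d
import std.algorithm : equal, filter, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.utf : byChar;

// Hypothetical helper (not from the post): counts the non-empty pieces
// between ASCII white space without decoding UTF-8.
size_t countWords(const(char)[] s)
{
    return s.byChar
            .splitter!isWhite
            .filter!(w => w.length > 0)
            .walkLength;
}

void main()
{
    // Each white space character is its own separator, so a run of
    // three of them leaves two empty pieces in the middle.
    assert("Hello \t D".splitter!isWhite.equal(["Hello", "", "", "D"]));

    // Dropping the empty pieces gives the expected word count.
    assert(countWords("Hello \t D") == 2);

    // A byte that is not valid UTF-8 is treated like any other
    // non-space char, and nothing is decoded along the way.
    char[] raw = [cast(char) 0xFF, ' ', 'a'];
    assert(countWords(raw) == 2);
}
```

Because the empty pieces are filtered out, the count does not depend on how many separators sit next to each other.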
Compared to GNU wc this reports one line too many for me, though.
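One possible explanation, offered as an assumption since the input file is not shown: GNU wc reports the number of newline characters, while a `byLine` loop counts every chunk it yields, including a final chunk that has no terminating newline. Binary files often end that way. The sketch below compares the two rules on in-memory buffers; `newlineCount` and `chunkCount` are hypothetical names, and `chunkCount` only models what a `byLine`-style loop would count rather than calling `byLine` itself.

```d
import std.algorithm : count;
import std.string : representation;

// The rule GNU wc uses for its line column: the number of '\n' bytes.
size_t newlineCount(const(ubyte)[] data)
{
    return data.count(cast(ubyte) '\n');
}

// Models a byLine-style loop: every chunk counts, including a final
// chunk that is not terminated by '\n'. (Assumption: this mirrors how
// the blog's fold adds one per yielded line.)
size_t chunkCount(const(ubyte)[] data)
{
    immutable trailingPartial = data.length > 0 && data[$ - 1] != '\n';
    return newlineCount(data) + (trailingPartial ? 1 : 0);
}

void main()
{
    auto terminated = "a\nb\n".representation;
    auto unterminated = "a\nb".representation;

    // Both rules agree when the data ends with a newline.
    assert(newlineCount(terminated) == 2);
    assert(chunkCount(terminated) == 2);

    // They differ by one when the last line is not terminated.
    assert(newlineCount(unterminated) == 1);
    assert(chunkCount(unterminated) == 2);
}
```

If this is indeed the cause, counting a line only when the chunk ends in `'\n'` (which `Yes.keepTerminator` preserves) should bring the totals in line with wc.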
There may be more elegant solutions than mine.
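One alternative, sketched here under stated assumptions rather than as a tested replacement: check the line up front with `std.utf.validate`, which throws `UTFException` on malformed input, so the fallback does not depend on where lazy decoding happens to fail. The `Line` struct is redeclared to keep the example self-contained, and `toLineChecked` is a hypothetical name.

```d
import std.algorithm : filter, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.utf : UTFException, byChar, validate;

// Same shape as the Line struct in the blog's wc.d, repeated here so
// the sketch compiles on its own.
struct Line
{
    size_t chars;
    size_t words;
}

// Hypothetical variant of toLine that validates before counting.
Line toLineChecked(const(char)[] l)
{
    try
    {
        validate(l); // throws UTFException if l is not well-formed UTF-8
        // Well-formed: count code points and whitespace-separated words.
        return Line(l.walkLength, l.splitter.walkLength);
    }
    catch (UTFException)
    {
        // Malformed: count bytes and split on ASCII white space only.
        return Line(l.length,
                    l.byChar.splitter!isWhite
                     .filter!(w => w.length > 0)
                     .walkLength);
    }
}

void main()
{
    // Two-byte characters count once each on the well-formed path.
    assert(toLineChecked("h\u00E9llo w\u00F6rld\n") == Line(12, 2));

    // 0xFF can never appear in valid UTF-8, so this takes the fallback.
    char[] raw = [cast(char) 0xFF, ' ', 'a', '\n'];
    assert(toLineChecked(raw) == Line(4, 2));
}
```

This still uses exception handling, so it is no more branch free than the version above; it only makes the point of failure explicit.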
-- Bastiaan.