coreutils with D trials, wc, binary vs well formed utf

Bastiaan Veelo Bastiaan at Veelo.net
Tue May 25 00:06:03 UTC 2021


On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
[...]
> Just bumped into 
> https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/

[...]

> Is there a(n easy-ish) way to fix up that wc.d source in the 
> blog to fallback to byte stream mode when a utf-8 reader fails 
> an encoding?

Welcome, Brian.

I have allowed myself to use exception handling and `filter`, 
so I no longer regard this as branch-free. But it does produce 
(almost) the same output as GNU wc:
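```d
// byCodePoint, walkLength and splitter are assumed to be imported at
// module level, as in the blog's wc.d.
Line toLine(char[] l) pure {
     import std.utf : UTFException, byChar;
     import std.ascii : isWhite;
     import std.algorithm : filter;
     try {
         // Well-formed UTF-8: count code points and whitespace-separated words.
         return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
     }
     catch (UTFException) {
         // Malformed input: fall back to bytes and ASCII whitespace.
         return Line(l.length,
                     l.byChar.splitter!(isWhite)
                      .filter!(w => w.length > 0).walkLength);
     }
}
```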
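To try the fallback outside the blog's program, here is a minimal self-contained sketch. The `Line` struct with `chars` and `words` fields is an assumption about the blog's definition, and the explicit call to `std.utf.validate` is an addition so that the switch to byte mode does not depend on which lazy range happens to decode first.

```d
import std.algorithm : filter, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.uni : byCodePoint;
import std.utf : UTFException, byChar, validate;

// Assumed shape of the blog's per-line result.
struct Line
{
    size_t chars;
    size_t words;
}

Line toLine(char[] l)
{
    try
    {
        validate(l); // throws UTFException on malformed UTF-8
        return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
    }
    catch (UTFException)
    {
        // Byte mode: one char per byte, words split on ASCII whitespace.
        return Line(l.length,
                    l.byChar.splitter!isWhite
                     .filter!(w => w.length > 0).walkLength);
    }
}

void main()
{
    char[] good = "h\u00E9llo w\u00F6rld".dup; // 13 bytes, 11 code points
    ubyte[] raw = [0x61, 0xFF, 0x20, 0x62];    // 0xFF is never valid in UTF-8
    char[] bad = cast(char[]) raw;

    assert(toLine(good) == Line(11, 2)); // UTF-8 path
    assert(toLine(bad) == Line(4, 2));   // byte-mode fallback
}
```

The two asserts exercise both paths: the well-formed line is counted in code points, the malformed one in bytes.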
The number of chars (bytes, in the fallback) is available in O(1) 
from the `.length` property. `byChar.splitter!(isWhite)` splits on 
the ASCII values of the chars, but without the `filter` it counts 
too many words, because a run of consecutive whitespace characters 
produces empty elements (https://run.dlang.io/is/QzjTN0):
```d
     writeln("Hello \t D".splitter!isWhite); // ["Hello", "", "", "D"]
     writeln("Hello \t D".splitter);         // ["Hello", "D"]
```
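This surprises me and could be a bug, although the documentation 
of the no-argument `splitter` does say that, contrary to 
`splitter!(std.uni.isWhite)`, it merges runs of whitespace and 
produces no empty tokens.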

So `filter!(w => w.length > 0)` filters out the "words" with zero 
length...
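The effect of the `filter` can be checked in isolation with a small sketch. The helper name `countWordsAscii` is hypothetical and only serves the illustration:

```d
import std.algorithm : filter, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.utf : byChar;

// Hypothetical helper: count words byte-wise, skipping empty slices.
size_t countWordsAscii(const(char)[] s)
{
    return s.byChar.splitter!isWhite.filter!(w => w.length > 0).walkLength;
}

void main()
{
    // Three adjacent separators give two empty slices between the words.
    assert("Hello \t D".byChar.splitter!isWhite.walkLength == 4);
    // With the filter only the two real words remain.
    assert(countWordsAscii("Hello \t D") == 2);
}
```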

Compared to GNU wc, this reports one line too many for me, though.

There may be more elegant solutions than mine.

-- Bastiaan.

