Using lazy code to process large files

kdevel via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Aug 2 11:28:52 PDT 2017


On Wednesday, 2 August 2017 at 17:37:09 UTC, Steven Schveighoffer 
wrote:

> What is expected? What I see on the screen when I run my code 
> is:
>
> [Ü]

Upper case?

> What I see when I run your "working" code is:
>
> [?]

Your terminal is incapable of rendering the Latin-1 encoding. The 
program prints one byte of value 0xfc. You may pipe the output 
into hexdump -C:

00000000  5b fc 5d 0a                                       |[ü].|
00000004
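
The same byte sequence can also be inspected from inside the program. What follows is only a minimal sketch, not the code from the earlier posts: hexBytes is a hypothetical helper name, and the literal "[\xfc]" is assumed to stand in for the string under discussion.

```d
import std.format : format;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: renders the code units of `s` as space-separated hex.
string hexBytes(string s)
{
    // representation() reinterprets the string as immutable(ubyte)[];
    // nothing is decoded or validated here.
    return format("%(%02x %)", s.representation);
}

void main()
{
    // \xfc inserts the single byte 0xFC, which is not valid UTF-8 on its own.
    string s = "[\xfc]";
    writeln(s);            // copies the raw bytes to stdout
    writeln(hexBytes(s));  // prints "5b fc 5d"
}
```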

> You are missing the point that your input string is invalid.

It's perfectly okay to put any value an octet can take into an 
octet. I did not claim that the data in the string memory is 
syntactically valid UTF-8. Read the comment in line 9 of my post 
of 15:02:22.

> std.algorithm is not validating the entire string,

True, and it should not. This is what I want.

> and so it doesn't throw an error like string.stripLeft does.

That is the point. You wrote

| I wouldn't expect good performance from this, as there is
| auto-decoding all over the place.

I erroneously thought that using byCodeUnit disables the whole 
UTF-8 processing and enforces operation on (u)bytes. But this is 
not the case, at least not for stripLeft and probably not for 
other string functions.
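
One way to actually get operation on ubytes is to reinterpret the string with std.string.representation and strip with the predicate version of stripLeft from std.algorithm. This is a minimal sketch, assuming that only ASCII whitespace needs to be removed; isAsciiSpace and stripLeftBytes are hypothetical names, not Phobos functions.

```d
import std.algorithm : stripLeft;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical predicate: true for the ASCII whitespace bytes handled here.
bool isAsciiSpace(ubyte b)
{
    return b == ' ' || b == '\t' || b == '\n' || b == '\r';
}

/// Hypothetical helper: strips leading ASCII whitespace byte by byte.
/// No UTF-8 decoding takes place, so invalid sequences pass through untouched.
immutable(ubyte)[] stripLeftBytes(string s)
{
    return s.representation.stripLeft!isAsciiSpace;
}

void main()
{
    string s = "  \t\xfc";           // whitespace followed by the byte 0xFC
    auto rest = stripLeftBytes(s);
    assert(rest.length == 1 && rest[0] == 0xfc);
    writeln(cast(string) rest);      // writes the single byte plus a newline
}
```

The cast back to string in main is only there to hand the bytes to writeln; it does not make the data valid UTF-8.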

> writeln doesn't do any decoding of individual strings. It 
> avoids the problem and just copies your bad data directly.

That is what I expected.
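
The difference between copying and validating can be made explicit with std.utf.validate, which throws a UTFException on malformed input. A minimal sketch, with isValidUtf8 as a hypothetical wrapper name:

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical wrapper: reports whether `s` is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "[\xfc]";        // contains the lone byte 0xFC
    assert(!isValidUtf8(s));    // validation rejects the data
    assert(isValidUtf8("[ü]")); // the UTF-8 encoded literal is accepted
    writeln(s);                 // writeln still copies the bytes as they are
}
```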



