Reading text (I mean "real" text...)

Denis noreply at noserver.lan
Sat Jun 20 01:35:56 UTC 2020


THE PROBLEM

UTF-8 validation alone is insufficient for ensuring that a file 
contains only human-readable text, because control characters are 
themselves valid UTF-8. Apart from tab, newline, carriage return, 
and a few less commonly used others that count as whitespace, 
human-readable text files should not normally contain embedded 
control characters.
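
To make the distinction concrete, here is a rough sketch in C of the 
kind of per-character check I mean. The name is_text_char is 
hypothetical, and the choice of which control characters to allow 
(only tab, newline and carriage return here) is my own assumption; 
it rejects the other C0 controls, DEL, and the C1 range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the decoded code point cp is
   acceptable in human-readable text. The allowed set is an assumption. */
bool is_text_char(uint32_t cp)
{
    assert(cp <= 0x10FFFF);          /* caller passes a decoded code point */

    if (cp == '\t' || cp == '\n' || cp == '\r')
        return true;                 /* whitespace controls allowed here */
    if (cp < 0x20 || cp == 0x7F)
        return false;                /* remaining C0 controls and DEL */
    if (cp >= 0x80 && cp <= 0x9F)
        return false;                /* C1 controls */
    return true;                     /* everything else counts as text */
}
```

Note that NUL (code point 0) is rejected by this check even though 
it passes UTF-8 validation.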

In the standard library, the read functions for "text" files (as 
opposed to binary files) that I looked at are not actually based 
on "human readable text", but on "characters". For example:

  - In std.stdio, readln accepts everything. Lines are simply 
    separated by the occurrence of a newline or a user-designated 
    terminator character.
  - In std.file, readText accepts all valid UTF-8 characters, 
    control characters included.

This means, for example, that these functions will happily try to 
read an enormous file of zero bytes in its entirety (something 
that should not even be considered "text") into a string 
variable on the very first call, because NUL is a valid UTF-8 
character and such a file contains no newline. Not good... A 
function that reads only "human-readable text" should instead 
throw an exception as soon as it encounters a disallowed control 
character or an invalid UTF-8 sequence.

THE OBJECTIVE

The objective is to read a file one line at a time (reading each 
line into a string), while checking character by character that 
it is human-readable text. Disallowed control characters and 
invalid UTF-8 sequences should generate an exception.

Unless there's already an existing function that works as 
described, I'd like to write one. I expect that this will require 
combining an existing read-by-UTF8-char or read-by-byte function 
with the additional validation.
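
Roughly, this is the shape I have in mind, sketched in C on top of 
fgetc (a D version would presumably throw instead of returning a 
code). Everything named here (read_text_line, the TL_* codes, the 
fixed-size buffer) is hypothetical, the set of allowed control 
characters is my assumption, and I/O errors are not distinguished 
from end of file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch. */
enum { TL_OK = 0, TL_EOF = 1,
       TL_BAD_CONTROL = -1, TL_BAD_UTF8 = -2, TL_TOO_LONG = -3 };

/* Reads one line from f into buf (NUL-terminated, '\n' stripped),
   validating byte by byte. Stops at the first offending character,
   so a huge file of zero bytes fails on its very first byte. */
int read_text_line(FILE *f, char *buf, size_t cap)
{
    assert(f != NULL && buf != NULL && cap > 0);
    size_t n = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            buf[n] = '\0';
            return TL_OK;
        }

        /* The lead byte gives the sequence length, the initial bits,
           and the smallest code point that needs this length. */
        int extra;
        unsigned long cp, min;
        if (c < 0x80)                { extra = 0; cp = (unsigned long)c; min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return TL_BAD_UTF8;     /* stray continuation or invalid byte */

        /* Room for the whole sequence plus the terminating NUL? */
        if (n + (size_t)extra + 1 >= cap)
            return TL_TOO_LONG;
        buf[n++] = (char)c;

        for (int i = 0; i < extra; i++) {
            c = fgetc(f);
            if (c == EOF || (c & 0xC0) != 0x80)
                return TL_BAD_UTF8;  /* truncated or malformed sequence */
            cp = (cp << 6) | (unsigned long)(c & 0x3F);
            buf[n++] = (char)c;
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return TL_BAD_UTF8;

        /* Reject control characters other than tab and carriage return
           (newline was handled above); this set is an assumption. */
        if ((cp < 0x20 && cp != '\t' && cp != '\r') || cp == 0x7F ||
            (cp >= 0x80 && cp <= 0x9F))
            return TL_BAD_CONTROL;
    }

    if (n == 0)
        return TL_EOF;               /* nothing left to read */
    buf[n] = '\0';                   /* last line without a trailing '\n' */
    return TL_OK;
}
```

A caller would loop until TL_EOF and treat any negative code as the 
point where the exception should be raised. A real version would 
also need a growable buffer instead of a fixed cap.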

Q1: Which existing functions (D or C) would you suggest 
leveraging? For example, there are quite a few variants of "read" 
and in different libraries too. For a newcomer, it can be 
difficult to intuit which one is best suited for what.

Q2: Any source code (D or C) you might suggest I look at, to get 
ideas for how parts of this could be written?

Thanks for your help.

