Reading text (I mean "real" text...)
Denis
noreply at noserver.lan
Sat Jun 20 01:35:56 UTC 2020
THE PROBLEM
UTF-8 validation alone is insufficient for ensuring that a file
contains only human-readable text, because control characters are
valid UTF-8. Apart from tab, newline, carriage return, and a few
less commonly used others that count as whitespace,
human-readable text files should not normally contain embedded
control characters.
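To make the rule concrete, here is a minimal sketch in C of the kind of check I have in mind. The name is_text_code_point is hypothetical, and the policy is an assumption: only tab, newline and carriage return are allowed among the control characters, while DEL and the C1 range (U+0080 to U+009F) are rejected. The other whitespace controls, such as form feed, could be added to the allowed set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if code point cp is acceptable in
   human-readable text. Assumed policy: tab, LF and CR are the only
   control characters allowed. */
static bool is_text_code_point(uint32_t cp)
{
    if (cp == '\t' || cp == '\n' || cp == '\r')
        return true;                    /* allowed whitespace controls */
    if (cp < 0x20)
        return false;                   /* remaining C0 controls, including NUL */
    if (cp == 0x7F)
        return false;                   /* DEL */
    if (cp >= 0x80 && cp <= 0x9F)
        return false;                   /* C1 controls */
    return true;                        /* everything else is accepted */
}
```

Note that this operates on decoded code points, not on raw bytes, so it only makes sense after (or together with) UTF-8 decoding.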
In the standard library, the read functions for "text" files (as
opposed to binary files) that I looked at are not actually based
on "human-readable text", but on "characters". For example:
- In std.stdio, readln accepts everything. Lines are simply
separated by the occurrence of a newline or user-designated
character.
- In std.file, readText accepts all valid UTF-8 characters.
This means, for example, that all of these functions will happily
try to read an enormous file of zero bytes in its entirety
(something that should not even be considered "text") into a
string variable, on the very first call to the read function. Not
good. A function that reads only "human-readable text" should
instead throw an exception as soon as it encounters a disallowed
control character or an invalid UTF-8 sequence.
THE OBJECTIVE
The objective is to read a file one line at a time (reading each
line into a string), while checking for human-readable text
character by character. Invalid characters (disallowed control
characters and malformed UTF-8 sequences) should generate an
exception.
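As a rough illustration of that objective, here is a sketch in C built on fgetc. Everything named here (text_status, read_code_point, is_forbidden_control, read_text_line) is hypothetical, and since C has no exceptions, a status code stands in for the exception. It assumes tab, LF and CR are the only control characters allowed, and that lines fit in a caller-supplied buffer of at least one byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum text_status { TEXT_LINE, TEXT_EOF, TEXT_BAD_UTF8, TEXT_BAD_CONTROL, TEXT_TOO_LONG };

/* Decode one UTF-8 sequence from fp. Returns the number of bytes read
   (1..4), 0 at end of file (a read error also shows up as EOF here),
   or -1 if the sequence is malformed. The raw bytes are kept in out. */
static int read_code_point(FILE *fp, uint32_t *cp, unsigned char out[4])
{
    int c = fgetc(fp);
    int len;
    uint32_t min;                       /* smallest value allowed for this length */

    if (c == EOF) return 0;
    out[0] = (unsigned char)c;
    if (c < 0x80) { *cp = (uint32_t)c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = (uint32_t)(c & 0x1F); }
    else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = (uint32_t)(c & 0x0F); }
    else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = (uint32_t)(c & 0x07); }
    else return -1;                     /* stray continuation or invalid lead byte */

    for (int i = 1; i < len; i++) {
        c = fgetc(fp);
        if (c == EOF || (c & 0xC0) != 0x80) return -1;   /* truncated sequence */
        out[i] = (unsigned char)c;
        *cp = (*cp << 6) | (uint32_t)(c & 0x3F);
    }
    /* Reject overlong encodings, surrogates, and values beyond U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF)) return -1;
    return len;
}

/* Assumed policy: tab, LF and CR are the only control characters allowed. */
static int is_forbidden_control(uint32_t cp)
{
    if (cp == '\t' || cp == '\n' || cp == '\r') return 0;
    return cp < 0x20 || cp == 0x7F || (cp >= 0x80 && cp <= 0x9F);
}

/* Read one line into buf (NUL-terminated, without the trailing '\n';
   a '\r' from a CRLF ending is kept). cap must be at least 1.
   Stops at the first invalid character instead of reading on. */
static enum text_status read_text_line(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    for (;;) {
        uint32_t cp;
        unsigned char bytes[4];
        int len = read_code_point(fp, &cp, bytes);

        if (len < 0) return TEXT_BAD_UTF8;
        if (len == 0) {                 /* end of file */
            buf[n] = '\0';
            return n > 0 ? TEXT_LINE : TEXT_EOF;
        }
        if (is_forbidden_control(cp)) return TEXT_BAD_CONTROL;
        if (cp == '\n') { buf[n] = '\0'; return TEXT_LINE; }
        if (n + (size_t)len >= cap) return TEXT_TOO_LONG;   /* keep room for NUL */
        for (int i = 0; i < len; i++) buf[n++] = (char)bytes[i];
    }
}
```

A caller would loop on read_text_line until it returns something other than TEXT_LINE. With this approach a file of zero bytes fails on its very first byte with TEXT_BAD_CONTROL, rather than being read in full.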
Unless there's already an existing function that works as
described, I'd like to write one. I expect that this will require
combining an existing read-by-UTF8-char or read-by-byte function
with the additional validation.
Q1: Which existing functions (D or C) would you suggest
leveraging? For example, there are quite a few variants of "read"
and in different libraries too. For a newcomer, it can be
difficult to intuit which one is best suited for what.
Q2: Any source code (D or C) you might suggest I look at, to get
ideas for how parts of this could be written?
Thanks for your help.