Prevent opening binary/other garbage files

Adam D. Ruppe destructionator at gmail.com
Sat Sep 29 16:01:18 UTC 2018


On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found 
> in the files in a given directory recursively. What's the best 
> strategy to avoid opening a bin file or some sort of garbage 
> dump? Check encoding of the given file?

The simplest approach might be to read the first few bytes (a 
couple hundred, probably) and check whether any of them is below 
32 and is not '\t', '\r', '\n', or 0. If you find one, there's a 
good chance it is a binary file.

Text files are frequently going to have tabs and newlines, but 
not so frequently other low bytes.

If you do find a bunch of 0 bytes, but not the other values, you 
might have a UTF-16 file.
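
A rough sketch of that check in D, in case it helps. The names 
(Guess, classify, classifyFile) and the 256-byte default sample 
size are my own illustrative choices, not anything from Phobos, 
and treating even a single 0 byte as a UTF-16 hint is a 
simplification of "a bunch of 0 bytes":

```d
import std.stdio : File;

/// Hypothetical result type for the heuristic.
enum Guess { text, maybeUtf16, binary }

/// Classifies a sample of bytes taken from the start of a file.
/// A byte below 32 other than tab, CR, LF or 0 suggests binary data.
/// Zero bytes with no other low bytes hint at UTF-16.
Guess classify(const(ubyte)[] bytes)
{
    bool sawZero = false;
    foreach (b; bytes)
    {
        if (b == 0)
            sawZero = true;
        else if (b < 32 && b != '\t' && b != '\r' && b != '\n')
            return Guess.binary;
    }
    return sawZero ? Guess.maybeUtf16 : Guess.text;
}

/// Reads up to sampleSize bytes from the file and classifies them.
/// Assumes sampleSize > 0, since rawRead rejects an empty buffer.
Guess classifyFile(string path, size_t sampleSize = 256)
{
    auto buffer = new ubyte[](sampleSize);
    auto file = File(path, "rb");
    // rawRead returns the slice it actually filled, which is
    // shorter than the buffer when the file is small.
    auto sample = file.rawRead(buffer);
    return classify(sample);
}
```

An empty file comes back as Guess.text here, which is harmless 
for a keyword search since there is nothing in it to match.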

> If so, what are the most popular encodings (in POSIX if that 
> matters) and how do I detect them?

For text on POSIX systems, files are most likely going to be 
UTF-8, and you can try using Phobos' readText function. It will 
throw if it encounters non-UTF-8 data, so you can catch that and 
go on to the next file.
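
Something along these lines, as a sketch only. fileContains and 
filesContaining are hypothetical helper names; readText, 
UTFException, dirEntries and canFind are the real Phobos pieces. 
Note that this only catches the UTF failure, so a file that cannot 
be read at all will still throw a FileException:

```d
import std.algorithm.searching : canFind;
import std.file : dirEntries, readText, SpanMode;
import std.utf : UTFException;

/// Returns true if the file is valid UTF-8 and contains keyword.
/// Files that fail UTF-8 validation are skipped (false).
bool fileContains(string path, string keyword)
{
    try
    {
        // readText validates the contents and throws UTFException
        // when the file is not valid UTF-8.
        return readText(path).canFind(keyword);
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Walks dir recursively and collects the files containing keyword.
string[] filesContaining(string dir, string keyword)
{
    string[] hits;
    foreach (entry; dirEntries(dir, SpanMode.depth))
    {
        if (entry.isFile && fileContains(entry.name, keyword))
            hits ~= entry.name;
    }
    return hits;
}
```

Keep in mind that readText loads the whole file into memory before 
validating it, so it does more work per file than sampling the 
first few hundred bytes.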

But the simpler check described above will also probably work and 
can read less of the file.


More information about the Digitalmars-d-learn mailing list