Prevent opening binary/other garbage files

bauss jj_1337 at live.dk
Sun Sep 30 06:17:20 UTC 2018


On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found 
> in the files in a given directory recursively. What's the best 
> strategy to avoid opening a bin file or some sort of garbage 
> dump? Check encoding of the given file?
>
> If so, what are the most popular encodings (in POSIX if that 
> matters) and how do I detect them?

What I would do is read the first 512 bytes and the last 512 
bytes. If over 50% of those bytes are below 32 and not 8, 9, 
10, 11, 12 or 13 (backspace, tab, line feed, vertical tab, form 
feed, carriage return), then chances are you have a binary file. 
This is only a heuristic, though: nothing stops someone from 
writing "invalid" bytes into a text file. There are no 
limitations on what a file can hold, and generally the system 
treats all files the same.

The reason I recommend reading the first 512 and the last 512 
bytes is that some binary files contain legitimate text strings. 
By sampling two places, it is less likely that both segments 
happen to be text.
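
A minimal sketch of that check in D, in case it helps. The names 
isSuspicious, looksBinary and fileLooksBinary are made up for 
this example, and the handling of small files (sampling without 
overlap, treating an empty file as "not binary") is my own 
assumption, not something implied above. It only uses File.size, 
File.seek and File.rawRead from std.stdio.

```d
import std.algorithm.comparison : max;
import std.stdio : File;

/// True for bytes below 32 other than 8..13
/// (backspace, tab, LF, VT, FF, CR).
bool isSuspicious(ubyte b) pure nothrow @nogc @safe
{
    return b < 32 && (b < 8 || b > 13);
}

/// Heuristic on an in-memory sample: strictly more than half of
/// the bytes are suspicious control characters.
bool looksBinary(const(ubyte)[] sample) pure nothrow @nogc @safe
{
    if (sample.length == 0)
        return false; // assumption: no bytes means no evidence

    size_t bad = 0;
    foreach (b; sample)
        if (isSuspicious(b))
            ++bad;
    return bad * 2 > sample.length;
}

/// Samples up to chunkSize bytes from the start and up to
/// chunkSize bytes from the end of a regular file.
/// chunkSize must be greater than zero, since rawRead rejects
/// an empty buffer.
bool fileLooksBinary(string path, size_t chunkSize = 512)
{
    auto f = File(path, "rb");
    immutable ulong total = f.size;

    auto buf = new ubyte[](chunkSize * 2);
    size_t n = f.rawRead(buf[0 .. chunkSize]).length;

    if (total > chunkSize)
    {
        // Start the tail after the head so that no byte of a
        // short file is counted twice.
        immutable ulong tailStart = max(chunkSize, total - chunkSize);
        immutable size_t tailLen = cast(size_t)(total - tailStart);
        f.seek(cast(long) tailStart);
        n += f.rawRead(buf[n .. n + tailLen]).length;
    }
    return looksBinary(buf[0 .. n]);
}
```

In the directory walk you would call fileLooksBinary(entry.name) 
and skip the file when it returns true. This assumes regular, 
seekable files; pipes and similar would need separate handling. 
Note also that UTF-16 text contains many zero bytes, so it may 
be flagged as binary by this test.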


More information about the Digitalmars-d-learn mailing list