Multi-file byte comparison tool. What would you have done differently?

Jonathan M Davis jmdavisProg at gmx.com
Fri Aug 5 10:13:27 PDT 2011


> On 08/05/2011 03:02 AM, Pelle wrote:
> > On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer <kai at unixlords.com> wrote:
> >> I have a need for detecting incorrect byte sequences in multiple files
> >> (>2) at a time (as a part of our porting effort to new platforms.)
> >> Ideally the files should be identical for all but a handful of byte
> >> sequences (in a header section) that I can just skip over. I thought
> >> this would be a fun exercise for my D muscles. I found success
> >> creating a dynamic array of structs to keep information about each
> >> file passed in via command-line parameters. I'll append the code at
> >> the end (and I'm sure it'll get mangled in the process...)
> >> 
> >> (I'm not one for coming up with creative names, so it's SOMETHING)
> >> Then I loop around a read for each file, then manually run a for loop
> >> from 0 to BLOCK_SIZE, copy the size_t value into a new dynamic array
> >> (one for each of the files opened), and run a function to ensure all
> >> values in the size_t array are the same. If not, I compare each ubyte
> >> value (via the byte_union) to determine which bytes are not correct by
> >> adding each byte to a separate array, and comparing each value in that
> >> array, printing the address and values of each bad byte as I encounter
> >> them.
> >> 
> >> This appears to work great. Some justifications:
> >> I used size_t because I'm under the impression it's a platform
> >> specific size that best fits into a single register, thus making
> >> comparisons faster than byte-by-byte.
> >> I used a union to extract the bytes from the size_t
> >> I wanted to create a SOMETHING for each file at run-time, instead of
> >> only allowing a certain number of SOMETHINGS (either hard coded, or a
> >> limit).
> >> Originally I wrote my own comparison function, but in my search for
> >> something more functional, I tried out std.algorithm's count. Can't
> >> say I can tell if it's better or worse.
> >> 
> >> Features I'll probably add if I have to keep using the tool:
> >> 1) Better support for starting points and bytes to read.
> >> 2) Threshold for errors encountered, preferably managed by a
> >> command-line argument.
> >> 3) Coalescing error messages in sequential byte sequences.
> >> 
> >> When I run the program, it's certainly I/O bound at 30Mb/s to an
> >> external USB drive :).
> >> 
> >> So the question is, how would you make it more D-ish? (Do we have a
> >> term analogous to "pythonic" for D? :))
> >> 
> >> 
> >> Code:
> >> 
> >> import std.stdio;
> >> import std.file;
> >> import std.conv;
> >> import std.getopt;
> >> import std.algorithm;
> >> 
> >> enum BLOCK_SIZE = 1024;
> >> union byte_union
> >> {
> >>     size_t val;
> >>     ubyte[val.sizeof] bytes;
> >> }
> >> struct SOMETHING
> >> {
> >>     string file_name;
> >>     size_t size_bytes;
> >>     File fd;
> >>     byte_union[BLOCK_SIZE] bytes;
> >> }
> > 
> > I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion,
> > Something, sizeBytes, etc.
> 
> I haven't used TypeNames or nonTypeNames. Do you have some reference
> material for me? Searching http://www.d-programming-language.org/ didn't
> give me anything that sounds like what you're talking about.

I believe that he was talking about how you named your variables and types. 
Normally in D, user-defined types use PascalCase (e.g. Something and 
ByteUnion, not SOMETHING and byte_union), and everything else uses camelCase 
(e.g. blockSize and fileName, not BLOCK_SIZE and file_name). It's an issue of 
style. You can name your variables however you'd like, but what you're 
doing doesn't match what most of the people around here do, which can make 
the code that you post harder to read. He was simply suggesting that you 
follow the more typical D naming conventions. It's completely up to you 
though.
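As a sketch, your quoted declarations renamed to follow those conventions 
might look like this (the names are just the PascalCase/camelCase versions 
of yours; nothing else is changed):

```d
import std.stdio;

enum blockSize = 1024;

// PascalCase for user-defined types.
union ByteUnion
{
    size_t val;
    ubyte[val.sizeof] bytes;
}

struct Something
{
    // camelCase for variables and fields.
    string fileName;
    size_t sizeBytes;
    File fd;
    ByteUnion[blockSize] bytes;
}
```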

- Jonathan M Davis


More information about the Digitalmars-d-learn mailing list