Multi-file byte comparison tool. What would you have done differently?
Kai Meyer
kai at unixlords.com
Fri Aug 5 10:50:12 PDT 2011
On 08/05/2011 11:13 AM, Jonathan M Davis wrote:
>> On 08/05/2011 03:02 AM, Pelle wrote:
>>> On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer<kai at unixlords.com> wrote:
>>>> I have a need for detecting incorrect byte sequences in multiple files
>>>> (>2) at a time (as a part of our porting effort to new platforms.)
>>>> Ideally the files should be identical for all but a handful of byte
>>>> sequences (in a header section) that I can just skip over. I thought
>>>> this would be a fun exercise for my D muscles. I found success
>>>> creating a dynamic array of structs to keep information about each
>>>> file passed in via command-line parameters. I'll append the code at
>>>> the end (and I'm sure it'll get mangled in the process...)
>>>>
>>>> (I'm not one for coming up with creative names, so it's SOMETHING)
>>>> Then I loop around a read for each file, then manually run a for loop
>>>> from 0 to BLOCK_SIZE, copy the size_t value into a new dynamic array
>>>> (one for each of the files opened), and run a function to ensure all
>>>> values in the size_t array are the same. If not, I compare each ubyte
>>>> value (via the byte_union) to determine which bytes are not correct by
>>>> adding each byte to a separate array, and comparing each value in that
>>>> array, printing the address and values of each bad byte as I encounter
>>>> them.
>>>>
>>>> This appears to work great. Some justifications:
>>>> I used size_t because I'm under the impression it's a platform
>>>> specific size that best fits into a single register, thus making
>>>> comparisons faster than byte-by-byte.
>>>> I used a union to extract the bytes from the size_t
>>>> I wanted to create a SOMETHING for each file at run-time, instead of
>>>> only allowing a certain number of SOMETHINGS (either hard coded, or a
>>>> limit).
>>>> Originally I wrote my own comparison function, but in my search for
>>>> something more functional, I tried out std.algorithm's count. Can't
>>>> say I can tell if it's better or worse.
>>>>
>>>> Features I'll probably add if I have to keep using the tool:
>>>> 1) Better support for starting points and bytes to read.
>>>> 2) Threshold for errors encountered, preferably managed by a
>>>> command-line argument.
>>>> 3) Coalescing error messages in sequential byte sequences.
>>>>
>>>> When I run the program, it's certainly I/O bound at 30Mb/s to an
>>>> external USB drive :).
>>>>
>>>> So the question is, how would you make it more D-ish? (Do we have a
>>>> term analogous to "pythonic" for D? :))
>>>>
>>>>
>>>> Code:
>>>>
>>>> import std.stdio;
>>>> import std.file;
>>>> import std.conv;
>>>> import std.getopt;
>>>> import std.algorithm;
>>>>
>>>> enum BLOCK_SIZE = 1024;
>>>> union byte_union
>>>> {
>>>>     size_t val;
>>>>     ubyte[val.sizeof] bytes;
>>>> }
>>>> struct SOMETHING
>>>> {
>>>>     string file_name;
>>>>     size_t size_bytes;
>>>>     File fd;
>>>>     byte_union[BLOCK_SIZE] bytes;
>>>> }
>>>
>>> I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion,
>>> Something, sizeBytes, etc.
>>
>> I haven't used TypeNames or nonTypeNames. Do you have some reference
>> material for me? Searching http://www.d-programming-language.org/ didn't
>> give me anything that sounds like what you're talking about.
>
> I believe that he was talking about how you named your variables and type
> names. Normally in D, user-defined types use pascal case (e.g. Something and
> ByteUnion, not SOMETHING and byte_union), and everything else uses camelcase
> (e.g. blockSize and fileName, not BLOCK_SIZE and file_name). It's an issue of
> style. You can name your variables however you'd like to, but what you're
> doing doesn't match what most of the people around here do, which can make it
> harder to read the code that you post. He was simply suggesting that you
> follow the more typical D naming conventions. It's completely up to you
> though.
>
> - Jonathan M Davis
Thanks :) That's what I was asking for. I've got some programming habits
that need to be broken if I want to increase my D-Factor!
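
For reference, here is a rough sketch of the declarations from my original
post renamed along those lines. The new names (blockSize, ByteUnion,
Something, fileName, sizeBytes) are the ones suggested in this thread; the
small main() is only an illustrative sanity check and is not part of the
actual tool.

```d
import std.stdio : File;

// Everything that is not a type: camelCase.
enum blockSize = 1024;          // was BLOCK_SIZE

// User-defined types: PascalCase.
union ByteUnion                 // was byte_union
{
    size_t val;
    ubyte[val.sizeof] bytes;    // overlays the bytes of val
}

struct Something                // was SOMETHING
{
    string fileName;            // was file_name
    size_t sizeBytes;           // was size_bytes
    File fd;
    ByteUnion[blockSize] bytes;
}

void main()
{
    // Illustrative check only: writing one byte through the union
    // changes the overlapping size_t value.
    ByteUnion u;
    u.val = 0;
    u.bytes[0] = 0xFF;
    assert(u.val != 0);
}
```

Only the identifiers change; the layout and behavior of the types stay the
same as before.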