Signed word lengths and indexes

Thu Jun 17 09:50:48 PDT 2010

On Thu, 17 Jun 2010 06:41:33 -0400, Kagamin <spam at here.lot> wrote:
> 
> Justin Spahr-Summers Wrote:
> 
> > > 1. Ironically the issue is not in file offset's signedness. You still hit the bug with ulong offset.
> > 
> > How so? Subtracting a size_t from a ulong offset will only cause 
> > problems if the size_t value is larger than the offset. If that's the 
> > case, then the issue remains even with a signed offset.
> 
> Maybe you didn't see the test case.
> ulong a;
> ubyte[] b;
> a+=-b.length; // go a little backwards

I did see that, but that's erroneous code. Maybe the compiler could warn 
about unary minus on an unsigned type, but I find such problems rare as 
long as everyone working on the code understands signedness.

> or
> 
> seek(-b.length, SEEK_CUR, file);

I wouldn't call it a failure of unsigned types that this causes 
problems. As I suggested above, the situation could possibly be 
alleviated if the compiler just warned about unary minus applied to 
unsigned operands.

As a couple of others pointed out, this is just a lack of understanding 
of unsigned types and modular arithmetic. I'd say that any programmer 
should have such an understanding, regardless of whether their 
programming language of choice supports unsigned types.
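
To make the modular arithmetic in the quoted test case concrete, here is a
minimal sketch in C. It assumes the length type is 32 bits wide (as size_t
is on a 32-bit target) while the offset is 64 bits wide; the function names
are hypothetical and only model the `a += -b.length` line, not any real
library code.

```c
#include <stdint.h>

/* Models `a += -b.length` with a 32-bit length and a 64-bit offset. */
static uint64_t step_back_buggy(uint64_t offset, uint32_t length)
{
    /* The negation is carried out in 32 bits: the result is 2^32 - length. */
    uint32_t negated = (uint32_t)(0u - length);

    /* Widening to 64 bits zero-extends, so the offset moves FORWARD
       by almost 4 GiB instead of backward by `length`. */
    offset += negated;
    return offset;
}

/* Widen first, then subtract: this only wraps if length > offset. */
static uint64_t step_back(uint64_t offset, uint32_t length)
{
    return offset - (uint64_t)length;
}
```

With an offset of 100 and a length of 10, step_back_buggy() yields
4294967386 while step_back() yields 90. When both operands have the same
width, the wrapped negation and the addition cancel out and the result is
correct, which is why the mistake is easy to miss on a 64-bit build.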

> > > 2. Signed offset is two times safer than unsigned as you can detect
> > > underflow bug (and, maybe, overflow).
> > 
> > The solution with unsigned values is to make sure that they won't 
> > underflow *before* performing the arithmetic - and that's really the 
> > proper solution anyways.
> 
> If you rely on client code to be correct, you get a security issue. And the client doesn't necessarily use your language or your compiler. Or he can turn off overflow checks for performance. Or he can use the same unsigned variable for both signed and unsigned offsets, so checks for underflow become useless.

What kind of client are we talking about? If you're referring to 
contract programming, then it's the client's own fault if they fiddle 
around with the code and end up breaking it or violating its 
conventions.

> > > With unsigned offset you get exception if the filesystem doesn't
> > > support sparse files, so the linux will keep silence.
> > 
> > I'm not sure what this means. Can you explain?
> 
> This means that you have a subtle bug.
> 
> > > 3. Signed offset is consistent/type-safe in the case of the seek function as it doesn't arbitrarily mutate between signed and unsigned.
> > 
> > My point was about signed values being used to represent zero-based 
> > indices. Obviously there are applications for a signed offset *from the 
> > current position*. It's seeking to a signed offset *from the start of 
> > the file* that's unsafe.
> 
> To catch this in the case of a signed offset you need only one check. In the case of unsigned offsets you have to watch for underflows in the entire application code even if it's not related to file seeks - just in order to fix an issue that can be fixed separately.

Signed offsets can (truly) underflow as well. I don't see how the issue 
is any different.
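
As a rough illustration of checking *before* the arithmetic, the sketch
below shows one possible guard for each representation. Both helpers are
hypothetical names invented for this example; the point is only that the
unsigned version needs a single comparison, and the signed version still
needs a range check of its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Move an unsigned offset back by `length` bytes.
   Returns false and leaves *offset untouched if it would wrap below zero. */
static bool seek_back_unsigned(uint64_t *offset, uint64_t length)
{
    if (length > *offset)
        return false;
    *offset -= length;
    return true;
}

/* Signed counterpart: reject true overflow of the subtraction first,
   then reject a negative resulting position. */
static bool seek_back_signed(int64_t *offset, int64_t length)
{
    if ((length < 0 && *offset > INT64_MAX + length) ||
        (length > 0 && *offset < INT64_MIN + length))
        return false;               /* the subtraction itself would overflow */

    int64_t result = *offset - length;
    if (result < 0)
        return false;               /* before the start of the file */

    *offset = result;
    return true;
}
```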

> 
> > > 4. Choosing unsigned for file offset is not dictated by safety, but by stupidity: "hey, I lose my bit!"
> > 
> > You referred to 32-bit systems, correct? I'm sure there are 32-bit 
> > systems out there that need to be able to access files larger than two 
> > gigabytes.
> 
> I'm talking about 64-bit file offsets which are 64-bit on 32-bit systems too.

In D's provided interface, this is true, but fseek() from C uses C's 
long data type, which is *not* 64-bit on 32-bit systems, and this is (I 
assume) what std.stdio uses under the hood, making it doubly unsafe.
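
A minimal sketch of the narrowing problem, written in C and assuming a
wrapper that accepts a 64-bit offset and forwards it to fseek(). The
wrapper and helper names are hypothetical, and this is not a claim about
how std.stdio is actually implemented.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True if a 64-bit offset survives conversion to C's long unchanged.
   Always true where long is 64 bits; can be false where long is 32 bits. */
static bool offset_fits_in_long(int64_t offset)
{
    return offset >= LONG_MIN && offset <= LONG_MAX;
}

/* Hypothetical wrapper: refuse to seek rather than silently truncate. */
static int checked_fseek(FILE *stream, int64_t offset, int whence)
{
    if (!offset_fits_in_long(offset))
        return -1;                  /* offset not representable as long */
    return fseek(stream, (long)offset, whence);
}
```

Without a check like this, an offset past 2 GiB would be truncated on a
platform with a 32-bit long before fseek() ever sees it.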

> As to file size limitations there's no difference between signed and
> unsigned lengths. File sizes have no tendency to stick to the 4 gig
> value. If you need to handle files larger than 2 gigs, you also need to
> handle files larger than 4 gigs.

Of course. But why restrict oneself to half the available space 
unnecessarily?

> > > I AM an optimization zealot, but unsigned offsets are plain dead
> > > freaking stupid.
> > 
> > It's not an optimization. Unsigned values logically correspond to
> > disk and memory locations.
> 
> They don't. Memory locations are a *subset* of the size_t value range.
> That's why you have bounds checks. And the problem is the usage of these
> locations: the memory bus doesn't perform computations on the addresses,
> the application does - it adds, subtracts, mixes signeds with unsigneds,
> has various type system holes or kludges, library design issues, used
> good practices etc. In other words, it gets a little bit more complex
> than just locations.

Bounds checking does alleviate the issue somewhat, I'll grant you that. 
But as far as address computation goes, even if your application does 
none, the operating system still will in order to map logical addresses, 
which start at 0, to physical addresses, which also start at 0. And the 
memory bus absolutely requires unsigned values even if it needs to 
perform no computation itself.