Found one (the?) ARM bug: (Pointer) aliasing issues

Sun Nov 3 02:11:55 PST 2013

Am Sun, 3 Nov 2013 02:10:20 +0000
schrieb Iain Buclaw <ibuclaw at ubuntu.com>:

> On 2 November 2013 20:07, Johannes Pfau <nospam at example.com> wrote:
> 
> > I think I finally found the root cause for one of the remaining ARM
> > bugs (the codegen bug which only appears on -O2 or higher).
> >
> > What happened in the test case was that the gcc backend illegally
> > moved a read to a memory location before the write. It turns out
> > that GCC assumed that the read and write locations were in
> > different alias sets and therefore could not possibly reference the
> > same memory location. One alias set was for 'ubyte[]' and one for
> > 'char[]'.
> >
> > The code that triggers this issue is in std.algorithm.find:
> > --------
> > R1 find(...)(R1 haystack, R2 needle) if (/* are strings*/)
> > {
> >     return cast(char[]) .find!(ubyte[], ubyte[])
> >         (cast(ubyte[]) haystack, cast(ubyte[])needle);
> > }
> > --------
> > (Real code:
> >
> > https://github.com/D-Programming-GDC/GDC/blob/master/libphobos/src/std/algorithm.d#L3555
> > )
> >
> > The generic emitted by gdc:
> > --------
> > return <retval> = *(struct  *) &find (*(struct  *) &haystack,
> > *(struct *) &needle);
> > --------
> >
> > Or in the raw form:
> > @44 = ubyte[]
> > @11 = string
> > --------
> > @11     record_type      name: @17      size: @18      algn: 32
> >                          tag : struct   flds: @19
> > @31     pointer_type     size: @15      algn: 32       ptd : @11
> > @40     pointer_type     size: @15      algn: 32       ptd : @44
> > @41     call_expr        type: @44      fn  : @45      0   : @46
> >                          1   : @47
> > @44     record_type      name: @49      size: @18      algn:
> >                          tag : struct   flds: @50
> > @46     indirect_ref     type: @44      op 0: @53
> > @47     indirect_ref     type: @44      op 0: @54
> > @53     nop_expr         type: @40      op 0: @58
> > @54     nop_expr         type: @40      op 0: @59
> > @58     addr_expr        type: @31      op 0: @30
> > @59     addr_expr        type: @31      op 0: @61
> > --------
> >
> > Here it's easy to see that we essentially generate this code:
> > *(cast(ubyte[]*)(&haystack))
> > and this is AFAIK a violation of the aliasing rules.
> >
> > This problem is not observed at -O1 as the function call generates a
> > new stackframe and we therefore have two distinct memory locations.
> > But with -O2 inlining removes this copy and we now have variables
> > referencing the same memory location with type ubyte[] and char[].
> >
> > I can not provide a reduced testcase as the smallest changes in
> > compiler codegen (gcc version, gdc commit) or the test case can hide
> > this issue. It's always reproducible with test15.d in the test
> > suite. (In older gdc versions the test case does not segfault but
> > it produces wrong output at -O2. This can be a very subtle
> > difference)
> >
> > But here's a working code snippet which illustrates the generic
> > generated by GDC:
> >
> > -----------------
> > void main()
> > {
> >         char[] in1 = "Test".dup;
> >         char[] in2 = "Test2".dup;
> >
> >         char[] result = cast(char[])find(cast(ubyte[])in1,
> >             cast(ubyte[])in2);
> > }
> >
> > ubyte[] find(ubyte[] a, ubyte[] b)
> > {
> >     return a;
> > }
> > -----------------
> >
> >
> > So the important question here: Is this a bug in GDC codegen or is
> > the code in std.algorithm invalid? According to
> > http://dlang.org/expression.html
> > "The cast is done as a type paint" so this could indeed be
> > interpreted as a user mistake. But OTOH that page also talks about
> > a runtime check of the array .lengths which is clearly missing here.
> >
> > I'm also wondering if that runtime check can actually fix this
> > aliasing issue or if it can come up again if the runtime check
> > itself is inlined?
> >
> 
> 
> I guess D does not have any aliasing rules.  Question: does this
> occur when you pass -fno-strict-aliasing?
> 
> By default, GDC defines -fstrict-aliasing (which is also enabled by
> default in -O2 in GCC), which is nice to have, but if it's proving to
> be non-trivial, we can always disallow the optimisation being done in
> the middle-end.  See the LANG_HOOKS_GET_ALIAS_SET langhook - last
> time I checked, returning 0 disables aliasing rules from taking
> effect.
> 
> Regards.

Yes, -fno-strict-aliasing also fixes this problem. I'm starting a fresh
build on ARM right now to see if this fixes the remaining problems in
the test suite.

I think we should really disable strict aliasing. While it may provide
some nice optimization it's almost impossible for the user to do type
punning / casting with strict aliasing rules. I looked into the gcc
aliasing implementation (alias.c) and the only safe way is (as the
documentation says) to use unions. But then all access has to go
through the union. The often proposed helper functions which just cast
a value through the union are therefore illegal and will fail under
certain circumstances (e.g. if they're inlined). As optimization
(inlining, dead code / dead store elimination and target architecture)
also heavily changes the effects of pointer aliasing it's almost
impossible for a developer to write a safe type cast.

(See also http://d.puremagic.com/issues/show_bug.cgi?id=10750)