The Nullity Of strings and Its Meaning

Sat Jul 8 16:12:15 PDT 2017

On Saturday, July 8, 2017 5:16:51 PM MDT kdevel via Digitalmars-d-learn 
wrote:
> Yesterday I noticed that std.uri.decodeComponent does not 'preserve'
> the nullity of its argument:
>
>     1 void main ()
>     2 {
>     3    import std.uri;
>     4    string s = null;
>     5    assert (s is null);
>     6    assert (s.decodeComponent);
>     7 }
>
> The assertion in line 6 fails. This failure gave rise to a more
> general investigation on strings. After some research I found that
> one "cannot implicitly convert expression (s) of type string to bool"
> as in
>
>     void main ()
>     {
>        string s;
>        bool b = s;
>     }
>
> Nonetheless in certain boolean contexts strings convert to bool
> as here:
>
>     void main ()
>     {
>        import std.stdio;
>        string s; // equivalent to s = null
>        writeln (s ? true : false);
>        s = "";
>        writeln (s ? true : false);
>     }
>
> The code prints
>
>     false
>     true
>
> to the console. This led me to the insight that in D there are two
> distinct kinds of empty strings: those whose ptr is null and those
> whose ptr is not. It seems that this ptr nullity not only determines
> whether the string compares equal to null in an IdentityExpression [1]
> but also the result of the above mentioned conversion in the boolean
> context.
>
> I wonder if this distinction is meaningful and---if not---why it is
> exposed to the application programmer so prominently.
>
> Then today I found this piece of code
>
>     void main ()
>     {
>        string s = null;
>        string t = "";
>        assert (s is t);
>     }
>
> which, according to the wording in [1]
>
>    "For static and dynamic arrays, identity is defined as referring to
>     the same array elements and the same number of elements."
>
> should succeed, but its assertion fails [2]. I suspect the
> implementation compares the ptrs even in the case of zero elements.
>
> A last example of 'deviant behavior' I found is this:
>
>     import std.stdio;
>     import std.file;
>     void main ()
>     {
>        string s = null;
>        try
>           mkdir (s);
>        catch (Exception e)
>           e.msg.writeln;
>
>        s = "";
>        try
>           mkdir (s);
>        catch (Exception e)
>           e.msg.writeln;
>     }
>
> Using DMD v2.073.2 the first mkdir call terminates the program with a
> segmentation fault. With 2.074.1 the program prints
>
>     : Bad address
>     : No such file or directory
>
> I find that a bit confusing.
>
> [1] https://dlang.org/spec/expression.html#identity_expressions
> [2] https://issues.dlang.org/show_bug.cgi?id=17623

A dynamic array in D is essentially

struct DynamicArray(T)
{
    size_t length;
    T* ptr;
}

That's not _exactly_ what it is at the moment (it actually does stuff with
void* rather than templates unfortunately), but essentially, that's what it
is and what it behaves like.

In the case of dynamic arrays, null is a dynamic array whose ptr is null
and whose length is 0.

The empty property for arrays checks whether the length of the array is 0.
So, any array with a length of 0 (regardless of its ptr) is considered
empty.
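
As a minimal sketch of the above (the variable names are made up for
illustration, and the non-null ptr of the "" literal is what current
compilers produce rather than something I'd want to promise), a null
string and an empty literal differ only in their ptr, and empty does not
care about that difference:

```d
import std.range.primitives : empty;
import std.stdio : writefln;

void main()
{
    string a = null; // ptr is null, length is 0
    string b = "";   // length is 0, but ptr refers to the literal's storage

    // Both have no elements, so both are empty.
    assert(a.length == 0 && b.length == 0);
    assert(a.empty && b.empty);

    // Only the ptr values differ.
    assert(a.ptr is null);
    assert(b.ptr !is null);

    writefln("a: ptr is null = %s, length = %s", a.ptr is null, a.length);
    writefln("b: ptr is null = %s, length = %s", b.ptr is null, b.length);
}
```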

The is operator checks for bitwise equality. So,

arr is null

checks whether the array has a null ptr and a 0 length. In _most_
circumstances, that's equivalent to checking that the array's ptr is null,
but if you do something screwy with uninitialized memory, then you could end
up with a ptr value of null and a non-zero length, and

arr is null

would be false. The == expression, on the other hand, checks that the
elements are equal. So, it does something similar to

if(lhs.length != rhs.length)
    return false;
for(size_t i = 0; i < lhs.length; ++i)
{
    if(lhs.ptr[i] != rhs.ptr[i])
        return false;
}
return true;

So, if the lengths are 0, no iterating happens, and the two arrays are
considered equal. This means that a null array is equal to any other empty
array, regardless of the value of ptr. It's also why I would consider

arr == null

to be a code smell. IMHO, if you want to check for empty, then you should
use the empty property or check length directly, since those are clear about
your intent, whereas with

arr == null

you always have the question of whether they should have used the is
operator or whether they were simply checking for an empty array.
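
To make the difference between is and == concrete, here is a small sketch.
sameElements is a hypothetical helper that just spells out the loop above
for strings; it is not what the runtime actually calls.

```d
// Hypothetical helper mirroring the element-wise comparison that == does.
bool sameElements(const(char)[] lhs, const(char)[] rhs)
{
    if (lhs.length != rhs.length)
        return false;
    foreach (i; 0 .. lhs.length)
    {
        if (lhs[i] != rhs[i])
            return false;
    }
    return true;
}

void main()
{
    string n = null;
    string e = "";

    // is compares ptr and length, so only n is null.
    assert(n is null);
    assert(e !is null);
    assert(n !is e);

    // == compares elements, and neither array has any.
    assert(n == null);
    assert(e == null);
    assert(n == e);
    assert(sameElements(n, e));

    // Checking length states the intent without any ambiguity.
    assert(n.length == 0 && e.length == 0);
}
```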

If you understand all of this, it is perfectly possible to write code which
treats null arrays as distinct from empty arrays. However, it's _very_ easy
to get into a situation where you have an empty array rather than a null
one.  Pretty much as soon as you do anything to a null array other than pass
it around or compare it, trusting that it's still null can get error-prone.
And that's why a number of folks think that it's just plain error-prone to
try and treat null arrays as special - but some folks who understand the
issues continue to do so anyway, because they know enough to make it work
and consider the distinction valuable.
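
One easy way to end up with an empty array that isn't null is slicing. A
rough sketch (emptyTail is a made-up function for illustration):

```d
// Hypothetical helper: returns the zero-length slice at the end of s.
string emptyTail(string s)
{
    return s[$ .. $];
}

void main()
{
    string n = null;
    string tail = emptyTail("abc");

    // The slice has no elements, so it is equal to null...
    assert(tail.length == 0);
    assert(tail == n);

    // ...but its ptr still refers into the original string, so it is
    // not null as far as is is concerned.
    assert(tail !is null);
    assert(tail !is n);
}
```

So, code that hands an array through a few slicing operations and then
checks it with is null can easily get an answer that it did not expect.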

Personally, I think that it can make sense to have a function explicitly
return null to indicate something, but beyond that, I'd actually consider
using std.typecons.Nullable to make the whole thing clear, even if it is a
bit dumb to have to wrap a nullable type in a Nullable to treat it as null.
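
For what it's worth, here is roughly what the Nullable approach could look
like. lookup and the table contents are invented for the example; only
Nullable, nullable, isNull, and get come from std.typecons.

```d
import std.typecons : Nullable, nullable;

// Hypothetical lookup: "no entry" and "entry that is an empty string"
// are reported differently without relying on a null ptr.
Nullable!string lookup(string[string] table, string key)
{
    if (auto p = key in table)
        return nullable(*p);
    return Nullable!string.init; // isNull is true
}

void main()
{
    string[string] table = ["name": "kdevel", "comment": ""];

    auto name = lookup(table, "name");
    assert(!name.isNull && name.get == "kdevel");

    // Present but empty: not null, and the value has length 0.
    auto comment = lookup(table, "comment");
    assert(!comment.isNull && comment.get.length == 0);

    // Absent: null, regardless of what any ptr happens to be.
    auto missing = lookup(table, "missing");
    assert(missing.isNull);
}
```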

As for conversions to bool, not much implicitly converts to bool - dynamic
arrays included. However, conditional expressions in if statements, loops,
ternary expressions, and assertions actually insert an invisible, explicit
cast. So, even though the conversion _looks_ implicit, it's actually
explicit. So,

if(cond)
{
}

is actually

if(cast(bool)cond)
{
}

For user-defined types, that means that the way to affect how they're
treated in conditional expressions is to overload opCast for bool. For
built-in types, the result depends on how it was decided that casting
that type to bool should work. For pointers,

cast(bool)ptr

becomes

ptr !is null

which makes a lot of sense. Unfortunately, because dynamic arrays were just
pointers in C, D has historically treated dynamic arrays as pointers under
certain circumstances and implicitly converted them to the value of their ptr
property. Fortunately, in many cases, that has been fixed, and the compiler
has gotten stricter. Unfortunately, however, it is still the case that
casting a dynamic array to bool checks its ptr value for null. This works
fine if you know what you're doing but is frequently surprising to folks
and is arguably error-prone. It _was_ temporarily fixed at one point by
deprecating using arrays in conditional expressions, but some major D
contributors (Andrei included) who understood how to correctly treat null,
dynamic arrays as special did not like the change, and it was reverted.
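
A small sketch of how the three cases behave in a condition. Handle is a
made-up type for illustration, and the array results are the ones from the
original post rather than anything I'd call a guarantee:

```d
// Hypothetical user-defined type that decides its own truthiness.
struct Handle
{
    int fd = -1;

    // Used whenever a Handle appears in a conditional expression.
    bool opCast(T : bool)() const
    {
        return fd >= 0;
    }
}

void main()
{
    // User-defined type: the condition goes through opCast.
    Handle closed;
    auto opened = Handle(3);
    assert((closed ? true : false) == false);
    assert((opened ? true : false) == true);

    // Pointer: the condition is ptr !is null.
    int x;
    int* p = &x;
    int* q = null;
    assert((p ? true : false) == true);
    assert((q ? true : false) == false);

    // Dynamic array: the ptr is what gets checked, not the length.
    string n = null;
    string e = "";
    assert((n ? true : false) == false);
    assert((e ? true : false) == true); // empty, but "true"
}
```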

So, basically, you should be _very_ wary of ever using a dynamic array in a
conditional expression directly. If you know what you're doing, it can be
done correctly, but it's error-prone, and it's arguably a code smell,
because folks reading your code don't necessarily know that you know what
you're doing well enough to get it right.
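
If it helps, this is the sort of thing I mean by being explicit instead.
describe is just an illustrative name; the point is that each condition
says which question is being asked.

```d
// Hypothetical helper: every check names exactly what it tests.
string describe(const(char)[] s)
{
    if (s is null)          // null ptr and zero length
        return "null";
    if (s.length == 0)      // no elements, ptr may be anything
        return "empty";
    return "non-empty";
}

void main()
{
    assert(describe(null) == "null");
    assert(describe("") == "empty");
    assert(describe("abc") == "non-empty");
}
```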

- Jonathan M Davis