Proposal: clean up semantics of array literals vs string literals

Tue Oct 2 07:02:26 PDT 2012

2012/10/2 Don Clugston <dac at nospam.com>:
> The problem
> -----------
>
> String literals in D are a little bit magical; they have a trailing \0. This
> means that is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C.
> This trailing \0 is mentioned in the spec but only incidentally, and
> generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
> Does this have a trailing \0 ? I think it should, because it improves
> readability of string literals that are longer than one line. Currently DMD
> adds a \0, but it is not in the spec.
>
> Now consider array literals.
>
> printf(['H','e', 'l', 'l','o','\n']);
>
> Does this have a trailing \0 ? Currently DMD does not put one in.
> How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?
>
> And "Hello " ~ ['W','o','r','l','d','\n']   ?
>
> And "Hello World!" ~ '\n' ?
> And  null ~ "Hello World!\n" ?
>
> Currently DMD puts \0 in some cases but not others, and it's rather random.
>
> The root cause is that this trailing zero is not part of the type, it's part
> of the literal. There are no rules for how literals are propagated inside
> expressions, they are just literals. This is a mess.
>
> There is a second difference.
> Array literals of char type, have completely different semantics from string
> literals. In module scope:
>
> char[] x = ['a'];  // OK -- array literals can have an implicit .dup
> char[] y = "b";    // illegal
>
> This is a big problem for CTFE, because for CTFE, a string is just a
> compile-time value, it's neither string literal nor array literal!
>
> See bug 8660 for further details of the problems this causes.
>
>
> A proposal to clean up this mess
> --------------------------------
>
> Any compile-time value of type immutable(char)[] or const(char)[], behaves a
> string literals currently do, and will have a \0 appended when it is stored
> in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.
>
> Any value of type char[], which is generated at compile time, will not have
> the trailing \0, and it will do an implicit dup (as current array literals
> do).
>
> char [] foo()
> {
>     return "abc";
> }
>
> char [] x = foo();
>
> // x does not have a trailing \0, and it is implicitly duped, even though it
> was not declared with an array literal.
>
> -------------------
> So that the difference between string literals and char array literals would
> simply be that the latter are polysemous. There would be no semantics
> associated with the form of the literal itself.
>
>
> We still have this oddity:
>
>
> void foo(char qqq = 'b') {
>
>    string x = "abc";            // trailing \0
>    string y = ['a', 'b', 'c'];  // trailing \0
>    string z = ['a', qqq, 'c'];  // no trailing \0
> }
>
> This is because we made the (IMHO mistaken) decision to allow variables
> inside array literals.
> This is the reason why I listed _compile time value_ in the requirement for
> having a \0, rather than entirely basing it on the type.
>
> We could fix that with a language change: an array literal which contains a
> variable should not be of immutable type. It should be of mutable type (or
> const, in the case where it contains other, immutable values).
>
> So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
> though w is allocated on the heap).
>
> But that's a separate proposal from the one I'm making here. I just need a
> decision on the main proposal so that I can fix a pile of CTFE bugs.

Maybe your proposal is correct.
I think the key idea is *polysemous typed string literal*.

When based on the Ideal D Interpreter in my brain, the organized rule
will become like follows.

1-1) In semantic level, D should have just one polysemous string
literal, which is "an array of char".
1-2) In token level, D has two represents for the polysemous string
literal, they are "str" and ['s','t','r'].

2) The polysemous string literl is implicitly convertible to
[wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is
not need, because immutable([wd]?char)[] is implicitly convertible to
them).

3) The concatenation result between polysemous literals is still
polysemous, but its representation is different based on the both side
of the operator.

   "str" ~ "str";         // "strstr"
   "str" ~ ['s','t','r']; // ['s','t','r','s','t','r']
   "str" ~ 's';           // "strs"
   ['s','t','r'] ~ 's';   // ['s','t','r','s']
   "str" ~ null;          // "str"
   ['s','t','r'] ~ null;  // ['s','t','r']

4) After semantics _and_ optimization, polysemous string literal which
represented as like
 4-1) "str" is typed as immutable([wd]?char)[] (The char type is
depends on the literal suffix).
 4-2) ['s','t','r'] is typed as ([wd]?char)[] (The char type is
depends on the common type of its elements).

5) In object file generating phase, string literal which typed as
  5-1) immutable([wd]?)char[] is stored in the executable and
implicitly terminated with \0.
  5-2) [wd]?char[] are stored in the executable as the original image
and implicitly 'dup'ed in runtime.

----
Additionally, in following case, both concatenation should generate
polysemous string literals in CT and RT.
Because, after concatenation of chars and char arrays, newly allocated
strings are *purely immutable* value and implicitly convertible to
mutable.

immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r']));   // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r'];  // BUT, should be allowed

char mc = 'a';
pragma(msg, typeof("st"~mc~"r"));   // prints const(char)[]
char[] s = "st"~mc~"r";             // BUT, should be allowed

Kenji Hara