Proposal: clean up semantics of array literals vs string literals
kenji hara
k.hara.pg at gmail.com
Tue Oct 2 07:02:26 PDT 2012
2012/10/2 Don Clugston <dac at nospam.com>:
> The problem
> -----------
>
> String literals in D are a little bit magical; they have a trailing \0. This
> means that it is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C.
> This trailing \0 is mentioned in the spec but only incidentally, and
> generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
> Does this have a trailing \0 ? I think it should, because it improves
> readability of string literals that are longer than one line. Currently DMD
> adds a \0, but it is not in the spec.
>
> Now consider array literals.
>
> printf(['H','e', 'l', 'l','o','\n']);
>
> Does this have a trailing \0 ? Currently DMD does not put one in.
> How about ['H','e', 'l', 'l','o'] ~ " World!\n" ?
>
> And "Hello " ~ ['W','o','r','l','d','\n'] ?
>
> And "Hello World!" ~ '\n' ?
> And null ~ "Hello World!\n" ?
>
> Currently DMD puts \0 in some cases but not others, and it's rather random.
>
> The root cause is that this trailing zero is not part of the type, it's part
> of the literal. There are no rules for how literals are propagated inside
> expressions, they are just literals. This is a mess.
>
> There is a second difference.
> Array literals of char type have completely different semantics from string
> literals. In module scope:
>
> char[] x = ['a']; // OK -- array literals can have an implicit .dup
> char[] y = "b"; // illegal
>
> This is a big problem for CTFE, because for CTFE, a string is just a
> compile-time value, it's neither string literal nor array literal!
>
> See bug 8660 for further details of the problems this causes.
>
>
> A proposal to clean up this mess
> --------------------------------
>
> Any compile-time value of type immutable(char)[] or const(char)[] behaves as
> string literals currently do, and will have a \0 appended when it is stored
> in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.
>
> Any value of type char[], which is generated at compile time, will not have
> the trailing \0, and it will do an implicit dup (as current array literals
> do).
>
> char [] foo()
> {
> return "abc";
> }
>
> char [] x = foo();
>
> // x does not have a trailing \0, and it is implicitly duped, even though it
> // was not declared with an array literal.
>
> -------------------
> So that the difference between string literals and char array literals would
> simply be that the latter are polysemous. There would be no semantics
> associated with the form of the literal itself.
>
>
> We still have this oddity:
>
>
> void foo(char qqq = 'b') {
>
> string x = "abc"; // trailing \0
> string y = ['a', 'b', 'c']; // trailing \0
> string z = ['a', qqq, 'c']; // no trailing \0
> }
>
> This is because we made the (IMHO mistaken) decision to allow variables
> inside array literals.
> This is the reason why I listed _compile time value_ in the requirement for
> having a \0, rather than entirely basing it on the type.
>
> We could fix that with a language change: an array literal which contains a
> variable should not be of immutable type. It should be of mutable type (or
> const, in the case where it contains other, immutable values).
>
> So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
> though w is allocated on the heap).
>
> But that's a separate proposal from the one I'm making here. I just need a
> decision on the main proposal so that I can fix a pile of CTFE bugs.
Your proposal is probably correct.
I think the key idea is the *polysemous typed string literal*.
Based on the Ideal D Interpreter in my brain, the rules can be organized
as follows.
1-1) At the semantic level, D should have just one polysemous string
literal, which is "an array of char".
1-2) At the token level, D has two representations of the polysemous
string literal: "str" and ['s','t','r'].
2) The polysemous string literal is implicitly convertible to
[wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is
not needed, because immutable([wd]?char)[] is implicitly convertible to
it).
3) The result of concatenating polysemous literals is still
polysemous, but its representation depends on the operands on both
sides of the operator.
"str" ~ "str"; // "strstr"
"str" ~ ['s','t','r']; // ['s','t','r','s','t','r']
"str" ~ 's'; // "strs"
['s','t','r'] ~ 's'; // ['s','t','r','s']
"str" ~ null; // "str"
['s','t','r'] ~ null; // ['s','t','r']
4) After semantic analysis _and_ optimization, a polysemous string
literal that is represented as
4-1) "str" is typed as immutable([wd]?char)[] (the char type
depends on the literal suffix).
4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type
depends on the common type of its elements).
5) In the object file generation phase, a string literal that is typed as
5-1) immutable([wd]?char)[] is stored in the executable and
implicitly terminated with \0.
5-2) [wd]?char[] is stored in the executable as the original image
and implicitly 'dup'ed at run time.
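
As a rough check of rules 3 and 5 against behaviour that already exists
today, the sketch below tests only the parts I am sure of: every spelling
of the concatenation yields the same characters, a plain string literal
is followed by a \0, and an array literal is copied on each evaluation.
The helper names (sameCharacters, literalIsZeroTerminated, fresh,
arrayLiteralIsCopied) are made up for illustration. The last line only
prints how the compiler in use types the mixed form, so no particular
output is assumed.

```d
import core.stdc.string : strlen;
import std.stdio : writeln;

// Rule 3: every spelling of the concatenation yields the same characters.
bool sameCharacters()
{
    auto a = "str" ~ "str";
    auto b = "str" ~ ['s', 't', 'r'];
    auto c = "str" ~ 's';
    auto e = ['s', 't', 'r'] ~ 's';
    return a == "strstr" && b == "strstr" && c == "strs" && e == "strs";
}

// Rule 5-1: a string literal is followed by a \0 that is not counted in
// its length, so its pointer can be handed to C functions directly.
bool literalIsZeroTerminated()
{
    string s = "str";
    return s.length == 3 && s.ptr[s.length] == '\0' && strlen(s.ptr) == 3;
}

// Rule 5-2: each evaluation of an array literal yields a fresh mutable copy.
char[] fresh()
{
    return ['s', 't', 'r'];
}

bool arrayLiteralIsCopied()
{
    char[] first = fresh();
    first[0] = 'S';           // mutating one copy ...
    char[] second = fresh();  // ... must not affect the next one
    return first == "Str" && second == "str";
}

void main()
{
    static assert(sameCharacters()); // the same values come out of CTFE
    assert(sameCharacters());
    assert(literalIsZeroTerminated());
    assert(arrayLiteralIsCopied());

    // Shows the type the current compiler picks; not asserted on purpose.
    writeln(typeof("str" ~ ['s', 't', 'r']).stringof);
}
```

The \0 check is deliberately limited to a plain string literal, because
that is the only case the spec mentions; whether the other spellings get
a terminator is exactly what the proposal would settle.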
----
Additionally, in the following cases, both concatenations should generate
polysemous string literals at both compile time (CT) and run time (RT),
because after concatenating chars and char arrays, the newly allocated
string is a *purely immutable* value and is implicitly convertible to
mutable.
immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r'])); // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r']; // BUT, should be allowed
char mc = 'a';
pragma(msg, typeof("st"~mc~"r")); // prints const(char)[]
char[] s = "st"~mc~"r"; // BUT, should be allowed
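
Until such conversions are accepted, the explicit spelling is a copy
with .dup or .idup, which works however the compiler types the
expression. This is only a sketch of that workaround, not part of the
proposal; buildMutable and buildImmutable are hypothetical names.

```d
// Hypothetical helpers; only the .dup and .idup properties are standard D.
char[] buildMutable(char mc)
{
    // An explicit copy gives char[] whatever type the concatenation has.
    return ("st" ~ mc ~ "r").dup;
}

string buildImmutable(immutable char ic)
{
    // An explicit immutable copy gives immutable(char)[] in the same way.
    return ['s', 't', ic, 'r'].idup;
}

void main()
{
    assert(buildMutable('a') == "star");
    assert(buildImmutable('a') == "star");
}
```

The cost is one extra allocation per expression, which is what treating
the result as polysemous would avoid.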
Kenji Hara