DIP 1026---Deprecate Context-Sensitive String Literals---Community Review Round 1

Wed Dec 4 01:26:24 UTC 2019

On Tue, Dec 03, 2019 at 09:34:26PM +0000, Dennis via Digitalmars-d wrote:
> On Tuesday, 3 December 2019 at 18:34:22 UTC, H. S. Teoh wrote:
[...]
> D has 6 types of string literals ("double quote" `back tick` r"r
> string" q{tokens} q"<brackets>" q"EOS ident EOS") with 3 encoding
> options (char, wchar, dchar).

Walter has admitted that having 3 encodings, with the corresponding 3
string types, was a "miss" in D's design, and that he should have just
stuck with UTF-8. UTF-16 is occasionally useful for interfacing with
Windows APIs, but that's pretty narrow and contained, and nobody uses
UTF-32 strings in practice.  I've not seen many examples of non-UTF-8
strings in D code myself.

I admit D having 6 types of string literals is excessive, but as
somebody has already said, even if something was a mistake in
retrospect, that doesn't necessarily mean removing it isn't also a
mistake, because now you have the weight of existing code weighing
against removing it.

And just for a bit more perspective, Perl, PHP, bash, and probably many
others have heredoc syntax, and Python has triple-quoted multi-line
strings that serve the same purpose. If heredocs were really such a bad
idea, why are people putting them into so many languages, over and over
again? Perhaps, just perhaps, there are use cases for them that this DIP
has overlooked / underrepresented?  I don't hear people clamoring for
removing multi-line strings from Python, for example, so I'm really
having a hard time understanding why we're having this debacle right
now.

> For comparison, Java has one. C# has two + interpolated strings.
> There is a DIP for adding interpolated strings to D.

That DIP seems dead in the water though. The author has vanished and
nobody has taken up the reins.

> People are mentioning how D keeps adding features and is on a
> road towards C++ complexity. There is precedent for removing barely
> used features (see e.g. octal, escape or hexstring literals on
> https://dlang.org/deprecate.html).

Actually, I was a bit disappointed with the removal of hexstring
literals, but the issue is somewhat more complex. The problem with
hexstring literals was that it was some kind of half-hearted attempt at
supporting literal hexadecimal data, because it coerces the result into
string rather than ubyte[]. The hexstring *syntax* was ideal for
entering hex data, but then having the result coerced into string seemed
to me like a backwards misfit. If it had produced a ubyte[] then there
would have been much more reason to keep it in the language, since
occasionally it's very useful to be able to enter blocks of binary data
in hex.  As to why the original design produced a string rather than a
ubyte[], I can only speculate. Perhaps it was meant as a poor man's way
of writing a Unicode string without a Unicode-aware keyboard / input
method?  Who knows.  In any case, *that* use case is rendered completely
moot by the \u.... and \U........ escape sequences in your regular
double-quoted string.  The ubyte[] use case is arguably implementable in
a CTFE parser the same way octal literals are (std.conv.octal), and so
hexstrings went the way of the dodo.
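
For illustration, a minimal sketch of what such a CTFE parser could look
like. `hexBytes` is a hypothetical helper written for this example, not a
Phobos function; it assumes ASCII hex digits separated by optional
whitespace and rejects anything else.

```d
/// Hypothetical helper: decodes pairs of hex digits into bytes,
/// skipping whitespace. Plain D code, so it also runs under CTFE.
ubyte[] hexBytes(string s) pure @safe
{
    ubyte[] result;
    int pending = -1;   // high nibble waiting for its partner, or -1
    foreach (char c; s)
    {
        int nibble;
        if (c >= '0' && c <= '9') nibble = c - '0';
        else if (c >= 'a' && c <= 'f') nibble = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') nibble = c - 'A' + 10;
        else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') continue;
        else throw new Exception("invalid hex digit");

        if (pending < 0)
            pending = nibble;
        else
        {
            result ~= cast(ubyte)((pending << 4) | nibble);
            pending = -1;
        }
    }
    if (pending >= 0)
        throw new Exception("odd number of hex digits");
    return result;
}

// A module-level initializer forces compile-time evaluation.
immutable ubyte[] magic = hexBytes("DE AD be ef 00 01");

unittest
{
    immutable ubyte[] expected = [0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01];
    assert(magic == expected);
}
```

The result is typed as ubyte[] from the start, which is the property the
old hexstring literals lacked.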

> And of course there are always users that mourn the removal of their
> favorite feature, but in the long run everyone benefits from a simpler
> language.
> 
> As for your use case of code generation, I'm having trouble relating
> to it.  I happened to write some code generation algorithms myself
> recently, and could do fine with q{} strings for large templates and
> regular "" or `` string for small token parts like "switch(".

q{} works well for emitting *D code*.  Not so well for non-D code.
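
A small sketch of the difference, using made-up snippets. The contents of
a q{} string have to lex as D tokens, so (as far as I understand the
lexer) an unpaired apostrophe such as the one in "Don't" would be read as
the start of a character literal and rejected; the heredoc form takes the
text as-is.

```d
// Token string: fine for generated D code, which can then be mixed in.
enum dSnippet = q{
    int twice(int x) { return 2 * x; }
};

// Heredoc: arbitrary non-D text, including quotes, apostrophes and
// unbalanced-looking comparison operators.
enum htmlSnippet = q"HTML
<p class="note">Don't panic: 1 > 0 && 0 < 1</p>
HTML";

mixin(dSnippet);   // declares twice() from the token string

unittest
{
    assert(twice(21) == 42);
    // The text runs from the line after the opening delimiter up to and
    // including the newline before the closing one.
    assert(htmlSnippet ==
        "<p class=\"note\">Don't panic: 1 > 0 && 0 < 1</p>\n");
}
```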

> - Do you truly have 50,000 character string literals in your code base?

No, but I do have a number of large multi-line string literals that
simply look best / are most maintainable in heredoc format.

> - Can't you use bracket delimited strings instead, q"<like this?>"

Heredoc syntax is better because the ending delimiter is obvious. When
the string literal spans multiple lines, single-character terminating
delimiters just aren't the best way to do it.

> - If accidental early termination in huge string literals is a
> concern, even an identifier-delimited string isn't always safe. Can't
> you use an `import()` statement on an external text file?

An identifier-delimited string is safe because the literal is typed in
directly as code, so you already know beforehand what words might appear
or not appear in it, and you already know what will *never* appear in
the string.  It isn't as though I'm copy-n-pasting arbitrary text from
arbitrary input files into my code just for fun.

String imports require creating an extra file to contain the string, and
running the compiler with -J + the right path(s), all of which
are extra hurdles to jump through.  It's the same thing with external
unittests vs. unittest blocks that you can just write inline. It's
*possible*, but inconvenient and liable to go out-of-sync as you modify
the code.
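
To make the trade-off concrete, here is a sketch of a usage message kept
inline as a heredoc, with the string-import alternative shown as a
comment. The tool name, the option list, and the res/usage.txt path are
invented for the example.

```d
import std.stdio : write;

// Inline: the text sits next to the code that prints it.
enum usageText = q"USAGE
Usage: mytool [options] <input>
  -h, --help     show this message
  -o <file>      write output to <file>
USAGE";

// External alternative: needs an extra file plus a -J search path,
// e.g. `dmd -J=res mytool.d` with the text stored in res/usage.txt:
//
//     enum usageText = import("usage.txt");

void printUsage()
{
    write(usageText);   // the heredoc already ends with a newline
}

unittest
{
    assert(usageText[0 .. 6] == "Usage:");
    assert(usageText[$ - 1] == '\n');
}
```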

> - If those 50,000 characters are code and you value readability of it,
> isn't it a problem that there is no syntax highlighting in a q"EOS
> EOS" string?

As I said, I don't use a syntax highlighter.  Also, any attempt to
highlight is moot if the string contains code of a different language
(see below for my use cases).

> - Can you maybe post an example of some of your q"EOS EOS" strings
> used for code generation?

I feel a single example will not adequately convey my point. Here's a
list of use cases I use heredocs for (in no particular order):

1) Generating HTML snippets
2) Generating PovRay scene description snippets
3) Generating D code snippets
4) Generating snippets of a DSL I use for generating geometric models
5) Generating boilerplate for input data to an external convex hull
   solver (has its own peculiar syntax)
6) Generating GLSL shader code snippets
7) Generating Java code snippets
8) Command line usage descriptions

Some of this code is somewhat old but is actively used as infrastructure
for my current projects, and having to go back to rewrite heredocs just
because of some ivory tower ideal of "cleaning up useless literals in D"
is rather distasteful to me, you understand, esp. since I don't even use
syntax highlighting in the first place, so this is just pure work for
zero benefit.  If we were still in the early stages of D development,
then sure, go ahead and nuke heredocs if you have very good reasons for
it, but I'm not about to go rewriting code for (1) to (8) now, not when
there's basically zero benefit in doing so.

> > As for poor syntax highlighting as mentioned in the DIP, how is that
> > even a problem with the language?! It's a strawman argument based on
> > skewed data obtained from badly-written lexers that don't actually
> > lex D code correctly. It should be the syntax highlighter that
> > should be fixed, rather than deprecate an actually useful feature in
> > the language.
> 
> The thing is, these string literals simply can't be expressed in e.g.
> a PEG grammar.

?!  Can't you just use a custom lexer with your PEG grammar?

> D's grammar is one complexity class higher than needed just for
> this one relatively obscure string literal. Sure you can say "not our
> problem, those tooling authors just need to account for D's
> complexity", but I don't think that is useful for D's tooling
> ecosystem.
[...]

Then isn't the solution simply to write a self-contained heredoc parsing
function, put it in a dub package, and let everyone reuse it? Then
nobody will have to write it for themselves again. Problem solved.

(If it's even that complex to begin with. As I said, we already have 2
working examples of syntax highlighter code that work fine with
heredocs. It's not as though D invented heredocs; they have been around
since the early days of the Unix shell, and people have been writing
parsing code for it for a long time. Its supposed "complexity" is really
blown out of proportion here.)
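
As a rough idea of how small such a function can be, here is a sketch of
a scanner that a hand-written highlighter lexer could call when it sees
`q"`. `heredocEnd` is a hypothetical name, and the sketch makes several
simplifying assumptions: ASCII-only delimiter identifiers, `\n` line
endings, and no attempt to reproduce the compiler's error handling for
malformed literals.

```d
import std.ascii : isAlpha, isAlphaNum;

/// Hypothetical helper: `src[start .. $]` should begin with `q"`.
/// Returns the index just past the closing quote of an
/// identifier-delimited string, or -1 if the text at `start` is not a
/// heredoc or the literal is unterminated.
ptrdiff_t heredocEnd(string src, size_t start) pure nothrow @safe
{
    // Must begin with q" followed by an identifier start character.
    if (start + 3 > src.length || src[start .. start + 2] != `q"`)
        return -1;
    size_t i = start + 2;
    if (!(isAlpha(src[i]) || src[i] == '_'))
        return -1;              // bracket- or character-delimited form

    // Read the delimiter identifier; a newline must follow it.
    const idStart = i;
    while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_'))
        ++i;
    const ident = src[idStart .. i];
    if (i >= src.length || src[i] != '\n')
        return -1;
    ++i;

    // Look for a line that starts with the identifier followed by ".
    while (i < src.length)
    {
        const rest = src[i .. $];
        if (rest.length > ident.length
            && rest[0 .. ident.length] == ident
            && rest[ident.length] == '"')
            return i + ident.length + 1;

        // Advance to the start of the next line.
        while (i < src.length && src[i] != '\n')
            ++i;
        ++i;
    }
    return -1;                  // unterminated literal
}

unittest
{
    enum code = "auto s = q\"EOS\nfoo \"bar\"\nnot EOS\" yet\nEOS\"; int x;";
    const end = heredocEnd(code, 9);    // 9 is the index of the 'q'
    assert(end > 0 && code[end .. $] == "; int x;");
    assert(heredocEnd(`q"<not a heredoc>"`, 0) == -1);
}
```

The only state carried across lines is the delimiter identifier, which
is what a custom lexer in front of a PEG grammar would need to remember.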

This whole debacle feels like heredocs are being singled out as a
scapegoat in a misguided quest to "simplify the language".  Like we're
grasping at straws because we're unable to tackle the bigger issues, so
here's a convenient simple target we can shoot and kill and feel good
about ourselves that we're finally making progress.  Talk about
straining out a gnat and swallowing a camel.

T

-- 
"I'm not childish; I'm just in touch with the child within!" - RL