[Dlang-study] [rcstring] Defining rcstring

Mon Feb 8 00:50:05 PST 2016

On 02/07/2016 01:47 AM, Andrei Alexandrescu wrote:
>>> D uses UTF for strings. Vivid anecdotes aside, we really can't be
>>> everything to everyone. Your friend could have written a translator to
>>> UTF in a few lines. The DNA optimization points at performance bugs in
>>> phobos that as far as I know have been fixed or are fixable by rote. I
>>> think this non-UTF requirement would just stretch things too far and
>>> smacks of solving the wrong problem.
>>
>> From a purely technical point of view you are perfectly right. But does
>> that make it any better that potential users leave disappointed?
> 
> What greener pastures do they leave for? We should take a page from the
> languages that support multiple encodings seamlessly.

He has switched to a Haskell parser combinator library, Parsec
(https://wiki.haskell.org/Parsec), but I don't know anything about it
or how it exposes encoding support. My personal understanding is that
he was so frustrated by debugging mysterious parsing failures, with
neither the language nor the library even slightly hinting at what
could be wrong, that he moved to a different solution even though it
wasn't strictly superior.

My gut feeling is that D is right to make UTF-8 the default and main
supported option, but the problem is that it assumes everything is
UTF-8 too silently, so neither library writers nor developers are made
aware of it soon enough.

To be more specific, consider this canonical D example:

auto processed_text =
    File("something.txt", "r")
    .byLineCopy()
    .doSomeProcessing();

It is an easy and natural thing to write, so there is a high chance
someone will write it without remembering that doSomeProcessing() will
do UTF-8 decoding internally. A very simple addition can improve it a
lot:

auto processed_text =
    File("something.txt", "r")
    .assumeUTF8() // or validateUTF8() to do early validation
                  // with throwing
    .byLineCopy()
    .doSomeProcessing();

This changes nothing in functionality or in the actual support for
different encodings. Yet it changes two important things:

1) It serves as a visual reminder: "Hey, this assumes UTF-8, maybe you
should consider using `File.rawRead` instead?"
2) It allows a choice between eagerly enforcing that the input is valid
UTF-8 and simply assuming it.
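
To make the idea a bit more concrete, here is a minimal sketch of what
the two helpers could look like. Both `assumeUTF8` and `validateUTF8`
are hypothetical names from this proposal, not existing Phobos
functions; only `std.utf.validate` and `UTFException` are real. To keep
the sketch short it wraps the range of lines (so it would sit after
`byLineCopy()` rather than directly on `File`), which is an assumption
made here for illustration and not part of the proposal itself.

```d
import std.algorithm : map;
import std.array : array;
import std.range.primitives : ElementType, isInputRange;
import std.utf : UTFException, validate;

/// Hypothetical helper, not in Phobos: passes lines through unchanged
/// and only documents at the call site that UTF-8 is being assumed.
auto assumeUTF8(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
    return lines;
}

/// Hypothetical helper, not in Phobos: checks every line as it is
/// consumed and throws UTFException on the first malformed one,
/// before any later processing step gets to decode it.
auto validateUTF8(R)(R lines)
if (isInputRange!R && is(ElementType!R : const(char)[]))
{
    return lines.map!((line) {
        validate(line); // std.utf.validate throws on invalid UTF-8
        return line;
    });
}
```

With helpers along these lines, the pipeline would read
`File("something.txt", "r").byLineCopy().validateUTF8().doSomeProcessing()`,
and a malformed line would fail at the validation step with a clear
UTFException instead of somewhere inside doSomeProcessing().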

I won't insist if this topic looks completely out of line; it is not
that important. But, as I have already mentioned, this is likely to be
the last chance to change anything about it, if a change is ever
wanted.
