Creeping Bloat in Phobos

Mon Sep 29 06:30:08 PDT 2014

On Sun, 28 Sep 2014 16:48:53 -0700,
Walter Bright <newshound2 at digitalmars.com> wrote:

> Regardless, the replacement character method is widely used and accepted 
> practice. There's no reason to throw.

I feel a bit uneasy about this. Could it introduce a silent
loss of information? While the replacement character method is
widely used, so is the error method. APIs typically provide
flags for this.

MultiByteToWideChar: The flag MB_ERR_INVALID_CHARS decides
     whether the API errors out or drops invalid chars (since
     Windows Vista it substitutes U+FFFD instead of dropping).
ICU: You set up an "error callback". The default replaces
     invalid characters with the Unicode substitution
     character. (We are talking about characters from
     arbitrary charsets like Amiga to Unicode.)
     Other prefab error handlers drop the invalid character or
     error out.
iconv: By default it errors out at the location where an
     incomplete or invalid sequence is detected. With the
     "//IGNORE" flag, it will silently drop invalid characters.

I'm not opposed to the decision, but I expected the reasoning
to be more along the lines of:
`string` is by definition correct UTF-8. Exception or
substitution character is of no concern to a correctly
written D program, because decoding errors won't happen.
Validate and treat all input as ubyte[]. (Especially when
coming from a Windows console)
or:
We may lose information in the conversion, but it's the only
practical way to reach the @nogc goal. And we are far from
having reference-counted Exceptions.
instead of:
Many people use the substitution character [in unspecified
context], so it follows that it can replace Exceptions for
Phobos' string-dchar decoding. :)

-- 
Marco