utf code unit sequence validity (non-)checking
Steven Schveighoffer
schveiguy at yahoo.com
Wed Dec 1 14:25:48 PST 2010
On Wed, 01 Dec 2010 07:35:15 -0500, spir <denis.spir at gmail.com> wrote:
> Hello,
>
>
> I just noticed that D's built-in *string types do not behave the same
> way when given invalid code unit sequences. For instance:
>
> void main () {
>     assert("hæ?" == "\x68\xc3\xa6\x3f");
>     // Note: removing \xa6 makes the sequence invalid UTF-8.
>
>     string s1 = "\x68\xc3\x3f";
>     // ==> OK, accepted -- but write-ing it produces "h�?".
>
>     dstring s4 = "\x68\xc3\x3f";
>     // ==> compile-time Error: invalid UTF-8 sequence
> }
>
> I guess this is because, while converting from string to dstring,
> meaning while decoding code units to code points, D is forced to check
> sequence validity. But this is not needed, and not done, for utf8
> string. Am I right on this?
> If yes, isn't it risky to leave utf8 strings (and wstrings?) unchecked? I
> mean, to have a concrete safety difference with dstrings? I know there are
> utf checking routines in the std lib, but for dstrings one does not need to
> call them explicitly.
> (Note that this checking is done at compile-time for source code
> literals.)
I agree, the compiler should verify that all string literals are valid UTF.
Can you file a Bugzilla enhancement request if there isn't already one?
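
In the meantime, a string can be checked explicitly at run time. Here is a
minimal sketch, assuming Phobos' std.utf.validate and its UTFException; the
helper name isValidUtf8 is made up for illustration. The invalid bytes are
built from a ubyte array so the example does not depend on how the compiler
treats such a literal:

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is a well-formed UTF-8 sequence.
// std.utf.validate throws UTFException on the first invalid sequence.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "h", then the two-byte encoding of "æ", then "?": valid UTF-8.
    string good = "\x68\xc3\xa6\x3f";

    // Same bytes with \xa6 removed: \xc3 starts a two-byte sequence,
    // but \x3f is not a continuation byte, so this is invalid.
    immutable(ubyte)[] raw = [0x68, 0xc3, 0x3f];
    string bad = cast(string) raw;

    assert(isValidUtf8(good));
    assert(!isValidUtf8(bad));
}
```

This is the explicit check that the string case currently leaves to the
programmer, whereas the dstring conversion performs it implicitly.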
-Steve