suggestion of type: ustring

Jonathan M Davis jmdavisProg at gmx.com
Sun Mar 20 06:14:03 PDT 2011


> D's string is supposed to be utf8 encoded, however, the following code
> compiles and runs with no error:
> 
>   string s = "\xff"; // s is invalid
>   writeln(s);
>   fileStream.writeLine(s);
> 
> In order to make sure only valid utf8 string is used in the system,
> validating is needed everywhere, e.g.
> 
>   string cut3bytes(string s)
>   in {validate(s);}
>   out(result) {validate(result);}
>   body {return s.length > 3 ? s[0..3] : s;}
> 
> I think it will be better if D has a ustring type to do all the validating
> job. e.g.
> 
>   ustring s = "\xff";  // compile error
> 
>   char[] c = [0xFF];
>   ustring s = c.idup;  // throw UtfException
> 
>   ustring s1 = "\xc2\xa2";
>   ustring s2 = s1[0..1];  // throw UtfException
> 
> So the above example can be simplified to:
> 
>   ustring cut3bytes(ustring s)
>   {return s.length > 3 ? s[0..3] : s;}

It would be prohibitively expensive to be constantly validating strings. You 
validate them at the point that they're created, and then you generally don't 
worry about it after that. Some functions do check that a string is properly 
encoded, but most don't. If you want a string type that 
actually validates on every operation, feel free to define a struct which 
holds a string internally and has all of the appropriate overloaded operators 
so that it's a range of dchar and whatnot. But you're going to have a hard 
time convincing folks that such a type should be in Phobos, and there's no way 
that it would make it into the language itself.
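
As a rough illustration of the kind of struct described above, here is a minimal sketch. The name ValidString and its members are hypothetical, not anything in Phobos; it assumes std.utf's validate, decode and stride, and it validates at construction and on slicing rather than on literally every operation.

```d
import std.utf : validate, decode, stride;

// Hypothetical wrapper type; the name and layout are assumptions for illustration.
struct ValidString
{
    private string data;

    // Validate once, when the value is created.
    // std.utf.validate throws UTFException on malformed UTF-8.
    this(string s)
    {
        validate(s);
        data = s;
    }

    // Input range primitives, iterating by code point (dchar).
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    void popFront()
    {
        // stride gives the number of code units in the sequence at index 0.
        data = data[stride(data, 0) .. $];
    }

    // Slicing by byte offsets re-validates, so cutting through a
    // multi-byte sequence throws instead of yielding a broken string.
    ValidString opSlice(size_t lo, size_t hi)
    {
        return ValidString(data[lo .. hi]);
    }

    string toString() { return data; }
}
```

With this sketch, ValidString("\u00A2")[0 .. 1] throws, since the slice ends in the middle of a two-byte sequence. Every slice pays for a full validation pass over the result, which is exactly the cost being discussed.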

And honestly, how often do you have to worry about invalid strings? As long as 
you check them when they're created, you won't generally have problems with 
invalid strings, and it's a lot less expensive than constantly checking their 
validity whenever you do anything with them.
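
To make "check them when they're created" concrete, here is a small sketch of validating once at the boundary where outside data becomes a string. The helper name toValidString is hypothetical; the only library call assumed is std.utf.validate.

```d
import std.utf : validate;

// Hypothetical helper: convert raw bytes from outside the program
// (a file, a socket, etc.) into a string, checking the encoding once.
string toValidString(const(ubyte)[] raw)
{
    // Copy the bytes into immutable storage and view them as UTF-8 code units.
    auto s = cast(string) raw.idup;

    // Throws std.utf.UTFException if the bytes are not well-formed UTF-8.
    validate(s);
    return s;
}
```

Code further down the line then takes string and assumes it is valid, with no further checks. Note that this does not stop a later slice by byte offsets from cutting through a code point; that is a separate concern from validating the input.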

- Jonathan M Davis

