Making all strings UTF ranges has some risk of WTF

Steven Schveighoffer schveiguy at yahoo.com
Mon Feb 8 07:41:02 PST 2010


On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> dsimcha wrote:
>>  I personally would find this extremely annoying because most of the  
>> code I write
>> that involves strings is scientific computing code that will never be
>> internationalized, let alone released to the general public.  I  
>> basically just use
>> ASCII because it's all I need and if your UTF-8 string contains only  
>> ASCII
>> characters, it can be treated as random-access.  I don't know how many  
>> people out
>> there are in similar situations, but I doubt they'll be too happy.
>>  On the other hand, I guess it wouldn't be hard to write a simple  
>> wrapper struct on
>> top of immutable(ubyte)[] and call it AsciiString.  Once alias this  
>> gets fully
>> debugged, I could even make it implicitly convert to immutable(char)[].
>
> It's definitely going to be easy to use all sensible algorithms with  
> immutable(ubyte)[]. But even if you go with string, there should be no  
> problem at all. Remember, telling ASCII from UTF is one mask and one  
> test away, and the way Walter and I wrote virtually all related routines  
> was to special-case ASCII. In most cases I don't think you'll notice a  
> decrease in performance.
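
As an aside, the fast path Andrei mentions really is just one mask and one
test per code unit.  A minimal sketch (my interpretation, not the actual
Phobos code):

import std.stdio : writeln, writefln;

// high bit clear => single-byte ASCII code unit, no decoding needed
bool isAsciiCodeUnit(char c)
{
    return (c & 0x80) == 0;
}

void main()
{
    string s = "héllo";        // mixed ASCII and multi-byte UTF-8
    foreach (char c; s)        // iterate raw code units
    {
        if (isAsciiCodeUnit(c))
            writeln("ASCII: ", c);
        else
            writefln("non-ASCII code unit: 0x%02X", cast(ubyte) c);
    }
}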

I'm in the same camp as dsimcha: I generally write all my apps assuming
ASCII strings (most are internal tools anyway).

Can the compiler help make ASCII strings easier to use?  I.e., this
already works:

wstring s = "hello"; // converts to immutable(wchar)[]

what about this?

asciistring a = "hello";  // converts to immutable(ubyte)[] (or immutable(ASCIIChar)[])
asciistring a = "\uFBCD"; // error, requires cast.

The only issue that remains to be resolved then is the upgradability that
ASCII characters currently enjoy with respect to UTF-8: I can call any
UTF-8-accepting function with an ASCII string, but not an ASCII-string-accepting
function with UTF-8 data.

Ideally, there should be a 7-bit ASCII character type that implicitly  
upconverts to char, and can be initialized with a string literal.
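
Something close to this can be approximated in library code today (a rough,
untested sketch; AsciiChar and AsciiString are made-up names, and the
validation here is run-time, whereas what I really want is the compiler
rejecting non-ASCII literals at compile time):

struct AsciiChar
{
    ubyte value;
    invariant() { assert(value < 0x80); }     // 7-bit only
    char toChar() const { return cast(char) value; }
    alias toChar this;                        // implicit upconversion to char
}

struct AsciiString
{
    immutable(ubyte)[] data;

    this(string s)
    {
        foreach (char c; s)                   // run-time check only
            assert((c & 0x80) == 0, "non-ASCII data in AsciiString");
        data = cast(immutable(ubyte)[]) s;
    }

    // implicit conversion to string: any UTF-8-accepting function also
    // accepts an AsciiString, but not the other way around
    string toString() const { return cast(string) data; }
    alias toString this;
}

void takesUtf8(string s) {}
void takesAscii(AsciiString s) {}

void main()
{
    auto a = AsciiString("hello");
    takesUtf8(a);             // fine: ASCII upconverts to UTF-8
    // takesAscii("hello");   // error: no implicit string -> AsciiString
}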

In addition, you are putting D's UTF-8 char even further away from C's
ASCII char.  It would be nice to separate C-compatible strings from D
strings.  At some point, I should be able to designate that a function
(even a C function) takes only ASCII data, and the compiler should disallow
passing general UTF-8 data to it.  This involves either renaming D's char
to keep source closer to C, or rewriting C function signatures to reflect
the difference.
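
For example (again just a sketch, reusing the hypothetical AsciiString from
above), a C binding could be wrapped so the type system actually enforces
the restriction:

extern(C) int puts(const(char)* s);   // today: happily accepts any UTF-8

// D-side shim that only accepts validated ASCII data
int asciiPuts(AsciiString s)
{
    import std.string : toStringz;
    string utf = s;                   // implicit upconversion; contents are 7-bit
    return puts(toStringz(utf));
}

// asciiPuts("héllo");               // rejected: no implicit string -> AsciiString
// asciiPuts(AsciiString("hello"));  // fine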

-Steve


