First Impressions

Chad J gamerChad at spamIsBad gmail.com
Fri Sep 29 12:57:00 PDT 2006


Anders F Björklund wrote:
> Chad J wrote:
> 
>> I'd like the default to be UTF. Then we can have a base of code to
>> correctly manipulate UTF strings (in phobos and language supported).
>> Writing correct ASCII manipulation routines without good library/language
>> support is a lot easier than writing good UTF manipulation routines
>> without good library/language support, and UTF will probably be used
>> much more than ASCII.
> 
> 
> But D already uses Unicode for all strings, encoded as UTF ?
> 
> When you say "ASCII", do you mean 8-bit encodings perhaps ?
> (since all proper 7-bit ASCII are already valid UTF-8 too)
> 

Probably 7-bit.  Anything where the size of one character is ALWAYS one 
byte.  I'm assuming that 7-bit ASCII is a strict subset of UTF-8 (every 
valid 7-bit ASCII string is already valid UTF-8).  However, I talk about 
it in an exclusive manner because code that handles UTF-8 strings 
properly will probably run at least slightly slower than code that can 
assume ASCII-only strings.
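To illustrate the cost difference: with ASCII-only data the n-th character 
is just a byte index, but with UTF-8 you have to decode from the start.  A 
minimal sketch (the helper name nthCodepoint is mine, not Phobos's; it 
relies only on D's built-in foreach decoding of char[] into dchar):

```d
// Hypothetical helper, not part of Phobos: fetch the n-th codepoint
// of a UTF-8 string.  This is O(n) because variable-width encoding
// forces a scan, whereas ASCII-only data allows O(1) byte indexing.
dchar nthCodepoint( char[] s, int n )
{
    int count = 0;
    // foreach with a dchar loop variable decodes the UTF-8 array
    // one codepoint at a time.
    foreach( dchar c; s )
    {
        if ( count == n )
            return c;
        count++;
    }
    throw new Exception( "index out of range" );
}
```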

>> Also, if we move over to full blown UTF, we won't have to give up 
>> ASCII.  It seems to me like the phobos std.string functions are pretty 
>> much ASCII string manipulating functions (no multibyte string 
>> support).  So just copy those out to a separate library, call it 
>> "ASCII lib", and there's your library support for ASCII.  That leaves 
>> string literals, which is a slight problem, but I suppose easily fixed:
>> ubyte[] hi = "hello!"a;
> 
> 
> I don't understand this, why can't you use UTF-8 for this ?
> 
> char[] hi = "hello!";
> 

I was saying that IF we made char[] into a datatype that handles all of 
those odd corner cases correctly (slices into multibyte strings, for 
instance), then it would no longer have the same fast ASCII-only 
routines.  So for those who want the fast ASCII-only stuff, it would be 
nice to have a way to write string literals where each character takes 
only one byte, without ugly casting.  To get an ASCII monobyte string 
from a string literal in D I would have to do the following:

ubyte[] hi = cast(ubyte[])"hello!";

hmmm, yuck.

>> Just add a postfix 'a' for strings which makes the string an ASCII 
>> literal, of type ubyte[].  D arrays don't seem powerful enough to do 
>> UTF manipulations without special attention, but they are powerful 
>> enough to do ASCII manipulations without special attention, so using 
>> ubyte[] as an ASCII string should give full language support for 
>> these.  Given that and ASCIILIB you pretty much have the current D 
>> string manipulation capabilities afaik, and it will be fast.
> 
> 
> What is not powerful enough about the foreach(dchar c; str) ?
> It will step through that UTF-8 array one codepoint at a time.
> 

I'm assuming 'str' is a char[], which would make that very nice.  But it 
doesn't make slicing or indexing into a char[] work correctly.  If 
nothing were done about this and I absolutely needed UTF support, I'd 
probably make a class like so:

class String
{
   char[] data;

   ...

   dchar opIndex( int index )
   {
     int n = 0;  // codepoint count; foreach's own index would be a byte offset
     foreach( dchar c; data )
     {
       if ( n == index )
         return c;

       n++;
     }
     throw new Exception( "index out of range" );
   }

   // similar thing for opSlice down here
   ...
}

Which is probably slower than it could be.
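For what it's worth, Phobos's std.utf already exposes the decoding 
primitives, so a lookup can be sketched without the per-codepoint 
delegate call that foreach compiles to.  This assumes std.utf's 
toUTFindex and decode behave as documented (toUTFindex returns the byte 
offset of the n-th codepoint; decode reads one codepoint and advances 
the offset); it is still an O(n) scan, just with less overhead:

```d
import std.utf;  // Phobos UTF helpers (assumed: toUTFindex, decode)

// Free-function version of the opIndex above, for illustration.
dchar charAt( char[] data, int index )
{
    // Walk the string once to the byte offset of the index-th codepoint,
    // then decode just that one codepoint.
    size_t i = toUTFindex( data, index );
    return decode( data, i );
}
```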

All in all it is a drag that we should have to learn all of this UTF 
stuff.  I want char[] to just work!


