First Impressions
Chad J
"gamerChad\" at spamIsBad gmail.com
Fri Sep 29 12:57:00 PDT 2006
Anders F Björklund wrote:
> Chad J wrote:
>
>> I'd like the default to be UTF. Then we can have a base of code to
>> correctly manipulate UTF strings (in phobos and language supported).
>> Writing correct ASCII manipulation routines without good library/language
>> support is a lot easier than writing good UTF manipulation routines
>> without good library/language support, and UTF will probably be used
>> much more than ASCII.
>
>
> But D already uses Unicode for all strings, encoded as UTF ?
>
> When you say "ASCII", do you mean 8-bit encodings perhaps ?
> (since all proper 7-bit ASCII are already valid UTF-8 too)
>
Probably 7-bit: anything where the size of one character is ALWAYS one
byte. I am already assuming that ASCII is a subset, or at least mostly
a subset, of UTF-8. However, I talk about it as a separate case
because code that handles UTF-8 strings properly will probably run at
least slightly slower than code written for ASCII-only strings.
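To make the one-byte-per-character point concrete, here is a minimal sketch in present-day D (not the 2006 compiler this thread was written against). The helper name codePointCount is made up for illustration. It shows that .length on a UTF-8 string counts bytes, while counting characters means decoding the whole string.

```d
/// Hypothetical helper: count code points by decoding, the same way
/// foreach over dchar walks a UTF-8 array. This is O(n).
size_t codePointCount(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        n++;
    return n;
}

unittest
{
    string name = "Bj\u00F6rklund"; // the o-umlaut takes two bytes in UTF-8
    assert(name.length == 10);          // .length counts bytes (code units)
    assert(codePointCount(name) == 9);  // nine code points
    assert(codePointCount("hello!") == "hello!".length); // pure ASCII: both agree
}
```

For pure ASCII the two numbers always match, which is why ASCII-only routines can skip the decoding pass.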
>> Also, if we move over to full blown UTF, we won't have to give up
>> ASCII. It seems to me like the phobos std.string functions are pretty
>> much ASCII string manipulating functions (no multibyte string
>> support). So just copy those out to a separate library, call it
>> "ASCII lib", and there's your library support for ASCII. That leaves
>> string literals, which is a slight problem, but I suppose easily fixed:
>> ubyte[] hi = "hello!"a;
>
>
> I don't understand this, why can't you use UTF-8 for this ?
>
> char[] hi = "hello!";
>
I meant that IF we made char[] into a datatype that handles all of
those odd corner cases correctly (slices into multibyte strings, for
instance), then its routines would no longer be the same fast
ASCII-only routines. So for those who want the fast ASCII-only stuff,
it would be nice to have a way to write string literals such that each
character in the literal takes only one byte, without ugly casting. To
get an ASCII monobyte string from a string literal in D I would have to
do the following:
    ubyte[] hi = cast(ubyte[])"hello!";
hmmm, yuck.
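Short of a new literal suffix, the cast can at least be hidden behind a checked helper. This is only a sketch in present-day D, where string literals are immutable; the name asciiBytes is hypothetical and not part of Phobos as far as this example assumes.

```d
/// Hypothetical helper: view a string as ASCII bytes, refusing any
/// input that contains a byte outside the 7-bit range.
immutable(ubyte)[] asciiBytes(string s)
{
    foreach (char c; s)
        if (c > 0x7F)
            throw new Exception("non-ASCII byte in string");
    // Same memory, different element type; no copy is made.
    return cast(immutable(ubyte)[]) s;
}

unittest
{
    auto hi = asciiBytes("hello!");
    assert(hi.length == 6);  // one byte per character, guaranteed by the check
    assert(hi[0] == 0x68);   // 'h'
}
```

The check costs one pass over the string at run time, which a real "..."a literal would not need because the compiler could reject bad literals up front.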
>> Just add a postfix 'a' for strings which makes the string an ASCII
>> literal, of type ubyte[]. D arrays don't seem powerful enough to do
>> UTF manipulations without special attention, but they are powerful
>> enough to do ASCII manipulations without special attention, so using
>> ubyte[] as an ASCII string should give full language support for
>> these. Given that and ASCIILIB you pretty much have the current D
>> string manipulation capabilities afaik, and it will be fast.
>
>
> What is not powerful enough about the foreach(dchar c; str) ?
> It will step through that UTF-8 array one codepoint at a time.
>
I'm assuming 'str' is a char[], which would make that very nice. But it
doesn't solve slicing or indexing into a char[] correctly. If nothing
was done about this and I absolutely needed UTF support, I'd probably
make a class like so:
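    class String
    {
        char[] data;
        ...
        dchar opIndex( int index )
        {
            // Count code points ourselves: the index that foreach
            // supplies over a char[] is a byte offset, not a
            // character position.
            int n = 0;
            foreach( dchar c; data )
            {
                if ( n == index )
                    return c;
                n++;
            }
            throw new Exception( "String index out of bounds" );
        }
        // similar thing for opSlice down here
        ...
    }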
Which is probably slower than it could be, since every index operation
walks the string from the start.
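For comparison, here is roughly how the same indexing and slicing could look in present-day D using std.utf.stride and std.utf.decode from Phobos. The function names byteOffset, codePointAt and sliceCodePoints are hypothetical, and this is a sketch rather than anything the library provides in this form.

```d
import std.utf : decode, stride;

/// Byte offset at which code point number `n` starts (hypothetical helper).
/// Still O(n): every earlier character has to be stepped over.
size_t byteOffset(const(char)[] s, size_t n)
{
    size_t i = 0;
    for (; n > 0; n--)
    {
        if (i >= s.length)
            throw new Exception("code point index out of range");
        i += stride(s, i); // byte width of the sequence starting at i
    }
    return i;
}

/// The n-th code point of a UTF-8 string.
dchar codePointAt(const(char)[] s, size_t n)
{
    size_t i = byteOffset(s, n);
    if (i >= s.length)
        throw new Exception("code point index out of range");
    return decode(s, i); // decodes one code point starting at byte i
}

/// Slice by code point positions [from, to) without cutting a
/// multibyte sequence in half.
const(char)[] sliceCodePoints(const(char)[] s, size_t from, size_t to)
{
    if (to < from)
        throw new Exception("slice bounds reversed");
    size_t start = byteOffset(s, from);
    size_t end = start + byteOffset(s[start .. $], to - from);
    return s[start .. end];
}

unittest
{
    string name = "Bj\u00F6rklund";
    assert(codePointAt(name, 2) == '\u00F6');
    assert(sliceCodePoints(name, 2, 4) == "\u00F6r");
    assert(name[2 .. 4] == "\u00F6"); // a raw byte slice covers only the o-umlaut
}
```

This skips over characters by their width instead of decoding each one, but it is still linear in the index, so the speed concern above stands.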
All in all it is a drag that we should have to learn all of this UTF
stuff. I want char[] to just work!