toString issue
Hasan Aljudy
hasan.aljudy at gmail.com
Sun Oct 1 23:52:44 PDT 2006
Sean Kelly wrote:
>
> How about toUtf8() for classes and structs :-)
>
>
> Sean
I think there's a fundamental problem with the way D deals with strings.
The spec claims that D natively supports strings through char[] and, at
the same time, that D fully supports Unicode.
The fundamental issue is that UTF-8 is one encoding for Unicode strings,
but it's not always the best choice. Phobos mostly only deals with
char[], and mixing code that uses wchar[] with code that uses char[]
isn't very straightforward.
Consider the simple case of reading a text file and detecting "words".
To detect a word, you must first recognize letters, and not just English
letters but letters of any language; for that purpose, we have the
isUniAlpha function. Now, if you encode the string as char[], then how
are you gonna determine whether the next character is a Unicode
alpha or not?
The following definitely shouldn't work:
// assuming text is char[]
for( int i = 0; i < text.length; i++ )
{
    bool isLetter = isUniAlpha( text[i] );
    ....
}
because isUniAlpha takes a dchar parameter, and of course, because a
single char doesn't necessarily encode a Unicode character just by
itself; if you're dealing with non-English text, then most likely a
single char will only hold part of the encoding for that letter (half
of it, for the many scripts that need two bytes per letter in UTF-8).
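To make the "half the encoding" point concrete, here is a minimal sketch that counts how many array elements the same text needs in each encoding. It assumes a current D2 compiler and Phobos, where the conversion functions are spelled std.utf.toUTF16 and std.utf.toUTF32; UnitCounts and unitCounts are hypothetical names made up for this illustration.

```d
import std.utf : toUTF16, toUTF32;

/// Code units needed to store the same text as char[], wchar[] and dchar[].
struct UnitCounts
{
    size_t utf8, utf16, utf32;
}

UnitCounts unitCounts(string s)
{
    return UnitCounts(s.length, toUTF16(s).length, toUTF32(s).length);
}

unittest
{
    // U+0645 (an Arabic letter) takes two chars in UTF-8, so text[0]
    // alone is only the first half of the letter.
    assert(unitCounts("\u0645") == UnitCounts(2, 1, 1));
    // Plain ASCII is the one case where a char is a whole character.
    assert(unitCounts("abc") == UnitCounts(3, 3, 3));
}
```

Something like dmd -unittest -main -run on the file should exercise the checks.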
Surprisingly, the compiler allows this kind of code (a char converts
implicitly to dchar), but that's not the point. The point is, this code
will never work for anything beyond ASCII, because char[] is not a
very good way to hold a Unicode string.
Of course there are ways around this, but they are still just "workarounds".
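One such workaround, sketched under some assumptions: asking foreach for a dchar loop variable makes the compiler decode the UTF-8 sequence on the fly, so the letter test sees whole code points. The sketch assumes a current D2 Phobos, where (as far as I know) isUniAlpha has been replaced by std.uni.isAlpha; countWords is a hypothetical helper, and "word" here just means a maximal run of letters.

```d
import std.uni : isAlpha;

/// Counts maximal runs of Unicode letters ("words") in UTF-8 text.
size_t countWords(const(char)[] text)
{
    size_t words = 0;
    bool inWord = false;
    // A dchar loop variable makes foreach decode each UTF-8 sequence,
    // so c is a whole code point rather than a single code unit.
    foreach (dchar c; text)
    {
        immutable letter = isAlpha(c);
        if (letter && !inWord)
            ++words;
        inWord = letter;
    }
    return words;
}

unittest
{
    assert(countWords("hello world") == 2);
    // Five Arabic letters followed by an ASCII word.
    assert(countWords("\u0645\u0631\u062D\u0628\u0627 world") == 2);
}
```

The text stays in char[] the whole time; only the loop variable is widened, which is why it still feels like a workaround rather than a real string type.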
Should you choose wchar[] (or dchar[]) to represent strings, you will
get into all kinds of trouble dealing with Phobos. The standard library
always deals with strings using char[]; this includes std.string and
std.regexp, and even the Exception class. So, if you're using wchar[] to
represent strings, and you want to throw an exception, you can't just say:
# throw new Exception( myString );
because the compiler will complain (can't cast wchar[] to char[]), so
you'll need toUtf8( myString ), and your code can quickly become full
of calls to toUtf* functions.
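For illustration, this is roughly what the explicit conversion looks like. It is a sketch that assumes a current D2 Phobos, where the function is spelled std.utf.toUTF8; fail is a hypothetical helper, not anything from the library.

```d
import std.utf : toUTF8;

/// Throws an Exception whose message starts out as UTF-16 text.
void fail(const(wchar)[] message)
{
    // Exception keeps its message as UTF-8, so transcode by hand.
    throw new Exception(toUTF8(message));
}

unittest
{
    try
    {
        fail("\u0645\u0644\u0641 missing"w);
        assert(false, "fail() should have thrown");
    }
    catch (Exception e)
    {
        assert(e.msg == "\u0645\u0644\u0641 missing");
    }
}
```

Every boundary between wchar[] code and the library needs a call like this one.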
Personally, I think D needs a proper String class built into the
language and the standard library.
Or at least, casting between the different encodings should be seamless
to the coder; just let the compiler call the appropriate toUtf* function
and allow implicit casting.