toString issue

Hasan Aljudy hasan.aljudy at gmail.com
Sun Oct 1 23:52:44 PDT 2006



Sean Kelly wrote:
> 
> How about toUtf8() for classes and structs :-)
> 
> 
> Sean

I think there's a fundamental problem with the way D deals with strings.
The spec claims that D natively supports strings through char[] and, at
the same time, claims that D fully supports Unicode.
The fundamental issue is that UTF-8 is only one encoding for Unicode
strings, and it's not always the best choice. Phobos mostly deals only
with char[], and mixing code that uses wchar[] with code that uses
char[] isn't very straightforward.

Consider the simple case of reading a text file and detecting "words".
To detect a word, you must first recognize letters: not just English
letters, but letters of any language, and for that purpose we have the
isUniAlpha function. Now, if you encode the string as char[], how are
you going to determine whether or not the next character is a Unicode
alpha?

The following definitely shouldn't work:
//assuming text is char[]
for( int i = 0; i < text.length; i++ )
{
     bool isLetter = isUniAlpha( text[i] );
     ....
}

because isUniAlpha takes a dchar parameter and, of course, because a
single char doesn't necessarily encode a Unicode character by itself;
if you're dealing with non-English text, then most likely a single char
will hold only part of the encoding for that letter.
Surprisingly, the compiler allows this kind of code, but that's not the
point. The point is that this code will never work for such text,
because char[] is not a very good way to hold a Unicode string.
Of course there are ways around this, but they are still just "workarounds".
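
One such workaround, as a minimal sketch: ask foreach for a dchar loop
variable, so that the char[] is decoded into whole code points before
each one is tested. This assumes a current Phobos, where the isUniAlpha
check is available as std.uni.isAlpha; countLetters is a made-up name
used only for illustration.

```d
import std.uni : isAlpha;

/// Counts the letters in UTF-8 text, one decoded code point at a time.
/// (Hypothetical helper, not part of Phobos.)
size_t countLetters(const(char)[] text)
{
    size_t letters = 0;
    // Declaring the loop variable as dchar makes foreach decode the
    // UTF-8 sequence, so c is a whole code point, not a single byte.
    foreach (dchar c; text)
    {
        if (isAlpha(c))
            ++letters;
    }
    return letters;
}
```

The text stays in a char[] the whole time; only the loop sees dchars, so
an index-based loop over text[i] is what has to be given up.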

Should you choose wchar[] (or dchar[]) to represent strings, you will
get into all kinds of trouble dealing with Phobos. The standard library
always deals with strings using char[]; this includes std.string and
std.regexp, and even the Exception class. So, if you're using wchar[] to
represent strings and you want to throw an exception, you can't just say:
# throw new Exception( myString );
because the compiler will complain (can't cast wchar[] to char[]), so
you'll need toUtf8( myString ), and your code can quickly become full
of calls to toUtf* functions.
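
For reference, a small sketch of what that explicit conversion looks
like. It assumes the std.utf module, which spells the function toUTF8,
and a current Phobos where it accepts a wchar array directly; fail is a
made-up name used only for illustration.

```d
import std.utf : toUTF8;

/// Throws an Exception whose message starts out as UTF-16 text.
/// (Hypothetical helper, not part of Phobos.)
void fail(const(wchar)[] message)
{
    // Exception only accepts UTF-8 text, so the UTF-16 message has to
    // be converted explicitly before it can be passed in.
    throw new Exception(toUTF8(message));
}
```

Every call site that hands wchar[] text to a char[]-only API needs the
same kind of wrapper or an inline toUTF8 call.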

Personally, I think D needs a proper String class built into the
language and the standard library.

Or, at least, casting between the different encodings should be seamless
to the coder; just let the compiler call the appropriate toUtf* function
and allow implicit casting.


