toString vs. toUtf8

Sean Kelly sean at f4.ca
Mon Nov 19 13:06:15 PST 2007


I was looking at converting Tango's use of toUtf8 to toString today and 
ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
member function for returning char strings is consistent with all use of 
string operations in Tango.  Routines that return wchar strings are 
named toUtf16 whether they are members of the String class or whether 
they are intended to perform UTF conversions, and so on.  Thus, the 
convention is consistent and pervasive.

What I discovered during a test conversion of Tango was that converting 
all uses of toUtf8 to toString /except/ those intended to perform UTF 
conversions reduced code clarity, and left me unsure as to which name I 
would actually use in a given situation.  For example, there is quite a 
bit of code in the text and io packages which converts an arbitrary type 
to a char[] for output, etc.  So by making this change I was left with 
some conversions using toString and others using toUtf8, toUtf16, and 
toUtf32, not to mention the fromXxx versions of these same functions. 
As this is template code, the choice between toString and toUtf8 in a 
given situation was unclear.  Given this, I decided to look to Phobos 
for a model to follow.

What I found in Phobos was that it suffers from the same problem I 
found in Tango during my test conversion.  Routines that convert any 
type but a string to a char[] are named toString, while the string 
equivalent is named toUTF8.  Given this, I surmised that the naming 
convention in D is that all strings are assumed to be Unicode, except 
when they're not.  String literals are required to be Unicode, foreach 
assumes strings to be UTF encoded when performing its automatic 
conversions, and all of the toString functions in std.string assume 
UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?
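
To make the split concrete, here is a minimal sketch of the two naming 
styles side by side.  It is written against present-day D2 Phobos, which 
is an assumption on my part: there, std.conv.to!string stands in for the 
std.string.toString overloads discussed above.  The helper 
countCodePoints is a hypothetical name used only for illustration.

```d
import std.conv : to;
import std.utf : toUTF8;

// Hypothetical helper: counts code points by relying on foreach's
// automatic UTF-8 decoding when the loop variable is a dchar.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // Non-string to char string: the name says nothing about encoding.
    // (Assumption: std.conv.to!string replaces std.string.toString here.)
    string fromInt = to!string(42);
    assert(fromInt == "42");

    // String to char string: the encoding is spelled out in the name.
    wstring wide = "caf\u00E9"w;
    string narrow = toUTF8(wide);
    assert(narrow.length == 5);            // U+00E9 takes two UTF-8 code units
    assert(countCodePoints(narrow) == 4);  // but it is a single code point
}
```

Both calls produce the same kind of value, a UTF-8 encoded char string, 
yet only one of the two names mentions the encoding.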

As near as I can tell, the reason for text conversion routines to be 
named differently is to simplify the use of routines which convert to 
another format.  std.windows.charset, for example, has a routine called 
toMBSz, to distinguish it from the toUTF8 routine.  What I find significant 
about this is that it suggests that while the transport mechanism for 
strings is the same in each case (both routines return a char[], i.e. a 
string), the underlying encoding is different.  Thus there seems a clear 
disconnect between the name of the transport mechanism (string) and the 
routines that generate them.  With this in mind, I begin to question the 
point of having toString as the common name for routines that generate 
char strings.  The encoding clearly matters in some instances and cannot 
be ignored, so ignoring it in others just seems to confuse things.

With this in mind, I will admit that I am questioning the merit of 
changing Tango's toUtf8 routines to be named toString.  Doing so seems 
to sacrifice both operational consistency and clarity in an attempt to 
maintain consistency with the name of the transport mechanism: string. 
And as I have said above, while strings in D are generally expected to 
be Unicode, they are clearly not always Unicode, as the existence of 
std.windows.charset can attest.  So I am left wondering whether someone 
can explain why toString is the preferred name for string-producing 
routines in D?  I feel it is very important to establish a consistent 
naming mechanism for D, and as Phobos seems to be the model in this case 
I may well have no choice in the matter of toUtf8 vs. toString.  But I 
would feel much better about the change if someone could provide a sound 
reason for doing so, since my first attempt at a conversion has left me 
somewhat worried about its long-term effect on code clarity.

As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
be named toString, toWString, and toDString, respectively, and Unicode 
should be assumed as the standard encoding format in D.
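
As a rough sketch of what that alternative would look like from the 
caller's side, the proposed names can be tried out as thin aliases over 
the existing std.utf routines.  These aliases are hypothetical and exist 
only for illustration; they are not part of Phobos or Tango, and the 
sketch assumes present-day D2 Phobos.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical aliases for the proposed naming; not Phobos or Tango API.
alias toString  = toUTF8;   // produces char strings  (UTF-8)
alias toWString = toUTF16;  // produces wchar strings (UTF-16)
alias toDString = toUTF32;  // produces dchar strings (UTF-32)

void main()
{
    dstring source = "na\u00EFve"d;

    string  s = toString(source);
    wstring w = toWString(source);
    dstring d = toDString(s);

    assert(s.length == 6);  // U+00EF takes two UTF-8 code units
    assert(w.length == 5);  // one UTF-16 code unit per character here
    assert(d == source);    // round trip preserves the text
}
```

Under this scheme the function name tracks the result type, and the 
encoding follows from the assumption that every D string is Unicode.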


Sean


