toString vs. toUtf8
Sean Kelly
sean at f4.ca
Mon Nov 19 13:06:15 PST 2007
I was looking at converting Tango's use of toUtf8 to toString today and
ran into a bit of a quandary. Currently, Tango's use of toUtf8 as the
member function for returning char strings is consistent with all use of
string operations in Tango. Routines that return wchar strings are
named toUtf16 whether they are members of the String class or whether
they are intended to perform UTF conversions, and so on. Thus, the
convention is consistent and pervasive.
What I discovered during a test conversion of Tango was that converting
all uses of toUtf8 to toString /except/ those intended to perform UTF
conversions reduced code clarity, and left me unsure as to which name I
would actually use in a given situation. For example, there is quite a
bit of code in the text and io packages which converts an arbitrary type
to a char[] for output, etc. So by making this change I was left with
some conversions using toString and others using toUtf8, toUtf16, and
toUtf32, not to mention the fromXxx versions of these same functions.
As this is template code, the choice between toString and toUtf8 in a
given situation was unclear. Given this, I decided to look to Phobos
for a model to follow.
What I found in Phobos was that it suffers from the same problem I found
in Tango during my test conversion. Routines that convert any type but a
string to a char[] are named toString, while the string equivalent is
named toUTF8. Given this, I surmised that the naming
convention in D is that all strings are assumed to be Unicode, except
when they're not. String literals are required to be Unicode, foreach
assumes strings to be UTF encoded when performing its automatic
conversions, and all of the toString functions in std.string assume
UTF-8 as the output format. So why bother with the name toUTF8 in std.utf?
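
To make the split concrete, here is a minimal sketch of the two naming
styles side by side. It is written against present-day Phobos rather than
the 2007 library, so std.conv's to!string stands in for the toString
functions of std.string mentioned above; that substitution and the helper
countCodePoints are assumptions made for illustration only.

```d
import std.conv : to;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: counts code points by letting foreach decode UTF-8.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // foreach assumes s is UTF-8 and decodes it
        ++n;
    return n;
}

void main()
{
    // "Anything to text": no encoding in the name, yet the result is UTF-8.
    string fromInt = to!string(42);
    assert(fromInt == "42");

    // "Text to text": the encoding is spelled out in the name.
    wstring w = toUTF16("na\u00EFve");
    dstring d = toUTF32(w);
    string back = toUTF8(d);
    assert(back == "na\u00EFve");

    assert(back.length == 6);            // six UTF-8 code units
    assert(countCodePoints(back) == 5);  // five code points
}
```

Both kinds of routine hand back UTF-8 in a string, but only one of them
says so in its name.
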
As near as I can tell, the reason for text conversion routines to be
named differently is to simplify the use of routines which convert to
another format. std.windows.charset, for example, has a routine called
toMBSz, to distinguish from the toUTF8 routine. What I find significant
about this is that it suggests that while the transport mechanism for
strings is the same in each case (both routines return a char[], i.e. a
string), the underlying encoding is different. Thus there seems a clear
disconnect between the name of the transport mechanism (string), and
routines that generate them. With this in mind, I begin to question the
point of having toString as the common name for routines that generate
char strings. The encoding clearly matters in some instances and cannot
be ignored, so ignoring it in others just seems to confuse things.
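
The point about transport versus encoding can be sketched as follows.
Since std.windows.charset is Windows-only, the example uses a
hypothetical toLatin1 function as a stand-in for a routine like toMBSz;
it is not part of Phobos or Tango. The code assumes present-day D, where
std.utf.validate throws UTFException on malformed UTF-8.

```d
import std.exception : assertThrown, assertNotThrown;
import std.utf : validate, UTFException;

// Hypothetical stand-in for a routine like toMBSz: it returns a char[],
// but the contents are Latin-1 bytes, not UTF-8.
char[] toLatin1(dstring s)
{
    char[] result;
    foreach (dchar c; s)
    {
        assert(c < 0x100, "not representable in Latin-1");
        result ~= cast(char) c;   // one raw byte, not a UTF-8 sequence
    }
    return result;
}

void main()
{
    char[] latin1 = toLatin1("caf\u00E9"d);
    char[] utf8 = "caf\u00E9".dup;

    // Same type, different contents.
    assert(latin1.length == 4);   // e-acute is the single byte 0xE9
    assert(utf8.length == 5);     // e-acute is the two bytes 0xC3 0xA9

    // Only one of the two is actually valid UTF-8.
    assertNotThrown!UTFException(validate(utf8));
    assertThrown!UTFException(validate(latin1));
}
```

Nothing in the type distinguishes the two arrays; only the name of the
routine that produced each one records which encoding it holds.
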
With this in mind, I will admit that I am questioning the merit of
changing Tango's toUtf8 routines to be named toString. Doing so seems
to sacrifice both operational consistency and clarity in an attempt to
maintain consistency with the name of the transport mechanism: string.
And as I have said above, while strings in D are generally expected to
be Unicode, they are clearly not always Unicode, as the existence of
std.windows.charset can attest. So I am left wondering whether someone
can explain why toString is the preferred name for string-producing
routines in D? I feel it is very important to establish a consistent
naming mechanism for D, and as Phobos seems to be the model in this case
I may well have no choice in the matter of toUtf8 vs. toString. But I
would feel much better about the change if someone could provide a sound
reason for doing so, since my first attempt at a conversion has left me
somewhat worried about its long-term effect on code clarity.
As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32
be named toString, toWString, and toDString, respectively, and Unicode
should be assumed as the standard encoding format in D.
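
As a rough illustration of that alternative, the three names could be
layered over the existing std.utf routines. The wrappers below are
hypothetical and written against present-day Phobos; they only show how
the proposed names would read at a call site.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical wrappers: the name states the result type, and Unicode
// is assumed as the encoding throughout.
string  toString(S)(S s)  { return toUTF8(s); }
wstring toWString(S)(S s) { return toUTF16(s); }
dstring toDString(S)(S s) { return toUTF32(s); }

void main()
{
    wstring w = toWString("text");
    dstring d = toDString(w);
    string s = toString(d);
    assert(s == "text");
}
```
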
Sean