toString vs. toUtf8

Mon Nov 19 18:56:35 PST 2007

Sean Kelly wrote:
> I was looking at converting Tango's use of toUtf8 to toString today and 
> ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as the 
> member function for returning char strings is consistent with all use of 
> string operations in Tango.  Routines that return wchar strings are 
> named toUtf16 whether they are members of the String class or whether 
> they are intended to perform UTF conversions, and so on.  Thus, the 
> convention is consistent and pervasive.
> 
> What I discovered during a test conversion of Tango was that converting 
> all uses of toUtf8 to toString /except/ those intended to perform UTF 
> conversions reduced code clarity, and left me unsure as to which name I 
> would actually use in a given situation.  For example, there is quite a 
> bit of code in the text and io packages which convert an arbitrary type 
> to a char[] for output, etc.  So by making this change I was left with 
> some conversions using toString and others using toUtf8, toUtf16, and 
> toUtf32, not to mention the fromXxx versions of these same functions. As 
> this is template code, the choice between toString and toUtf8 in a given 
> situation was unclear.  Given this, I decided to look to Phobos for a 
> model to follow.
> 
> What I found in Phobos was that it suffers from the same situation as I 
> found Tango in during my test conversion.  Routines that convert any 
> type but a string to a char[] are named toString, while the string 
> equivalent is named toUTF8.  Given this, I surmised that the naming 
> convention in D is that all strings are assumed to be Unicode, except 
> when they're not.  String literals are required to be Unicode, foreach 
> assumes strings to be UTF encoded when performing its automatic 
> conversions, and all of the toString functions in std.string assume 
> UTF-8 as the output format.  So why bother with the name toUTF8 in std.utf?
> 
> As near as I can tell, the reason for text conversion routines to be 
> named differently is to simplify the use of routines which convert to 
> another format.  std.windows.charset, for example, has a routine called 
> toMBSz, to distinguish from the toUTF8 routine.  What I find significant 
> about this is that it suggests that while the transport mechanism for 
> strings is the same in each case (both routines return a char[], i.e. a 
> string), 

Does that even work?  I would think there are some valid MBSz's that are 
invalid UTF sequences, and so toMBSz would have to return byte[].

> the underlying encoding is different.  Thus there seems a clear 
> disconnect between the name of the transport mechanism (string), and 
> routines that generate them.  With this in mind, I begin to question the 
> point of having toString as the common name for routines that generate 
> char strings.  The encoding clearly matters in some instances and cannot 
> be ignored, so ignoring it in others just seems to confuse things.

As far as I'm concerned UTF-8 is *the* encoding for text in D.  Anything 
else is only for some special purpose like ease of manipulation (dstring 
for I18N text that needs fast searching / slicing) or interchange with 
external APIs (UTF-16 for working with Windows).
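
To illustrate the trade-off, here is a rough sketch. It assumes a 
present-day Phobos (std.utf with toUTF8, toUTF16 and toUTF32 operating on 
string, wstring and dstring), which is not necessarily what the library 
looked like at the time of this thread. The sample text is made up.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    // UTF-8 is the default: U+00E9 occupies two code units here.
    string  s = "h\u00E9llo";
    // UTF-16, e.g. for handing to Windows "W" APIs.
    wstring w = toUTF16(s);
    // UTF-32: one code unit per code point, so slicing by index is safe.
    dstring d = toUTF32(s);

    assert(s.length == 6);      // code units, not characters
    assert(w.length == 5);
    assert(d.length == 5);
    assert(d[1] == '\u00E9');   // direct indexing by code point
    assert(toUTF8(d) == s);     // round-trips back to UTF-8
}
```

Indexing s[1] would give only the first byte of the two-byte sequence, 
which is the reason for converting to dstring before slicing.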

> With this in mind, I will admit that I am questioning the merit of 
> changing Tango's toUtf8 routines to be named toString.  Doing so seems 
> to sacrifice both operational consistency and clarity in an attempt to 
> maintain consistency with the name of the transport mechanism: string. 
> And as I have said above, while strings in D are generally expected to 
> be Unicode, they are clearly not always Unicode, as the existence of 
> std.windows.charset can attest.  

I really think toMBSz should be returning byte[] and fromMBSz should be 
taking a byte*.  The doc for types says char is unsigned 8 bit UTF-8. 
Period.  And you get errors from the compiler if you try to initialize a 
string with something that's not valid UTF-8.  So MBSz data has no 
business parading around dressed up as char[].
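
A small sketch of the problem, again assuming a present-day Phobos where 
std.utf.validate throws UTFException on malformed input. The helper name 
isValidUtf8 and the sample bytes are hypothetical; the bytes stand in for 
what a code-page conversion might hand back.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s holds well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "café" in a single-byte code page: 0xE9 alone is not valid UTF-8.
    immutable(ubyte)[] legacy = [0x63, 0x61, 0x66, 0xE9];

    // The cast compiles, but the result is not really a string.
    assert(!isValidUtf8(cast(string) legacy));

    // The same text as genuine UTF-8 passes.
    assert(isValidUtf8("caf\u00E9"));
}
```

Keeping such data typed as ubyte[] (or byte[]) until it has been 
converted would make the cast, and therefore the risk, visible.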

> So I am left wondering whether someone 
> can explain why toString is the preferred name for string-producing 
> routines in D?  I feel it is very important to establish a consistent 
> naming mechanism for D, and as Phobos seems to be the model in this case 
> I may well have no choice in the matter of toUtf8 vs. toString.  But I 
> would feel much better about the change if someone could provide a sound 
> reason for doing so, since my first attempt at a conversion has left me 
> somewhat worried about its long-term effect on code clarity.
> 
> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 
> be named toString, toWString, and toDString, respectively, and Unicode 
> should be assumed as the standard encoding format in D.

Since the Tango convention is to treat acronyms as single words (the 
actual Tango UTF methods are called toUtf8, toUtf16, and toUtf32), it seems 
there's an argument for treating wstring and dstring as single entities 
too.  So then it would be:
     toString, toWstring, toDstring

Don't know if that hurts your eyes less or not, but it seems more 
consistent with Tango's existing naming convention to me than toWString, 
etc.
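
For what it's worth, a type following that naming might look like the 
sketch below. Label is a made-up struct, not anything in Tango or Phobos, 
and the conversions assume a present-day std.utf.

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical type, used only to show the three method names together.
struct Label
{
    string text;

    string  toString()  const { return text; }           // UTF-8
    wstring toWstring() const { return toUTF16(text); }  // UTF-16
    dstring toDstring() const { return toUTF32(text); }  // UTF-32
}

void main()
{
    auto l = Label("h\u00E9llo");
    assert(l.toString().length == 6);
    assert(l.toWstring().length == 5);
    assert(l.toDstring().length == 5);
}
```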

--bb