toString vs. toUtf8

Mon Nov 19 20:15:32 PST 2007

Bill Baxter wrote:
> Sean Kelly wrote:
>> I was looking at converting Tango's use of toUtf8 to toString today 
>> and ran into a bit of a quandary.  Currently, Tango's use of toUtf8 as
>> the member function for returning char strings is consistent with all
>> use of string operations in Tango.  Routines that return wchar strings 
>> are named toUtf16 whether they are members of the String class or 
>> whether they are intended to perform UTF conversions, and so on.  
>> Thus, the convention is consistent and pervasive.
>>
>> What I discovered during a test conversion of Tango was that 
>> converting all uses of toUtf8 to toString /except/ those intended to 
>> perform UTF conversions reduced code clarity, and left me unsure as to
>> which name I would actually use in a given situation.  For example, 
>> there is quite a bit of code in the text and io packages which converts
>> an arbitrary type to a char[] for output, etc.  So by making this 
>> change I was left with some conversions using toString and others 
>> using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx 
>> versions of these same functions. As this is template code, the choice 
>> between toString and toUtf8 in a given situation was unclear.  Given 
>> this, I decided to look to Phobos for a model to follow.
>>
>> What I found in Phobos was that it suffers from the same situation as 
>> I found Tango in during my test conversion.  Routines that convert any 
>> type but a string to a char[] are named toString, while the string
>> equivalent is named toUTF8.  Given this, I surmised that the naming 
>> convention in D is that all strings are assumed to be Unicode, except 
>> when they're not.  String literals are required to be Unicode, foreach 
>> assumes strings to be UTF encoded when performing its automatic 
>> conversions, and all of the toString functions in std.string assume 
>> UTF-8 as the output format.  So why bother with the name toUTF8 in
>> std.utf?
>>
>> As near as I can tell, the reason for text conversion routines to be 
>> named differently is to simplify the use of routines which convert to
>> another format.  std.windows.charset, for example, has a routine 
>> called toMBSz, to distinguish from the toUTF8 routine.  What I find 
>> significant about this is that it suggests that while the transport 
>> mechanism for strings is the same in each case (both routines return a 
>> char[], i.e. a string),
> 
> Does that even work?  I would think there are some valid MBSz's that are 
> invalid UTF sequences, and so toMBSz would have to return byte[].

It works because D performs no run-time verification that what's in a 
char[] is actually Unicode.  You could dump binary data in a string if 
you really wanted to.
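
To illustrate the point about the missing run-time check, here is a minimal
sketch. It assumes current D (D2) and Phobos' std.utf.validate; the D1
compilers of 2007 differed in some details. The helper name isValidUtf8 is
hypothetical and only there for the example.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xE9 is 'é' in Latin-1, but on its own it is an incomplete
    // UTF-8 sequence.
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];

    // The cast only reinterprets the bytes; nothing checks the encoding.
    char[] s = cast(char[]) raw;

    assert(s.length == 4);
    assert(!isValidUtf8(s));      // the data is not valid UTF-8
    assert(isValidUtf8("café"));  // a proper UTF-8 literal passes
}
```

The char[] above carries non-UTF-8 bytes without complaint; the problem only
surfaces when something explicitly validates or decodes the data.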

> I really think toMBSz should be returning byte[] and fromMBSz should be 
> taking a byte*.  The doc for types says char is unsigned 8 bit UTF-8. 
> Period.  And you get errors from the compiler if you try to initialize a 
> string with something that's not valid UTF-8.  So MBSz data has no 
> business parading around dressed up as char[].

I think you're right about toMBSz.

> Since the Tango convention is to treat acronyms as single words (the
> actual Tango UTF methods are called toUtf8, toUtf16, and toUtf32), it seems
> there's an argument for treating wstring and dstring as single entities
> too.  So then it would be:
>     toString, toWstring, toDstring
> 
> Don't know if that hurts your eyes less or not, but it seems more 
> consistent with Tango's existing naming convention to me than toWString, 
> etc.

Yeah, I was thinking the same thing.  It's certainly easier for me to
read than the other form.
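
As a rough sketch of how the toString / toWstring / toDstring naming might
look on a type, assuming current D (D2): the class Text is hypothetical and
not part of Tango or Phobos, and the conversions use Phobos' std.utf.toUTF16
and toUTF32 rather than Tango's own routines.

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical wrapper type, used only to show the three method names.
class Text
{
    private string data;

    this(string s) { data = s; }

    // char-based (UTF-8) text
    override string toString() { return data; }

    // wchar-based (UTF-16) text
    wstring toWstring() { return toUTF16(data); }

    // dchar-based (UTF-32) text
    dstring toDstring() { return toUTF32(data); }
}

void main()
{
    auto t = new Text("naïve");
    assert(t.toString() == "naïve");
    assert(t.toWstring() == "naïve"w);
    assert(t.toDstring() == "naïve"d);
}
```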

Sean