toString vs. toUtf8
Sean Kelly
sean at f4.ca
Mon Nov 19 20:15:32 PST 2007
Bill Baxter wrote:
> Sean Kelly wrote:
>> I was looking at converting Tango's use of toUtf8 to toString today
>> and ran into a bit of a quandary. Currently, Tango's use of toUtf8 as
>> the member function for returning char strings is consistent with all
>> use of string operations in Tango. Routines that return wchar strings
>> are named toUtf16 whether they are members of the String class or
>> whether they are intended to perform UTF conversions, and so on.
>> Thus, the convention is consistent and pervasive.
>>
>> What I discovered during a test conversion of Tango was that
>> converting all uses of toUtf8 to toString /except/ those intended to
>> perform UTF conversions reduced code clarity, and left me unsure as to
>> which name I would actually use in a given situation. For example,
>> there is quite a bit of code in the text and io packages which converts
>> an arbitrary type to a char[] for output, etc. So by making this
>> change I was left with some conversions using toString and others
>> using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx
>> versions of these same functions. As this is template code, the choice
>> between toString and toUtf8 in a given situation was unclear. Given
>> this, I decided to look to Phobos for a model to follow.
>>
>> What I found in Phobos was that it suffers from the same situation as
>> I found Tango in during my test conversion. Routines that convert any
>> type but a string to a char[] are named toString, while the string
>> equivalent is named toUTF8. Given this, I surmised that the naming
>> convention in D is that all strings are assumed to be Unicode, except
>> when they're not. String literals are required to be Unicode, foreach
>> assumes strings to be UTF encoded when performing its automatic
>> conversions, and all of the toString functions in std.string assume
>> UTF-8 as the output format. So why bother with the name toUTF8 in
>> std.utf?
>>
>> As near as I can tell, the reason for text conversion routines to be
>> named differently is to simplify the use of routines which convert to
>> another format. std.windows.charset, for example, has a routine
>> called toMBSz, to distinguish from the toUTF8 routine. What I find
>> significant about this is that it suggests that while the transport
>> mechanism for strings is the same in each case (both routines return a
>> char[], i.e. a string),
>
> Does that even work? I would think there are some valid MBSz's that are
> invalid UTF sequences, and so toMBSz would have to return byte[].
It works because D performs no run-time verification that what's in a
char[] is actually Unicode. You could dump binary data in a string if
you really wanted to.
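
To make that concrete, here is a minimal sketch of the point, assuming a present-day D2 compiler with Phobos rather than the 2007 toolchain. The helper name isValidUtf8 is hypothetical; std.utf.validate is the Phobos routine that performs the explicit check. The cast itself compiles and runs without complaint, and only the explicit validation notices that the bytes are not UTF-8.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true when `s` holds well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Arbitrary bytes reinterpreted as char[]; nothing is checked at the cast.
    ubyte[] raw = [0xFF, 0xFE, 0x80];
    char[] s = cast(char[]) raw;

    assert(s.length == 3);         // the bytes are stored as-is
    assert(!isValidUtf8(s));       // only an explicit check notices
    assert(isValidUtf8("héllo"));  // a string literal is valid UTF-8
}
```

This is also why returning multibyte data as char[] "works": the type carries a UTF-8 promise that nothing enforces at run time.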
> I really think toMBSz should be returning byte[] and fromMBSz should be
> taking a byte*. The doc for types says char is unsigned 8 bit UTF-8.
> Period. And you get errors from the compiler if you try to initialize a
> string with something that's not valid UTF-8. So MBSz data has no
> business parading around dressed up as char[].
I think you're right about toMBSz.
> Since the Tango convention is to treat acronyms as single words (the
> actual Tango UTF methods are called toUtf8, toUtf16 and toUtf32), it seems
> there's an argument for treating wstring and dstring as single entities
> too. So then it would be:
> toString, toWstring, toDstring
>
> Don't know if that hurts your eyes less or not, but it seems more
> consistent with Tango's existing naming convention to me than toWString,
> etc.
Yeah I was thinking the same thing. It's certainly easier for me to
read than the other form.
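
For illustration only, the toString / toWstring / toDstring naming could look roughly like the sketch below. It assumes present-day D2 and Phobos; the Text class is hypothetical and is not Tango's String class. The conversions use std.utf.toUTF16 and std.utf.toUTF32.

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical wrapper type used only to show the naming; not Tango code.
class Text
{
    private string data;

    this(string s) { data = s; }

    override string toString() { return data; }    // UTF-8 code units
    wstring toWstring() { return toUTF16(data); }  // UTF-16 code units
    dstring toDstring() { return toUTF32(data); }  // UTF-32 code units
}

void main()
{
    auto t = new Text("héllo");
    assert(t.toString() == "héllo");
    assert(t.toWstring() == "héllo"w);
    assert(t.toDstring() == "héllo"d);
    assert(t.toString().length == 6);   // 'é' takes two UTF-8 code units
    assert(t.toDstring().length == 5);  // one UTF-32 code unit per character
}
```

The method names follow the string type returned, while the UTF-specific names stay reserved for the conversion routines themselves.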
Sean