toString vs. toUtf8

Tue Nov 20 10:28:57 PST 2007

Oskar Linde wrote:
> Sean Kelly wrote:
>> Christopher Wright wrote:
>>> toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
>>> well with other conventions.
>>
>> I tend to place a tremendous amount of value on consistency, because 
>> the more consistent an API is, the more likely my guesses about it are 
>> to be correct.  In my opinion, that precludes using the option you 
>> suggest.
> 
> IMHO, the consistent alternative is pretty clear:
> 
> char -> string -> toString
> wchar -> wstring -> toWString
> dchar -> dstring -> toDString
> 
> The only problem seems to lie in the aesthetics of the camelCase 
> convention, but doesn't consistency trump aesthetics?

It depends :-)  I prefer the suggested toStringW and toStringD 
convention.  While it doesn't exactly match the returned type name in 
letter order, the same information is communicated and is done in what I 
feel is a more readable format.  Also, if the words were placed in a 
larger list and then sorted, they would end up adjacent to one another.
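
The sorting point can be checked with a small sketch. This is Python rather than D, and the extra names (toFloat, toInt, toLower, toUpper) are made up purely to pad out the list; only the toString variants come from this thread.

```python
# Two candidate naming schemes from the thread.
names_type_order = ["toString", "toWString", "toDString"]
names_suffix = ["toString", "toStringW", "toStringD"]

# Hypothetical neighbours, invented only to make the sorted list longer.
others = ["toFloat", "toInt", "toLower", "toUpper"]


def positions(group, extra):
    """Return the indices of each name in `group` within the sorted full list."""
    ordered = sorted(group + extra)
    return [ordered.index(name) for name in sorted(group)]


def is_adjacent(group, extra):
    """True if all names in `group` sit next to each other after sorting."""
    pos = positions(group, extra)
    return pos == list(range(pos[0], pos[0] + len(pos)))


if __name__ == "__main__":
    print(sorted(names_type_order + others))
    print(sorted(names_suffix + others))
    print(is_adjacent(names_type_order, others), is_adjacent(names_suffix, others))
```

With a plain code-point sort, the suffix form keeps the three names together, while the toWString/toDString form scatters them among the other to* names.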

>> In my opinion, Walter's suggestion that alternate encodings not be 
>> stored in strings is sufficient reason to not bother with the encoding 
>> format in the function name (i.e. toUtf8/toUtf16/toUtf32).
> 
> I agree, but this is hardly a new suggestion. I think it has always been 
> pretty clear that one should never store anything but UTF-encoded data 
> in {,w,d}char[]s.

Yup.  But to me, this is different from a semi-official declaration to 
this effect.  With the latter, the suggestion is more likely to be 
enforceable.

> Also, I have always felt Tango's toUtf{8,16,32} are a 
> bit too explicitly named. Almost like using toSingleIEEE754 instead of 
> toFloat.

Fair enough :-)

>> I don't suppose there is anyone who does a lot of internationalization 
>> programming who can comment on the utility of one convention vs. the 
>> other?  I would love to hear some more practical concerns regarding 
>> the naming convention for these functions.
> 
> I have done quite a bit of text processing and handling of different 
> encodings in D and while naming doesn't matter much as long as it is 
> consistent, what I do is:
> 
> * use {,w,d}char strictly for UTF data (I have sometimes cheated here, 
> mainly to be able to use certain std.string functions, but with a good 
> templated string/array library (such as in Tango), that is not necessary)
> 
> * use Unicode internally as much as possible, transcoding as early and 
> late as possible.
> 
> * when there is a reason not to use UTF internally, use typedefs like 
> "typedef char lat1", and keep unknown encodings as ubyte[]s.
> 
> Knowing that {,w,d}chars always contain UTF has never been a problem. 
> The problems that do arise come from mistakenly using char rather than 
> {,u}byte in C APIs, and from D's horrible default behavior of crashing 
> instead of recovering from UTF errors.

Darnit, I forgot about the C APIs.  I'll have to replace their use of 
char with char_t or c_char (the latter matches c_long but the former 
matches wchar_t).
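
For anyone following along, the "transcode as early and as late as possible" practice described above might look roughly like this. It is only an analogy in Python, not D: bytes stands in for ubyte[] (data of unknown or foreign encoding) and str stands in for the UTF string types. The helper names read_text and write_text are invented for the illustration.

```python
def read_text(raw: bytes, encoding: str) -> str:
    """Input boundary: convert foreign-encoded bytes to Unicode text at once."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str) -> bytes:
    """Output boundary: encode only when the data leaves the program."""
    return text.encode(encoding)


# Raw Latin-1 input stays as bytes until it has been decoded.
latin1_input = b"caf\xe9"

# Everything between the two boundaries works on Unicode text only.
text = read_text(latin1_input, "latin-1")
shouted = text.upper()

# Encode to the target encoding as late as possible.
utf8_output = write_text(shouted, "utf-8")

if __name__ == "__main__":
    print(repr(text), repr(shouted), repr(utf8_output))
```

Keeping the undecoded data in a separate type means it cannot be passed by accident to code that expects UTF, which is the same job the typedef and ubyte[] suggestions do in D.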

> A much better default behavior would be to simply substitute illegal 
> UTF-units with a '?' and keep going. Having to remember to sanitize all 
> untrusted Unicode strings is a chore, and forgetting that at any point 
> will lead to crashes in running code at inconvenient moments.
> 

This is useful information.  Thanks.
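
To make the two behaviors concrete, here is a minimal sketch of "substitute and keep going" next to strict decoding. It is Python, not D, and says nothing about how Phobos or Tango implement this; the handler name "question_mark" and the function decode_lossy are invented for the example.

```python
import codecs


def _question_mark(err):
    """Error handler: emit '?' for the bad bytes and resume after them."""
    if isinstance(err, UnicodeDecodeError):
        return ("?", err.end)
    raise err


# Register the handler under a made-up name so decode() can refer to it.
codecs.register_error("question_mark", _question_mark)


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, replacing invalid input with '?' instead of raising."""
    return raw.decode("utf-8", errors="question_mark")


def decode_strict(raw: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on the first invalid byte."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    untrusted = b"abc\xffdef"  # 0xFF can never appear in valid UTF-8
    print(decode_lossy(untrusted))
    try:
        decode_strict(untrusted)
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc.reason)
```

The strict variant forces every caller to remember the try/except, which is the chore described above; the lossy variant trades that for silently altered data, so which default is better depends on whether the caller cares more about availability or fidelity.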

Sean