toString vs. toUtf8

Tue Nov 20 02:25:25 PST 2007

Sean Kelly wrote:
> Christopher Wright wrote:
>> toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits 
>> well with other conventions.
> 
> I tend to place a tremendous amount of value on consistency, because the 
> more consistent an API is, the more likely my guesses about it are to be 
> correct.  In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char -> string -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase 
convention, but doesn't consistency trump aesthetics?
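
As a rough sketch of what that scheme could look like in practice: the 
three wrappers below are hypothetical and exist only to show the names 
side by side. The only existing library call used is std.conv.to from 
Phobos, which is assumed here to do the actual transcoding.

```d
import std.conv : to;

// Hypothetical wrappers illustrating the naming scheme; only
// std.conv.to is an existing Phobos function.
// char  -> string  -> toString
string  toString(T)(T value)  { return to!string(value); }
// wchar -> wstring -> toWString
wstring toWString(T)(T value) { return to!wstring(value); }
// dchar -> dstring -> toDString
dstring toDString(T)(T value) { return to!dstring(value); }
```

Each function name is derived mechanically from the name of its result 
type, so knowing one of them is enough to guess the other two.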

> In my opinion, Walter's suggestion that alternate encodings not be 
> stored in strings is sufficient reason to not bother with the encoding 
> format in the function name (ie. toUtf8/toUtf16/toUtf32). 

I agree, but this is hardly a new suggestion. I think it has always been 
pretty clear that one should never store anything but UTF-encoded data 
in {,w,d}char[]s. Also, I have always felt that Tango's toUtf{8,16,32} 
are a bit too explicitly named, almost like using toSingleIEEE754 
instead of toFloat.

> I don't suppose there is anyone who does a lot of internationalization 
> programming who can comment on the utility of one convention vs. the 
> other?  I would love to hear some more practical concerns regarding the 
> naming convention for these functions.

I have done quite a bit of text processing and handling of different 
encodings in D, and while naming doesn't matter much as long as it is 
consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, 
mainly to be able to use certain std.string functions, but with a good 
templated string/array library (such as in Tango), that is not necessary)

* use Unicode internally as much as possible, transcoding as early as 
possible on input and as late as possible on output.

* when there is a reason not to use UTF internally, use typedefs like 
"typedef char lat1", and keep unknown encodings as ubyte[]s.
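
A minimal sketch of the last two points, under some assumptions: the 
one-field struct Lat1 stands in for the typedef (current D compilers no 
longer accept the typedef keyword), and lat1ToString is a hypothetical 
helper name. Only std.utf.encode is an existing Phobos function.

```d
import std.utf : encode;

// Stand-in for the "typedef char lat1" idea. A one-field struct is an
// assumption of this sketch; it gives Latin-1 data a type that cannot
// be passed where UTF-8 char data is expected.
struct Lat1
{
    ubyte value;
}

// Hypothetical helper: transcode Latin-1 to UTF-8. Every Latin-1 byte
// is numerically equal to its Unicode code point (U+0000 to U+00FF).
string lat1ToString(const(Lat1)[] text)
{
    char[] result;
    char[4] buf;
    foreach (unit; text)
    {
        // encode writes the UTF-8 form of one code point into buf
        // and returns how many code units it used.
        immutable len = encode(buf, cast(dchar) unit.value);
        result ~= buf[0 .. len];
    }
    return result.idup;
}
```

With a distinct type, handing Latin-1 data to a function that expects 
char[] becomes a compile error rather than a latent UTF error.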

Knowing that {,w,d}chars always contain UTF has never been a problem. 
The problems that do arise come instead from mistakenly using char 
rather than {,u}byte in C APIs, and from D's horrible default behavior 
of crashing instead of recovering from UTF errors.

A much better default behavior would be to simply substitute illegal 
UTF code units with a '?' and keep going. Having to remember to sanitize 
all untrusted Unicode strings is a chore, and forgetting to do so at any 
point will lead to crashes in running code at inconvenient moments.
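
To make the idea concrete, here is a sketch of such a substitution 
done by hand. sanitizeUtf8 is a hypothetical name, not a Phobos or 
Tango function, and the sketch assumes std.utf.decode throws 
UTFException on malformed input.

```d
import std.utf : decode, UTFException;

// Hypothetical helper, not part of Phobos or Tango: returns a copy of
// `input` in which every byte that cannot start a valid UTF-8 sequence
// is replaced by '?'.
string sanitizeUtf8(const(char)[] input)
{
    char[] result;
    size_t i = 0;
    while (i < input.length)
    {
        immutable start = i;
        try
        {
            decode(input, i);            // advances i past one code point
            result ~= input[start .. i]; // keep the valid sequence as-is
        }
        catch (UTFException e)
        {
            result ~= '?';               // substitute the offending byte
            i = start + 1;               // and resynchronize on the next one
        }
    }
    return result.idup;
}
```

Run once at the point where untrusted data enters the program, this 
means later code can iterate over the string without having to guard 
against UTF errors.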

-- 
Oskar