toString vs. toUtf8
Oskar Linde
oskar.lindeREM at OVEgmail.com
Tue Nov 20 02:25:25 PST 2007
Sean Kelly wrote:
> Christopher Wright wrote:
>> toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits
>> well with other conventions.
>
> I tend to place a tremendous amount of value on consistency, because the
> more consistent an API is, the more likely my guesses about it are to be
> correct. In my opinion, that precludes using the option you suggest.
IMHO, the consistent alternative is pretty clear:
char -> string -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString
The only problem seems to lie in the aesthetics of the camelCase
convention, but doesn't consistency trump aesthetics?
> In my opinion, Walter's suggestion that alternate encodings not be
> stored in strings is sufficient reason to not bother with the encoding
> format in the function name (ie. toUtf8/toUtf16/toUtf32).
I agree, but this is hardly a new suggestion. I think it has always been
pretty clear that one should never store anything but UTF-encoded data
in {,w,d}char[]s. Also, I have always felt that Tango's toUtf{8,16,32}
are a bit too explicitly named, almost like using toSingleIEEE754
instead of toFloat.
> I don't suppose there is anyone who does a lot of internationalization
> programming who can comment on the utility of one convention vs. the
> other? I would love to hear some more practical concerns regarding the
> naming convention for these functions.
I have done quite a bit of text processing and handling of different
encodings in D and while naming doesn't matter much as long as it is
consistent, what I do is:
* use {,w,d}char strictly for UTF data (I have sometimes cheated here,
mainly to be able to use certain std.string functions, but with a good
templated string/array library (such as in Tango), that is not necessary)
* use Unicode internally as much as possible, transcoding as early as
possible on input and as late as possible on output.
* when there is a reason not to use UTF internally, use typedefs like
"typedef char lat1", and keep unknown encodings as ubyte[]s.
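
The workflow in the list above can be sketched roughly as follows. This
is an illustration in Python rather than D, since the idea does not
depend on the language. The function names read_latin1, process, and
write_utf16 are hypothetical, and the choice of Latin-1 input and UTF-16
output is an assumption made only for the example.

```python
def read_latin1(raw: bytes) -> str:
    """Transcode to Unicode as early as possible, right at the input boundary."""
    return raw.decode("latin-1")


def process(text: str) -> str:
    """All internal work happens on Unicode text, never on raw bytes."""
    return text.upper()


def write_utf16(text: str) -> bytes:
    """Transcode as late as possible, only when the data leaves the program."""
    return text.encode("utf-16-le")


# Data in an unknown encoding stays as plain bytes (the analogue of ubyte[])
# until its encoding is known, so it is never mistaken for text.
unknown_blob: bytes = b"\xff\xfe\x00\x01"

raw_input = b"caf\xe9"  # "café" encoded as Latin-1 (assumed input encoding)
result = write_utf16(process(read_latin1(raw_input)))
```

Keeping the two transcoding steps at the edges means that everything in
between can assume a single internal representation.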
Knowing that {,w,d}chars always contain UTF has never been a problem.
The problems that do arise come instead from mistakenly using char
rather than {,u}byte in C APIs, and from D's horrible default behavior
of crashing instead of recovering from UTF errors.
A much better default behavior would be to simply substitute illegal
UTF code units with a '?' and keep going. Having to remember to sanitize
all untrusted Unicode strings is a chore, and forgetting to do so at any
point will lead to crashes in running code at inconvenient moments.
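
To make the difference concrete, here is a small sketch, again in Python
rather than D, contrasting a strict decoder that aborts on the first bad
byte with a lenient one that substitutes '?' and keeps going. The helper
name decode_lenient is hypothetical. Python's built-in "replace" handler
inserts U+FFFD rather than '?', so the sketch maps that character to '?'
to match the behavior proposed above.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode UTF-8, substituting '?' for each illegal sequence instead of failing."""
    # errors="replace" inserts U+FFFD for malformed input; map it to '?'.
    # Caveat: a genuine U+FFFD already present in the input also becomes '?'.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


untrusted = b"abc\xffdef"  # the byte 0xFF can never appear in valid UTF-8

# Strict decoding: analogous to the crash-on-error default described above.
try:
    untrusted.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding: the bad byte is substituted and processing continues.
recovered = decode_lenient(untrusted)
```

With the lenient variant as the default, only code that genuinely cares
about malformed input would need to opt in to strict checking.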
--
Oskar
More information about the Digitalmars-d
mailing list