toString vs. toUtf8
Sean Kelly
sean at f4.ca
Tue Nov 20 10:28:57 PST 2007
Oskar Linde wrote:
> Sean Kelly wrote:
>> Christopher Wright wrote:
>>> toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits
>>> well with other conventions.
>>
>> I tend to place a tremendous amount of value on consistency, because
>> the more consistent an API is, the more likely my guesses about it are
>> to be correct. In my opinion, that precludes using the option you
>> suggest.
>
> IMHO, the consistent alternative is pretty clear:
>
> char -> string -> toString
> wchar -> wstring -> toWString
> dchar -> dstring -> toDString
>
> The only problem seems to lie in the aesthetics of the camelCase
> convention, but doesn't consistency trump aesthetics?
It depends :-) I prefer the suggested toStringW and toStringD
convention. While it doesn't exactly match the returned type name in
letter order, the same information is communicated and is done in what I
feel is a more readable format. Also, if the words were placed in a
larger list and then sorted, they would end up adjacent to one another.
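
For concreteness, here is a minimal sketch of the toString/toStringW/toStringD naming. It assumes present-day Phobos (std.utf.toUTF8, toUTF16 and toUTF32), which postdates this thread. The three wrapper names are hypothetical illustrations of the proposal and are not part of Phobos or Tango.

```d
import std.utf : toUTF16, toUTF32, toUTF8;

// Hypothetical wrappers illustrating the proposed names. Each one
// accepts any UTF string type and returns the encoding its name implies.
string  toString(S)(S s)  { return toUTF8(s); }   // char  -> string
wstring toStringW(S)(S s) { return toUTF16(s); }  // wchar -> wstring
dstring toStringD(S)(S s) { return toUTF32(s); }  // dchar -> dstring

void main()
{
    dstring d = "naïve"d;        // five code points
    string  c = toString(d);
    wstring w = toStringW(c);

    assert(c.length == 6);       // 'ï' needs two UTF-8 code units
    assert(w.length == 5);       // one UTF-16 code unit per code point here
    assert(toStringD(w) == d);   // the round trip preserves the text
}
```

Sorted alphabetically, the three names come out as toString, toStringD, toStringW, so they stay adjacent in generated documentation or a symbol list.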
>> In my opinion, Walter's suggestion that alternate encodings not be
>> stored in strings is sufficient reason to not bother with the encoding
>> format in the function name (i.e. toUtf8/toUtf16/toUtf32).
>
> I agree, but this is hardly a new suggestion. I think it has always been
> pretty clear that one should never store anything but UTF-encoded data
> in {,w,d}char[]s.
Yup. But to me, this is different from a semi-official declaration to
this effect. With the latter, the suggestion is more likely to be
enforceable.
> Also, I have always felt Tango's toUtf{8,16,32} are a
> bit too explicitly named. Almost like using toSingleIEEE754 instead of
> toFloat.
Fair enough :-)
>> I don't suppose there is anyone who does a lot of internationalization
>> programming who can comment on the utility of one convention vs. the
>> other? I would love to hear some more practical concerns regarding
>> the naming convention for these functions.
>
> I have done quite a bit of text processing and handling of different
> encodings in D and while naming doesn't matter much as long as it is
> consistent, what I do is:
>
> * use {,w,d}char strictly for UTF data (I have sometimes cheated here,
> mainly to be able to use certain std.string functions, but with a good
> templated string/array library (such as in Tango), that is not necessary)
>
> * use Unicode internally as much as possible, transcoding as early as
> possible on input and as late as possible on output.
>
> * when there is a reason not to use UTF internally, use typedefs like
> "typedef char lat1", and keep unknown encodings as ubyte[]s.
>
> Knowing that {,w,d}chars always contain UTF has never been a problem.
> The problems that do arise come instead from mistakenly using char
> rather than {,u}byte in C APIs, and from D's horrible default behavior
> of crashing instead of recovering from UTF errors.
Darnit, I forgot about the C APIs. I'll have to replace their use of
char with char_t or c_char (the latter matches c_long but the former
matches wchar_t).
> A much better default behavior would be to simply substitute illegal
> UTF units with a '?' and keep going. Having to remember to sanitize all
> untrusted Unicode strings is a chore, and forgetting to do so at any
> point will lead to crashes in running code at inconvenient moments.
>
This is useful information. Thanks.
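
To make the strict versus lenient distinction concrete, here is a minimal sketch assuming present-day Phobos, which postdates this thread. std.utf.validate throws on malformed input, while std.encoding.sanitize substitutes a replacement character and carries on. For UTF-8 strings sanitize uses U+FFFD rather than the '?' suggested above. The helper name fromUntrusted is hypothetical.

```d
import std.encoding : sanitize;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: turn untrusted bytes into a string that can be
// decoded without throwing. Malformed sequences are replaced, not dropped.
string fromUntrusted(immutable(ubyte)[] raw)
{
    return sanitize(cast(string) raw);
}

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];

    // Strict handling: validation throws on the malformed byte.
    assertThrown!UTFException(validate(cast(string) raw));

    // Lenient handling: the bad byte is replaced and the rest survives.
    string clean = fromUntrusted(raw);
    validate(clean);                                // does not throw
    assert(clean[0] == 'a' && clean[$ - 1] == 'b');
}
```

Sanitizing once at the point where untrusted data enters the program keeps the rest of the code free of UTF error handling.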
Sean