First Impressions
BCS
BCS at pathlink.com
Sun Oct 1 14:56:29 PDT 2006
Anders F Björklund wrote:
> BCS wrote:
>
>> One alternative that I could live with would use 4 character types:
>>
>> char one codeunit in whatever encoding the runtime uses
>> schar one 8 bit code unit (ASCII or utf-8)
>> wchar one 16 bit code unit (same as before)
>> dchar one 32 bit code unit (same as before)
>
>
> We have that already:
>
> ubyte one codeunit in whatever encoding the runtime uses
> char one 8 bit code unit (ASCII or utf-8)
ubyte is an 8-bit unsigned number, not a character type.
[after some more reading]
I may be just rambling, but...
How about having the type of the value denote the encoding? A type for
ASCII would only ever store ASCII (UTF-8 would be invalid), and likewise
for UTF-8, UTF-16 and UTF-32. Direct assignment between them would either
be illegal (as with, say, int[] -> Object) or implicitly converted (as
with int -> real). Casts would be provided.
Indexing would be by codepoint. Non-array variables would be big enough
to store any codepoint (ASCII -> 8-bit, !ASCII -> 32-bit). Some sort of
"whatever the system uses" data type (à la C's int) could be used for
actual output, maybe even escaping anything that won't get displayed
correctly.
This all sort of follows the idea of "call it what it is and don't hide
the overhead". 1) Characters are a different type of data than numbers
(see the threads on bool) and as such, that should be reflected in the
type system. 2) I have no problem with high overhead operations as long
as I can avoid using them when I don't want to.
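A rough sketch of the "type denotes the encoding" idea, written in Python rather than D purely for illustration. The class names AsciiString and Utf8String and their methods are hypothetical, not anything in D or Phobos. The point is only that the ASCII type can index in constant time, the UTF-8 type pays a visible cost for codepoint indexing, and conversion between them is explicit.

```python
class AsciiString:
    """Hypothetical type that only ever stores ASCII (one byte per character)."""

    def __init__(self, data: bytes):
        if any(b > 0x7F for b in data):
            raise ValueError("non-ASCII byte in AsciiString")
        self._data = bytes(data)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, i: int) -> str:
        # Cheap: every character is exactly one byte.
        return chr(self._data[i])

    def to_utf8(self) -> "Utf8String":
        # Always succeeds: any ASCII byte sequence is also valid UTF-8.
        return Utf8String(self._data)


class Utf8String:
    """Hypothetical type that only ever stores valid UTF-8."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
        self._data = bytes(data)

    def __len__(self) -> int:
        # Count codepoints: every byte that is not a continuation byte (10xxxxxx).
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def __getitem__(self, i: int) -> str:
        # Indexing by codepoint is O(n) here; the overhead is not hidden.
        return self._data.decode("utf-8")[i]

    def to_ascii(self) -> AsciiString:
        # The explicit "cast"; raises ValueError if any codepoint is not ASCII.
        return AsciiString(self._data)
```

With this sketch, Utf8String("Björklund".encode("utf-8")) has a length of 9 codepoints although it holds 10 bytes, and calling to_ascii() on it fails instead of silently producing garbage.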
>
> There is no support in Phobos for runtime/native encodings,
> but you can use the "iconv" library to do such conversions ?
>
>> (using the same thing for ASCII and UTF-8 may be a problem, but this
>> isn't my field)
>
>
> All ASCII characters are valid UTF-8 code units, so it's OK.
>
But UTF-8 is not ASCII.
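The asymmetry can be checked directly. This is a minimal sketch in Python (not D); the helper names is_ascii and is_utf8 are made up for the example.

```python
def is_ascii(data: bytes) -> bool:
    # ASCII only uses the values 0x00..0x7F.
    return all(b < 0x80 for b in data)


def is_utf8(data: bytes) -> bool:
    # Let the standard decoder decide whether the bytes are valid UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_text = b"anders"
utf8_text = "Björklund".encode("utf-8")

# Every ASCII string is valid UTF-8 ...
print(is_ascii(ascii_text), is_utf8(ascii_text))
# ... but a valid UTF-8 string need not be ASCII.
print(is_ascii(utf8_text), is_utf8(utf8_text))
```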
> --anders