First Impressions

Sun Oct 1 14:56:29 PDT 2006

Anders F Björklund wrote:
> BCS wrote:
> 
>> One alternative that I could live with would use 4 character types:
>>
>> char    one codeunit in whatever encoding the runtime uses
>> schar    one 8 bit code unit (ASCII or utf-8)
>> wchar    one 16 bit code unit (same as before)
>> dchar    one 32 bit code unit (same as before)
> 
> 
> We have that already:
> 
> ubyte   one codeunit in whatever encoding the runtime uses
> char    one 8 bit code unit (ASCII or utf-8)

ubyte is an 8-bit unsigned number, not a character encoding.

[after some more reading]
I may be just rambling but...

How about having the type of the value denote the encoding? A type for
ASCII would only ever store ASCII (UTF-8 would be invalid), and likewise
for UTF-8, UTF-16 and UTF-32. Direct assignment between them would either
be illegal (as with, say, int[] -> Object) or be implicitly converted (as
with int -> real). Casts would be provided. Indexing would be by
codepoint. Non-array variables would be big enough to store any codepoint
(ASCII -> 8-bit, non-ASCII -> 32-bit). Some sort of "whatever the system
uses" data type (à la C's int) could be used for actual output, maybe
even escaping anything that won't get displayed correctly.
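
To make that a bit more concrete, here is a rough sketch of the behaviour
I have in mind. It is written in Python rather than D purely for
illustration, and every name in it (EncodedString, AsciiString,
Utf8String, Utf16String, convert, code_units) is made up for this
example, not an existing library or a proposed syntax.

```python
class EncodedString:
    """Hypothetical base type: the subclass names the encoding it stores."""
    encoding = None   # set by each subclass
    unit_size = 1     # bytes per code unit

    def __init__(self, text):
        # Raises UnicodeEncodeError if the text does not fit the encoding,
        # e.g. non-ASCII text handed to AsciiString.
        self._data = text.encode(self.encoding)

    def __len__(self):
        # Length in code points, not code units.
        return len(self._data.decode(self.encoding))

    def __getitem__(self, i):
        # Indexing by code point; the decode is the visible overhead.
        return self._data.decode(self.encoding)[i]

    def code_units(self):
        # Number of code units actually stored.
        return len(self._data) // self.unit_size

    def convert(self, target):
        # The explicit "cast": re-encode into another string type.
        return target(self._data.decode(self.encoding))


class AsciiString(EncodedString):
    encoding = "ascii"
    unit_size = 1


class Utf8String(EncodedString):
    encoding = "utf-8"
    unit_size = 1


class Utf16String(EncodedString):
    encoding = "utf-16-le"
    unit_size = 2
```

With this, Utf8String("na\u00efve") has 5 code points but 6 code units,
converting it to Utf16String gives 5 code units, and converting it to
AsciiString fails instead of silently storing something that is not
ASCII.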

This all sort of follows the idea of "call it what it is and don't hide 
the overhead". 1) Characters are a different type of data than numbers 
(see the threads on bool) and as such, that should be reflected in the 
type system. 2) I have no problem with high overhead operations as long 
as I can avoid using them when I don't want to.

> 
> There is no support in Phobos for runtime/native encodings,
> but you can use the "iconv" library to do such conversions ?
> 
>> (using the same thing for ASCII and UTF-8 may be a problem, but this 
>> isn't my field)
> 
> 
> All ASCII characters are valid UTF-8 code units, so it's OK.
> 

But UTF-8 is not ASCII.
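
A small illustration of the asymmetry, again in Python only because it
is short (is_ascii is a helper made up for this example): every ASCII
string encodes to the same bytes under both encodings, but a UTF-8
string can contain bytes that are not ASCII at all.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte is a 7-bit ASCII code unit."""
    return all(b < 0x80 for b in data)


# ASCII text: the ASCII and UTF-8 encodings are byte-for-byte identical.
ascii_bytes = "char".encode("ascii")
same_as_utf8 = ascii_bytes == "char".encode("utf-8")

# Non-ASCII text: U+00F6 becomes the two bytes 0xC3 0xB6 in UTF-8,
# and neither of those is a valid ASCII code unit.
utf8_bytes = "Bj\u00f6rklund".encode("utf-8")
```

So code that is handed "8 bit code units" cannot assume it is looking at
ASCII just because ASCII would have been valid there.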

> --anders