Selectable encodings
James Dunne
james.jdunne at gmail.com
Thu Apr 6 07:53:26 PDT 2006
Oskar Linde wrote:
> Mike Capp skrev:
>
>> In article <e12j34$2gi2$1 at digitaldaemon.com>, John C says...
>>
>>> version (utf8) alias mlchar char;
>>
>> Apologies for going off at a tangent to your question, but I've never quite
>> understood what D thinks it's doing here. If char[] is an array of characters,
>> then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
>> is char[] an array of characters from some other charset (e.g. the subset of
>> UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
>> string (in which case I suspect quite a lot of string-handling code is badly
>> broken)?
>
> It is the latter. But I don't think much of the string handling code is
> broken because of that.
>
> /Oskar

The name "char" is really a misnomer for a type used to hold UTF-8 encoded
strings. It should be named something closer to "code unit for UTF-8
encoding". For my own research language I've chosen what I believe to be a
nice type naming system:

    char  - 32-bit Unicode code point
    u8cu  - UTF-8 code unit
    u16cu - UTF-16 code unit
    u32cu - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe
char, but I really mean it to just store a full Unicode character, so
that code working on strings of chars can safely assume
character index == array index.
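
A minimal sketch of the code point versus code unit distinction, written in
Python rather than D purely for illustration. The function name
code_unit_counts and the sample strings are made up for this example; the
comments map each count onto the type names proposed above, which is an
assumption about how those types would behave, not a description of any
existing implementation.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units of `text` in each encoding form."""
    return {
        # Python's str is indexed by code point, like the proposed `char`.
        "code_points": len(text),
        # One byte per UTF-8 code unit (the proposed u8cu, D's char).
        "utf8_units": len(text.encode("utf-8")),
        # Two bytes per UTF-16 code unit (the proposed u16cu).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Four bytes per UTF-32 code unit (the proposed u32cu).
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


word = "na\u00efve"          # "naive" with U+00EF as its third code point
utf8 = word.encode("utf-8")  # the byte view, comparable to a D char[]

# Index 2 of the code point sequence is the whole character U+00EF.
assert word[2] == "\u00ef"
# Index 2 of the UTF-8 bytes is only the lead byte of that character.
assert utf8[2] == 0xC3

print(code_unit_counts(word))          # sample with one 2-byte UTF-8 sequence
print(code_unit_counts("\U0001D11E"))  # U+1D11E, outside the 16-bit range
```

Only the code point view gives character index == array index here. Even
then, a combining sequence such as "e" followed by U+0301 still occupies two
code points, so the guarantee holds per code point, not per displayed glyph.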
--
James Dunne