Selectable encodings

James Dunne james.jdunne at gmail.com
Thu Apr 6 07:53:26 PDT 2006


Oskar Linde wrote:
> Mike Capp skrev:
> 
>> In article <e12j34$2gi2$1 at digitaldaemon.com>, John C says...
>>
>>> version (utf8) alias mlchar char;
>>
>>
>> Apologies for going off at a tangent to your question, but I've never quite
>> understood what D thinks it's doing here. If char[] is an array of characters,
>> then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding.
>> So is char[] an array of characters from some other charset (e.g. the subset
>> of UTF-8 representable in one byte), or is it an array of bytes encoding a
>> UTF-8 string (in which case I suspect quite a lot of string-handling code is
>> badly broken)?
> 
> 
> It is the latter. But I don't think much of the string handling code is 
> broken because of that.
> 
> /Oskar

The name char is really a misnomer for the element type of a UTF-8 
encoded string.  Something closer to "code unit for UTF-8 encoding" would 
be more accurate.  For my own research language I've chosen what I 
believe to be a nice type naming system:

     char            - 32-bit Unicode code point

     u8cu            - UTF-8 code unit
     u16cu           - UTF-16 code unit
     u32cu           - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe 
char, but I really mean for it to store one full Unicode code point, so 
that in a string of chars, code point index == array index.
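
To make the code unit vs. code point distinction concrete, here is a 
minimal sketch in Python (not D, and not the research language described 
above).  The function names are made up for this example; the only 
library call is the standard str.encode.  It assumes Python's str is 
indexed by code point, which holds for Python 3.

```python
# Hypothetical helpers, named only for this illustration.

def code_points(text):
    """One integer per Unicode code point (what 'char' above would hold)."""
    return [ord(c) for c in text]

def utf8_code_units(text):
    """One integer per UTF-8 code unit (what 'u8cu' above would hold)."""
    return list(text.encode("utf-8"))

def utf16_code_units(text):
    """One integer per UTF-16 code unit (what 'u16cu' above would hold)."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

word = "na\u00efve"  # 'i' with diaeresis, U+00EF, takes two bytes in UTF-8

print(len(code_points(word)))      # 5
print(len(utf8_code_units(word)))  # 6

# Index 2 in the code point array is the whole character U+00EF ...
print(hex(code_points(word)[2]))      # 0xef
# ... but index 2 in the UTF-8 array is only the lead byte of its sequence.
print(hex(utf8_code_units(word)[2]))  # 0xc3

# Outside the Basic Multilingual Plane, UTF-16 also needs two code units.
clef = "\U0001D11E"
print([hex(u) for u in utf16_code_units(clef)])  # ['0xd834', '0xdd1e']
```

Only the code point array gives index == position for every input; the 
UTF-8 and UTF-16 arrays do so only for text that happens to fit in one 
code unit per code point.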

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--@ V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne


