First Impressions

Sun Oct 1 15:30:57 PDT 2006

BCS wrote:

> ubyte is an 8 bit unsigned number not a character encoding.

Right, I actually meant ubyte[], but void[] might have been
more accurate for representing any (even non-UTF) encoding.
(I used ubyte[] in my mapping functions, since they only
handle legacy 8-bit encodings like "cp1252" or "macroman".)
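
To illustrate what such a mapping does, here is a minimal sketch
in Python (not D, and not the actual mapping functions mentioned
above). The helper name legacy_to_utf8 is made up for this example;
it assumes Python's built-in "cp1252" and "mac_roman" codecs.

```python
def legacy_to_utf8(raw: bytes, encoding: str) -> bytes:
    """Map bytes in a legacy 8-bit encoding to UTF-8 (hypothetical helper)."""
    # Each input byte is looked up in the codec's table to get a code
    # point, which is then re-encoded as one or more UTF-8 code units.
    return raw.decode(encoding).encode("utf-8")

# The same byte value means different characters in different legacy
# encodings, so the raw data is just bytes until the encoding is known.
as_cp1252 = legacy_to_utf8(b"\xe9", "cp1252")
as_macroman = legacy_to_utf8(b"\xe9", "mac_roman")
```

The two results differ even though the input byte is identical,
which is the argument for keeping undecoded data in a byte array
rather than in a character type.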

Re-reading your post, it seems to me that you were talking more
about an alias for the UTF type most suitable for the OS?

I guess UTF-8 would be a good choice if the operating system
doesn't use Unicode, since strings then have to go through a
conversion lookup anyway. Otherwise the existing "wchar_t" isn't
bad for such a UTF type: it will be UTF-16 on Windows and UTF-32
on Unix (linux, darwin, ...).
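
A quick way to check the wchar_t width on a given platform, as a
small Python sketch using the standard ctypes module. The helper
wchar_utf_name is hypothetical, and it assumes that a 2-byte
wchar_t holds UTF-16 and a 4-byte one holds UTF-32.

```python
import ctypes

# Size in bytes of the C wchar_t on the platform running this script.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)

def wchar_utf_name(size: int) -> str:
    """Name the UTF form a wchar_t of this size would hold (assumption)."""
    return {2: "UTF-16", 4: "UTF-32"}.get(size, "unknown")

print(WCHAR_BYTES, wchar_utf_name(WCHAR_BYTES))
```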

>> All ASCII characters are valid UTF-8 code units, so it's OK.
> 
> But UTF-8 is not ASCII.

So you would like a char "type" that would only take ASCII?
I guess that is *one* way of dealing with it. You could also
have a wchar type that wouldn't accept surrogates (BMP only).

Then it would be OK to index them by code unit / character
(since each allowed character would fit into one code unit).
Sounds a little like signed vs. unsigned integers, actually?
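
As a rough sketch of those two restricted types (in Python, with
made-up helper names, not anything D provides): a string "fits"
the ASCII-only char if every character is below 0x80, and "fits"
the BMP-only wchar if no character needs a surrogate pair. When
it fits, the code unit count equals the character count.

```python
def fits_ascii_char(s: str) -> bool:
    """True if every character is ASCII, i.e. one UTF-8 code unit each."""
    return all(ord(c) < 0x80 for c in s)

def fits_bmp_wchar(s: str) -> bool:
    """True if every character is a BMP non-surrogate: one UTF-16 unit each."""
    return all(ord(c) <= 0xFFFF and not 0xD800 <= ord(c) <= 0xDFFF
               for c in s)

def utf8_units(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed for s."""
    return len(s.encode("utf-8"))

def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed for s."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids a BOM.
    return len(s.encode("utf-16-le")) // 2
```

For "hello" both counts equal the number of characters, so
indexing by code unit is safe. A single character outside the BMP,
such as U+1D11E, takes four UTF-8 code units and two UTF-16 code
units, which is exactly the case the restricted types would reject.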

Then again, five character types would be even worse than
the three we have now (char, wchar, dchar).

--anders