Char & the Extended ascii set

Jonathan M Davis jmdavisProg at
Sat Jan 28 20:58:59 PST 2012

On Saturday, January 28, 2012 20:54:30 Era Scarecrow wrote:
>  Is there any support for the extended ascii characters? (128-255). I
> understand unicode is important, however working with some data and
> programs that don't support those, I am getting a problem that the program
> causes an exception because it isn't valid utf-8. Do I have to handle it
> all as bytes/ubytes? If I do then I lose out on many char specific
> functions. Alternatively I can rely on the C functions, but I want to avoid
> using them if I can.
> Example: note the raw data below, being 39 vs -110
> this._ID = "SPEL_wulfharth's cups"
> rhs._ID  = "SPEL_wulfharth▒s cups"
> this._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104,
> 39, 115, 32, 99, 117, 112, 115, 0]
> rhs._ID  = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104,
> -110, 115, 32, 99, 117, 112, 115, 0]
> I have compiled and made a table for the appropriate conversions to proper
> unicode, which you can then use in reverse to get it back to its previous
> state. However I'm not sure.
> //referenced from
> wchar[128] convertAsciiExtended = [
> 	0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
> 	0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
> 	0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
> 	0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
> 	0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
> 	0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
> 	0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
> 	0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
> 	0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
> 	0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
> 	0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
> 	0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
> 	0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
> 	0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
> 	0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
> 	0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];

char is a UTF-8 code unit by definition, and D code is free to assume that any
char[] or string holds valid UTF-8. A lot of the string processing code in
Phobos will throw if you give it ill-formed Unicode.

Now, you can put whatever you want in a char, but don't expect other D code to 
handle it correctly.
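
To make that concrete, here is a minimal sketch of what happens to a byte
like the -110 (146 as a ubyte) from your example. It uses std.utf.validate,
which throws a UTFException on ill-formed input. The helper name
isWellFormedUtf8 is made up for illustration and is not part of Phobos.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper (not in Phobos): reports whether the bytes form
/// well-formed UTF-8.
bool isWellFormedUtf8(immutable(ubyte)[] raw)
{
    try
    {
        // The cast only reinterprets the bytes; it performs no conversion.
        validate(cast(string) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    immutable(ubyte)[] good = [104, 39, 115];   // "h's", plain ASCII
    immutable(ubyte)[] bad  = [104, 146, 115];  // 146 == cast(ubyte) -110

    writeln(isWellFormedUtf8(good));
    writeln(isWellFormedUtf8(bad));
}
```

A lone byte in the range 128-191 is a UTF-8 continuation byte with nothing to
continue, so the second array is rejected even though the cast to string
compiles without complaint.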

The only support in Phobos for dealing with alternate encodings is 
std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1
(also known as LATIN-1), and WINDOWS-1252." So, if you can get that to do the 
conversions that you want, then there you go, but otherwise you're on your
own.

Regardless, you need to convert your chars to proper UTF-8 if you want other D 
code (and especially Phobos) to handle them correctly.
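
As a rough sketch of that conversion with std.encoding, assuming your data
is actually WINDOWS-1252 (byte 146 is the right single quotation mark U+2019
in that encoding, which would fit your example, but that is a guess on my
part): reinterpret the raw bytes as a Windows1252String and let transcode
produce a proper string. The helper name fromWindows1252 is hypothetical, not
something Phobos provides.

```d
import std.encoding : transcode, Windows1252String;
import std.stdio : writeln;

/// Hypothetical helper (not in Phobos): treats the raw bytes as
/// WINDOWS-1252 and converts them to a UTF-8 string.
string fromWindows1252(immutable(ubyte)[] raw)
{
    // Reinterpret only; Windows1252Char has the same size as ubyte.
    auto source = cast(Windows1252String) raw;

    string utf8;
    transcode(source, utf8); // the destination type selects UTF-8
    return utf8;
}

void main()
{
    // rhs._ID from the question, without the trailing 0 terminator
    // and with -110 written as the unsigned byte 146.
    immutable(ubyte)[] raw = [
        83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104,
        146, 115, 32, 99, 117, 112, 115];

    writeln(fromWindows1252(raw));
}
```

Going the other way is the same call with the types swapped: transcode a
string into a Windows1252String before writing the data back out for the
programs that expect the old encoding. If your data turns out to be in a code
page that std.encoding doesn't know about, a lookup table like the one you
posted is the fallback.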

- Jonathan M Davis

More information about the Digitalmars-d-learn mailing list