Char & the Extended ascii set

Jonathan M Davis jmdavisProg at gmx.com
Sat Jan 28 20:58:59 PST 2012


On Saturday, January 28, 2012 20:54:30 Era Scarecrow wrote:
>  Is there any support for the extended ascii characters (128-255)? I
> understand unicode is important, however working with some data and
> programs that don't support those, I am getting a problem that the program
> causes an exception because it isn't valid utf-8. Do I have to handle it
> all as bytes/ubytes? If I do then I lose out on many char specific
> functions. Alternatively I can rely on the C functions, but I want to avoid
> using them if I can.
> 
> Example: note the raw data below, being 39 vs -110
> 
> this._ID = "SPEL_wulfharth's cups"
> rhs._ID  = "SPEL_wulfharth▒s cups"
> 
> this._ID = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104,
> 39, 115, 32, 99, 117, 112, 115, 0]
> rhs._ID  = [83, 80, 69, 76, 95, 119, 117, 108, 102, 104, 97, 114, 116, 104,
> -110, 115, 32, 99, 117, 112, 115, 0]
> 
> 
> I have compiled and made a table for the appropriate conversions to proper
> Unicode, which you can then use in reverse to get it back to its previous
> state. However I'm not sure.
> 
> //referenced from http://ascii-table.com/ascii-extended-pc-list.php
> wchar[128] convertAsciiExtended = [
> 	0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
> 	0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
> 	0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
> 	0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
> 	0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
> 	0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
> 	0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
> 	0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
> 	0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
> 	0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
> 	0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
> 	0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
> 	0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
> 	0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
> 	0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
> 	0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];

char is UTF-8 by definition, and D code is free to assume that that's the case.
A lot of the string processing code in Phobos will throw if you give it
ill-formed Unicode.

Now, you can put whatever you want in a char, but don't expect other D code to 
handle it correctly.
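
To make that concrete, here is a minimal sketch of checking raw bytes before
treating them as a string. It relies on std.utf.validate, which throws a
UTFException on ill-formed input; the helper name isWellFormedUtf8 is made up
for this example and is not part of Phobos.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `raw` is well-formed UTF-8, false otherwise.
bool isWellFormedUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units; no copy is made.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // "h's" with a plain ASCII apostrophe (39) is valid UTF-8.
    ubyte[] ascii = [104, 39, 115];
    assert(isWellFormedUtf8(ascii));

    // The same text with byte 146 (-110 as a signed byte) is not valid:
    // 0x92 is a continuation byte with no lead byte in front of it.
    ubyte[] extended = [104, 146, 115];
    assert(!isWellFormedUtf8(extended));
}
```

Checking up front like this lets you decide how to convert the data before any
string function gets a chance to throw on it.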

The only support in Phobos for dealing with alternate encodings is 
std.encoding. It currently supports "UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1
(also known as LATIN-1), and WINDOWS-1252." So, if you can get that to do the 
conversions that you want, then there you go, but otherwise you're on your 
own.
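
As a rough sketch of the std.encoding route, the following uses transcode with
Windows1252String. It assumes the input really is WINDOWS-1252, which I can't
confirm for your data; in that encoding byte 146 is the right single quotation
mark (U+2019), which would at least fit the apostrophe position in your
example. The function name windows1252ToUtf8 is just an illustration.

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical helper: converts bytes assumed to be WINDOWS-1252 to UTF-8.
string windows1252ToUtf8(immutable(ubyte)[] raw)
{
    string result;
    // Reinterpret the raw bytes as WINDOWS-1252 code units, then transcode.
    transcode(cast(Windows1252String) raw, result);
    return result;
}

unittest
{
    // 'h', byte 146, 's': byte 146 becomes U+2019 under WINDOWS-1252.
    immutable(ubyte)[] raw = [104, 146, 115];
    assert(windows1252ToUtf8(raw) == "h\u2019s");
}
```

Going the other way (UTF-8 back to WINDOWS-1252) is the same call with the
source and destination types swapped.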

Regardless, you need to convert your chars to proper UTF-8 if you want other D 
code (and especially Phobos) to handle them correctly.
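
If std.encoding doesn't cover your encoding, a table lookup along the lines of
what you posted would be one way to do the conversion by hand. This is only a
sketch: it assumes the bytes are in the PC "extended ASCII" set (code page 437)
that your table appears to describe, it reuses your table as-is without my
having checked every entry, and the function name cp437ToUtf8 is made up.

```d
// Mapping for bytes 128-255, copied from the table quoted earlier in this post.
immutable wchar[128] convertAsciiExtended = [
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0];

/// Hypothetical helper: converts bytes assumed to be code page 437 to UTF-8.
string cp437ToUtf8(const(ubyte)[] raw)
{
    string result;
    foreach (b; raw)
    {
        // Bytes below 128 are plain ASCII and map to themselves;
        // the rest are looked up in the table.
        dchar c = b < 128 ? cast(dchar) b
                          : cast(dchar) convertAsciiExtended[b - 128];
        // Appending a dchar to a string encodes it as UTF-8.
        result ~= c;
    }
    return result;
}

unittest
{
    // With this table, byte 146 maps to U+00C6.
    immutable(ubyte)[] raw = [104, 146, 115];
    assert(cp437ToUtf8(raw) == "h\u00C6s");
}
```

Note that byte 146 comes out as U+00C6 with this table, not as an apostrophe,
so it is worth double-checking which encoding the data actually uses before
settling on a table.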

- Jonathan M Davis

