Reading ASCII file with some codes above 127 (extended ASCII)

Era Scarecrow rtcvb32 at yahoo.com
Sun May 13 14:50:00 PDT 2012


On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> I am reading a file that has a few extended ASCII codes (e.g. 
> degree symbol). Depending on how I read the file in and what I 
> do with it the error shows up at different points.  I'm pretty 
> sure it all boils down to these extended ASCII codes.
>
> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 
> file?
> I've messed with the std.encoding module but really can't 
> figure out what I need to do.
>
> There must be a simple solution to this.

  Same here. I ended up writing a custom array converter: if the 
input contains any codes of 128 or above, it converts them and 
returns a new array; otherwise it returns the original buffer. 
Maybe this is wrong, but it works for me.

import std.utf;
import std.ascii;

//conversion table of extended ascii to unicode for text compares.
//0x80-0x9F follow Windows-1252 rather than Latin-1;
//0xA0-0xFF are the same in both.
//only 128-255
private immutable wchar[] extAscii = [
   0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
   0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
   0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
   0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
   0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
   0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
   0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
   0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
   0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
   0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
   0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
   0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
   0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
   0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
   0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
   0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

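/**Since I can't find a good explanation of conversion, this is
custom made.
    If nothing needs to be converted, it returns the original
buffer.*/
char[] ascii2char(ubyte[] input) {
   char[] o;   //stays empty until the first 128+ byte is found

   foreach(i, b; input) {
     if (b & 0x80) {
       //first 128+ byte: start from the plain-ascii prefix seen so far
       if (!o.length)
         o = cast(char[]) input[0 .. i];

       //append the mapped code point as UTF-8 (std.utf.encode)
       encode(o, extAscii[b - 0x80]);
     } else if (o.length)
       o ~= b;   //plain ascii after conversion has started
   }

   return o.length ? o : cast(char[]) input;
}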
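
For comparison, here is a minimal sketch of the std.encoding route the 
question asks about. It assumes the file really is ISO 8859-1; the 
helper name decodeLatin1 is made up for illustration and is not part 
of Phobos. Only Latin1String and transcode come from std.encoding.

```d
import std.encoding : Latin1String, transcode;

// Hypothetical helper (name is an assumption, not a Phobos function):
// reinterpret the raw bytes as Latin-1 and transcode them to UTF-8.
string decodeLatin1(const(ubyte)[] raw) {
    string utf8;
    // idup gives an immutable copy, so the cast to Latin1String is safe.
    transcode(cast(Latin1String) raw.idup, utf8);
    return utf8;
}

void main() {
    // "20", then the degree sign as the single byte 0xB0, then "C"
    ubyte[] raw = [0x32, 0x30, 0xB0, 0x43];
    assert(decodeLatin1(raw) == "20\u00B0C");
}
```

If the data uses the 0x80-0x9F range the way the table above does, 
std.encoding also has a Windows1252String type that could probably be 
swapped in the same way.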

