Reading ASCII file with some codes above 127 (extended ASCII)

Wed May 23 12:09:27 PDT 2012

On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
> On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
>> On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett 
>> wrote:
>>> On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
>>>> On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
>>>>> On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
>>>>>> I am reading a file that has a few extended ASCII codes 
>>>>>> (e.g. the degree symbol). Depending on how I read the file in 
>>>>>> and what I do with it, the error shows up at different 
>>>>>> points.  I'm pretty sure it all boils down to these 
>>>>>> extended ASCII codes.
>>>>>>
>>>>>> Can I just tell dmd that I'm reading a Latin1 or ISO 
>>>>>> 8859-1 file?
>>>>>> I've messed with the std.encoding module but really can't 
>>>>>> figure out what I need to do.
>>>>>>
>>>>>> There must be a simple solution to this.
>>>>>
>>>>> This seems to work:
>>>>>
>>>>>
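>>>>> import std.stdio, std.file, std.encoding;
>>>>>
>>>>> void main()
>>>>> {
>>>>>     // Reinterpret the raw file bytes as Latin-1 text.
>>>>>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>>>>>     string s;
>>>>>     // Convert Latin-1 to UTF-8, the encoding of D's string type.
>>>>>     transcode(latin, s);
>>>>>     writeln(s);
>>>>> }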
>>>>>
>>>>>
>>>>> Graham
>>>>
>>>> I thought I was in good shape with your above suggestion.  It 
>>>> does help me read and process text.  But when I go to print 
>>>> it out I have problems.
>>>>
>>>> Here is my input file:
>>>> °F
>>>>
>>>> Here is my code:
>>>> import std.stdio;
>>>> import std.string;
>>>> import std.file;
>>>> import std.encoding;
>>>>
>>>> // Main function
>>>> void main(){
>>>>  auto fout = File("out.txt","w");
>>>>  auto latinS = cast(Latin1String) read("in.txt");
>>>>  string uniS;
>>>>  transcode(latinS, uniS);
>>>>  foreach(line; uniS.splitLines()){
>>>>     transcode(line, latinS);
>>>>     fout.writeln(line);
>>>>     fout.writeln(latinS);
>>>>  }
>>>> }
>>>>
>>>> Here is the output:
>>>> Â°F
>>>> [cast(immutable(Latin1Char))176, 
>>>> cast(immutable(Latin1Char))70]
>>>>
>>>> If I print the Unicode string I get an extra weird character.
>>>> If I print the Unicode string retranslated to Latin1, I get 
>>>> weird pseudo-code.
>>>> Can you help?
>>>
>>> I tried the program and it seemed to work for me.
>>>
>>> What program are you using to read "out.txt"? Are you sure it 
>>> supports UTF-8, and knows to open the file as UTF-8? (This 
>>> looks suspiciously like a tool misinterpreting a UTF-8 string 
>>> as Latin-1.)
>>>
>>> If you're on a Unix system, what does "file in.txt out.txt" 
>>> report?
>>>
>>> Graham
>>
>> Hmmm.  I'm not communicating well.
>> I want to read and write ASCII.  The only reason I'm 
>> converting to Unicode is because D needs it (as I understand).
>>
>> Yes if I open Â°F in notepad++ and tell notepad++ that it is 
>> UTF-8, it shows °F.
>>
>> I want to:
>> 1) Read an ascii file that may have codes above 127.
>> 2) Convert to unicode so D funcs like .splitLines() can work 
>> with it.
>> 3) Convert back to ascii so that stuff like °F writes out as 
>> it was read in.
>>
>> If I open in.txt and out.txt in an ascii editor, °F should 
>> look the same in both files with the editor encoding the files 
>> as ANSI/ASCII.  I thought my program was doing just that.
>> Thanks for your assistance.
>
> To make sure we're on the same page -- ASCII is a 7-bit 
> encoding, and any character above 127 is by definition not an 
> ASCII character. At that point we're talking about an encoding 
> other than ASCII, such as UTF-8 or Latin-1.
>
> If you're reading a file that has bytes > 127, you really have 
> no choice but to specify (assume?) an encoding, Latin-1 for 
> example. There's no guarantee your input file is Latin-1, 
> though, and garbage-in will result in garbage-out.
>
> So I think what you're trying to do is
>
> 1. Read a Latin-1 file into Unicode (internally in D).
> 2. Do splitLines(), etc., generating some result.
> 3. Convert the result back to Latin-1, and output it.
>
> Is that right?
> Graham

Exactly.
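
For what it's worth, here is a minimal sketch of those three steps, assuming the input really is Latin-1 (ISO 8859-1). The helper names latin1ToUtf8 and utf8ToLatin1 are made up for illustration; in.txt and out.txt are the file names from the earlier program. The key difference from that program is the last step: writeln formats a Latin1String as an array of Latin1Char values (hence the "cast(immutable(Latin1Char))176" output), so the sketch writes the re-encoded bytes directly with std.file.write instead.

```d
import std.array : join;
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.string : splitLines;

// Hypothetical helper: decode Latin-1 bytes into a UTF-8 D string.
string latin1ToUtf8(Latin1String latin)
{
    string utf8;
    transcode(latin, utf8);
    return utf8;
}

// Hypothetical helper: encode a UTF-8 D string back into Latin-1 bytes.
Latin1String utf8ToLatin1(string utf8)
{
    Latin1String latin;
    transcode(utf8, latin);
    return latin;
}

void main()
{
    // 1. Read the file and treat its bytes as Latin-1.
    //    This is an assumption about the input, not something D can detect.
    auto latinIn = cast(Latin1String) read("in.txt");
    string text = latin1ToUtf8(latinIn);

    // 2. Work on the text as an ordinary UTF-8 string.
    string[] lines = text.splitLines();

    // 3. Re-encode as Latin-1 and write the bytes unformatted.
    //    splitLines drops the line terminators, so "\n" is put back here.
    string result = lines.join("\n") ~ "\n";
    write("out.txt", utf8ToLatin1(result));
}
```

With an in.txt containing the two Latin-1 bytes 0xB0 0x46 (°F), out.txt should contain the same two bytes followed by a newline, rather than the three-byte UTF-8 form 0xC2 0xB0 0x46 that shows up as "Â°F" in a Latin-1 editor. Note that line endings are normalised to "\n" by this sketch.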