Reading ASCII file with some codes above 127 (extended ASCII)

Wed May 23 12:38:01 PDT 2012

On Wednesday, 23 May 2012 at 19:09:29 UTC, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
>> On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
>>> On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett 
>>> wrote:
>>>> On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
>>>>> On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett 
>>>>> wrote:
>>>>>> On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
>>>>>>> I am reading a file that has a few extended ASCII codes 
>>>>>>> (e.g. the degree symbol). Depending on how I read the file 
>>>>>>> in and what I do with it, the error shows up at different 
>>>>>>> points.  I'm pretty sure it all boils down to these 
>>>>>>> extended ASCII codes.
>>>>>>>
>>>>>>> Can I just tell dmd that I'm reading a Latin1 or ISO 
>>>>>>> 8859-1 file?
>>>>>>> I've messed with the std.encoding module but really can't 
>>>>>>> figure out what I need to do.
>>>>>>>
>>>>>>> There must be a simple solution to this.
>>>>>>
>>>>>> This seems to work:
>>>>>>
>>>>>>
>>>>>> import std.stdio, std.file, std.encoding;
>>>>>>
>>>>>> void main()
>>>>>> {
>>>>>>     // Treat the raw bytes as Latin-1, then transcode to UTF-8.
>>>>>>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>>>>>>     string s;
>>>>>>     transcode(latin, s);
>>>>>>     writeln(s);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Graham
>>>>>
>>>>> I thought I was in good shape with your above suggestion.  
>>>>> It does help me read and process text.  But when I go to 
>>>>> print it out I have problems.
>>>>>
>>>>> Here is my input file:
>>>>> °F
>>>>>
>>>>> Here is my code:
>>>>> import std.stdio;
>>>>> import std.string;
>>>>> import std.file;
>>>>> import std.encoding;
>>>>>
>>>>> // Main function
>>>>> void main(){
>>>>> auto fout = File("out.txt","w");
>>>>> auto latinS = cast(Latin1String) read("in.txt");
>>>>> string uniS;
>>>>> transcode(latinS, uniS);
>>>>> foreach(line; uniS.splitLines()){
>>>>>    transcode(line, latinS);
>>>>>    fout.writeln(line);
>>>>>    fout.writeln(latinS);
>>>>> }
>>>>> }
>>>>>
>>>>> Here is the output:
>>>>> Â°F
>>>>> [cast(immutable(Latin1Char))176, 
>>>>> cast(immutable(Latin1Char))70]
>>>>>
>>>>> If I print the Unicode string, I get an extra weird 
>>>>> character.
>>>>> If I print the Unicode string retranslated to Latin1, I 
>>>>> get weird pseudo-code.
>>>>> Can you help?
>>>>
>>>> I tried the program and it seemed to work for me.
>>>>
>>>> What program are you using to read "out.txt"? Are you sure 
>>>> it supports UTF-8, and knows to open the file as UTF-8? 
>>>> (This looks suspiciously like a tool's attempt to 
>>>> misinterpret a UTF-8 string as Latin-1.)
>>>>
>>>> If you're on a Unix system, what does "file in.txt out.txt" 
>>>> report?
>>>>
>>>> Graham
>>>
>>> Hmmm.  I'm not communicating well.
>>> I want to read and write ASCII.  The only reason I'm 
>>> converting to Unicode is because D needs it (as I understand).
>>>
>>> Yes if I open Â°F in notepad++ and tell notepad++ that it 
>>> is UTF-8, it shows °F.
>>>
>>> I want to:
>>> 1) Read an ascii file that may have codes above 127.
>>> 2) Convert to unicode so D funcs like .splitLines() can work 
>>> with it.
>>> 3) Convert back to ascii so that stuff like °F writes out as 
>>> it was read in.
>>>
>>> If I open in.txt and out.txt in an ascii editor, °F should 
>>> look the same in both files with the editor encoding the 
>>> files as ANSI/ASCII.  I thought my program was doing just 
>>> that.
>>> Thanks for your assistance.
>>
>> To make sure we're on the same page -- ASCII is a 7-bit 
>> encoding, and any character above 127 is by definition not an 
>> ASCII character. At that point we're talking about an encoding 
>> other than ASCII, such as UTF-8 or Latin-1.
>>
>> If you're reading a file that has bytes > 127, you really have 
>> no choice but to specify (assume?) an encoding, Latin-1 for 
>> example. There's no guarantee your input file is Latin-1, 
>> though, and garbage-in will result in garbage-out.
>>
>> So I think what you're trying to do is
>>
>> 1. Read a Latin-1 file into Unicode (internally in D).
>> 2. Do splitLines(), etc., generating some result.
>> 3. Convert the result back to Latin-1, and output it.
>>
>> Is that right?
>> Graham
>
> Exactly.

This works, though it's ugly:

     foreach(line; uniS.splitLines()) {
        transcode(line, latinS);
        // View the Latin-1 bytes as chars so writeln() emits them unchanged.
        fout.writeln(cast(const(char)[]) latinS);
     }

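At the storage level, a Latin1String is an array of single-byte 
Latin1Char values (ubyte underneath). Handing it to writeln() 
directly prints each element as an enum value, which is the 
pseudo-code you saw. Casting it to a char array gives writeln() 
a string-like thing whose bytes it writes out unchanged.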
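
If the cast bothers you, here is a rough sketch of the whole round 
trip that avoids it by writing the Latin-1 bytes with 
File.rawWrite(). The helper names fromLatin1 and toLatin1 are made 
up for this example, and it assumes the input really is Latin-1 
and that "\n" line endings are acceptable in the output. It first 
creates its own in.txt so it can run standalone.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.stdio : File;
import std.string : splitLines;

// Hypothetical helper: decode Latin-1 bytes into a UTF-8 string.
string fromLatin1(Latin1String latin)
{
    string utf8;
    transcode(latin, utf8);
    return utf8;
}

// Hypothetical helper: encode a UTF-8 string as Latin-1 bytes.
Latin1String toLatin1(string utf8)
{
    Latin1String latin;
    transcode(utf8, latin);
    return latin;
}

void main()
{
    // Sample input: the Latin-1 bytes for the degree sign, 'F', newline.
    immutable(ubyte)[] sample = [0xB0, 0x46, 0x0A];
    write("in.txt", sample);

    // 1. Read the file and treat its bytes as Latin-1.
    auto latinIn = cast(Latin1String) read("in.txt");

    // 2. Work in UTF-8 so splitLines() and friends behave.
    string text = fromLatin1(latinIn);

    // 3. Convert each line back to Latin-1 and write the raw bytes.
    auto fout = File("out.txt", "wb");
    foreach (line; text.splitLines())
    {
        fout.rawWrite(toLatin1(line));
        fout.rawWrite("\n");
    }
    fout.close();

    // The output bytes should match the input bytes.
    assert(cast(const(ubyte)[]) read("out.txt") == sample);
}
```

rawWrite() never tries to format the array elements, so there is 
no need to pretend the Latin-1 data is a char array.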

Graham