Reading ASCII file with some codes above 127 (exten ascii)
Graham Fawcett
fawcett at uwindsor.ca
Wed May 23 12:01:52 PDT 2012
On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
> On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
>> On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
>>> On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
>>>> On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
>>>>> I am reading a file that has a few extended ASCII codes
>>>>> (e.g. degree symbol). Depending on how I read the file in
>>>>> and what I do with it, the error shows up at different
>>>>> points. I'm pretty sure it all boils down to these
>>>>> extended ASCII codes.
>>>>>
>>>>> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1
>>>>> file?
>>>>> I've messed with the std.encoding module but really can't
>>>>> figure out what I need to do.
>>>>>
>>>>> There must be a simple solution to this.
>>>>
>>>> This seems to work:
>>>>
>>>>
>>>> import std.stdio, std.file, std.encoding;
>>>>
>>>> void main()
>>>> {
>>>>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>>>>     string s;
>>>>     transcode(latin, s);
>>>>     writeln(s);
>>>> }
>>>>
>>>>
>>>> Graham
>>>
>>> I thought I was in good shape with your above suggestion. It
>>> does help me read and process text. But when I go to print
>>> it out I have problems.
>>>
>>> Here is my input file:
>>> °F
>>>
>>> Here is my code:
>>> import std.stdio;
>>> import std.string;
>>> import std.file;
>>> import std.encoding;
>>>
>>> // Main function
>>> void main(){
>>>     auto fout = File("out.txt","w");
>>>     auto latinS = cast(Latin1String) read("in.txt");
>>>     string uniS;
>>>     transcode(latinS, uniS);
>>>     foreach(line; uniS.splitLines()){
>>>         transcode(line, latinS);
>>>         fout.writeln(line);
>>>         fout.writeln(latinS);
>>>     }
>>> }
>>>
>>> Here is the output:
>>> °F
>>> [cast(immutable(Latin1Char))176,
>>> cast(immutable(Latin1Char))70]
>>>
>>> If I print the Unicode string I get an extra weird character.
>>> If I print the Unicode string retranslated to Latin1, I get
>>> weird pseudo-code.
>>> Can you help?
>>
>> I tried the program and it seemed to work for me.
>>
>> What program are you using to read "out.txt"? Are you sure it
>> supports UTF-8, and knows to open the file as UTF-8? (This
>> looks suspiciously like a tool's attempt to misinterpret a
>> UTF-8 string as Latin-1.)
>>
>> If you're on a Unix system, what does "file in.txt out.txt"
>> report?
>>
>> Graham
>
> Hmmm. I'm not communicating well.
> I want to read and write ASCII. The only reason I'm converting
> to Unicode is because D needs it (as I understand).
>
> Yes if I open °F in notepad++ and tell notepad++ that it is
> UTF-8, it shows °F.
>
> I want to:
> 1) Read an ascii file that may have codes above 127.
> 2) Convert to unicode so D funcs like .splitLines() can work
> with it.
> 3) Convert back to ascii so that stuff like °F writes out as
> it was read in.
>
> If I open in.txt and out.txt in an ascii editor, °F should
> look the same in both files with the editor encoding the files
> as ANSI/ASCII. I thought my program was doing just that.
> Thanks for your assistance.

To make sure we're on the same page -- ASCII is a 7-bit encoding,
and any character above 127 is by definition not an ASCII
character. At that point we're talking about an encoding other
than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no
choice but to specify (assume?) an encoding, Latin-1 for example.
There's no guarantee your input file is Latin-1, though, and
garbage-in will result in garbage-out.
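
To illustrate why the encoding matters, here is a minimal sketch, assuming
the input really is Latin-1. It shows that the degree sign is one byte (176)
in Latin-1 but two bytes in UTF-8, which is probably the "extra weird
character" you see when a UTF-8 file is viewed as Latin-1/ANSI. The helper
name latin1ToUtf8 is made up for this example.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writeln;

// Hypothetical helper: treat raw bytes as Latin-1 and return UTF-8 text.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    auto latin = cast(Latin1String) raw; // assumption: the bytes are Latin-1
    string utf8;
    transcode(latin, utf8);              // Latin-1 -> UTF-8
    return utf8;
}

void main()
{
    // "°F" as stored in a Latin-1 file: bytes 176 (0xB0) and 70 (0x46).
    immutable ubyte[] raw = [0xB0, 0x46];
    string utf8 = latin1ToUtf8(raw);

    // The same text as UTF-8 bytes.
    writeln(cast(immutable(ubyte)[]) utf8); // [194, 176, 70]
}
```

A viewer that decodes those three bytes as Latin-1 would show an extra
character in front of the degree sign.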

So I think what you're trying to do is:

1. Read a Latin-1 file into Unicode (internally in D).
2. Do splitLines(), etc., generating some result.
3. Convert the result back to Latin-1, and output it.

Is that right?

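
If so, the three steps above might look roughly like the sketch below. It
assumes the input is Latin-1, and it uses File.rawWrite for output, since
writeln formats a Latin1String as an array of values (that is where the
"cast(immutable(Latin1Char))176" text in your output comes from). The
function name copyLatin1Lines and the file names are only placeholders.

```d
import std.encoding : Latin1String, transcode;
import std.file : read;
import std.stdio : File;
import std.string : splitLines;

// Hypothetical helper: read a Latin-1 file, process it line by line as
// UTF-8, and write each line back out as Latin-1 bytes.
void copyLatin1Lines(string inPath, string outPath)
{
    // Step 1: read raw bytes and decode them as Latin-1 (an assumption).
    auto latinIn = cast(Latin1String) read(inPath);
    string text;
    transcode(latinIn, text);

    auto fout = File(outPath, "wb");

    // Step 2: work on the UTF-8 string with the usual string functions.
    foreach (line; text.splitLines())
    {
        // Step 3: encode back to Latin-1 and write the raw bytes.
        Latin1String latinOut;
        transcode(line, latinOut);
        fout.rawWrite(latinOut);
        fout.rawWrite("\n"); // note: line endings are normalized to "\n"
    }
}

void main()
{
    copyLatin1Lines("in.txt", "out.txt");
}
```

With this approach, in.txt and out.txt should show the same °F when both
are opened as Latin-1/ANSI.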
Graham