read files ... continued

Daniel Keep daniel.keep.lists at gmail.com
Sun May 13 03:54:48 PDT 2007



Jan Hanselaer wrote:
> Woops ... sent before done writing ... sorry
> 
> Hi
> 
> I'm writing an application that reads all kinds of text files.
> I'm not really familiar with the filetypes.
> For the moment I read them with a BufferedFile.
> I read the lines with readLine()
> 
> Stream br = new BufferedFile(fileName);
> char[] line = br.readLine();
> 
> But that causes a lot of trouble. I managed to figure out how to read a
> file's BOM, so I presume it will also be possible to convert them to a type
> (UTF8 for example) that I always use. (I'm checking that later)
> But for a lot of files when I check the BOM I get result -1 (meaning the 
> type is not known).
> http://www.digitalmars.com/d/phobos/std_stream.html
> The only known BOM types are listed there (UTF8,UTF16,UTF32 LE or BE)
> 
> For a lot of text files on my system (Windows) the type is ANSI, and there's 
> no problem reading them with BufferedFile if there are no special characters 
> in them.
> But if there's an accent or something (for example '�'), then it's an 
> invalid UTF sequence. I cannot convert the text because the BOM for these 
> files is also unknown.
> 
> Does anyone have an idea of how to catch this sort of file (and convert it)? Or 
> is there a stream that takes into account the filetype by itself? Would be 
> very handy ...
> 
> It's an application I wrote in Java that I'm now trying in D. In Java I used a 
> BufferedReader on a FileReader and there all goes well. Sometimes files are 
> not read well, but no faults like this invalid UTF-sequence in D.
> 
> If someone understands my problem out of all this confusing talk (that's 
> because I'm rather confused myself) ... I'd be glad :p
> 
> Thanks!

Basically, the problem is that if the file is in something other than
ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out
what it's supposed to be.
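
To make that concrete, here is a minimal sketch of the check, assuming
the file has already been read into a byte array.  The names Guess,
hasPrefix and guessEncoding are made up for this example and are not
part of Phobos.  It looks for a BOM first and, failing that, tests
whether the bytes happen to be valid UTF-8 (plain ASCII always is).
Everything else comes back as unknown, which is the ANSI case.

```d
import std.utf : validate, UTFException;

// Hypothetical result type for this sketch; not part of Phobos.
enum Guess { utf8, utf16le, utf16be, utf32le, utf32be, unknown }

// Byte order marks of the Unicode encodings.
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
immutable ubyte[] bomUtf16be = [0xFE, 0xFF];

// True if data begins with the given byte sequence.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length
        && data[0 .. prefix.length] == prefix;
}

Guess guessEncoding(const(ubyte)[] raw)
{
    if (hasPrefix(raw, bomUtf8))    return Guess.utf8;
    // UTF-32 LE is tested before UTF-16 LE: its BOM also starts with FF FE.
    if (hasPrefix(raw, bomUtf32le)) return Guess.utf32le;
    if (hasPrefix(raw, bomUtf32be)) return Guess.utf32be;
    if (hasPrefix(raw, bomUtf16le)) return Guess.utf16le;
    if (hasPrefix(raw, bomUtf16be)) return Guess.utf16be;

    // No BOM: accept the data as UTF-8 only if it validates.
    try
    {
        validate(cast(const(char)[]) raw);
        return Guess.utf8;
    }
    catch (UTFException)
    {
        return Guess.unknown;
    }
}

void main()
{
    // "caf" followed by a lone 0xE9 byte, as a single-byte code page
    // would store an accented e.  That is not valid UTF-8.
    assert(guessEncoding([0x63, 0x61, 0x66, 0xE9]) == Guess.unknown);
}
```

Note that "unknown" is all this can say; it gives no hint as to which
code page the bytes actually came from.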

There are various methods for autodetecting the codepage of a piece of
text, but none of them are foolproof.  That's why there isn't a stream to
do this for you; it's a nasty, horrible problem that no one wants to
solve... at least, I know I don't. :)

Incidentally, notice that the example accent you provided didn't show
up; presumably because your mail reader doesn't know how to use
codepages properly. :3

(Checks headers) Outlook Express... why am I not surprised :P

Anyway, if you need to open files that aren't in a usable encoding,
there are a few things you can do:

1. Read the text as ASCII, and discard all characters that lie outside
of the 7-bit range.
2. Add an option somewhere, or perhaps a tag to the file, to indicate
what the code page is.
3. Find and use one of those auto-detection algorithms.
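
Option 1 is the simplest of the three, so here is a rough sketch of it.
It assumes the raw bytes are already in memory, and stripNonAscii is a
hypothetical name, not a library function.

```d
// Hypothetical helper for option 1: keep only bytes in the 7-bit
// ASCII range and silently drop everything else.
string stripNonAscii(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result.idup;
}

void main()
{
    // "caf" followed by 0xE9 (an accented e in a single-byte code
    // page): the accented character is lost.
    assert(stripNonAscii([0x63, 0x61, 0x66, 0xE9]) == "caf");
}
```

The result is always valid UTF-8, but the accented characters are gone
for good, so this only suits cases where losing them is acceptable.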

In any case, you'll need a library for converting between codepages.  I
*think* that either Tango or Mango has one, but I'm not sure.
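
For one code page the conversion is simple enough to sketch by hand.
The example below assumes the files are ISO-8859-1 (Latin-1), where
every byte maps to the Unicode code point with the same value.  That is
an assumption: "ANSI" on a Western European Windows install is usually
Windows-1252, which matches Latin-1 except for the bytes 0x80 to 0x9F.
latin1ToUtf8 is a made-up name, and a proper library would cover far
more code pages than this.

```d
// Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes into UTF-8.
// Assumes the input really is Latin-1; bytes 0x80..0x9F would be
// wrong for Windows-1252 input.
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        // Appending a dchar to a char[] encodes it as UTF-8.
        result ~= cast(dchar) b;
    }
    return result.idup;
}

void main()
{
    // 0xE9 is an accented e in Latin-1 and becomes U+00E9.
    assert(latin1ToUtf8([0x63, 0x61, 0x66, 0xE9]) == "caf\u00E9");
}
```

Unlike discarding the non-ASCII bytes, this keeps the accents, but only
if the guess about the code page is right.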

<shameless-plug>
Also, if you need more clarification on how text in D works, you can
give this a read: http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
</shameless-plug>

Hope this has been of at least some help.

	-- Daniel

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/


More information about the Digitalmars-d-learn mailing list