read files ... continued

Jan Hanselaer jan.hanselaer at gmail.com
Sun May 13 06:04:06 PDT 2007


"Daniel Keep" <daniel.keep.lists at gmail.com> wrote in message 
news:f26qq0$1d16$1 at digitalmars.com...
>
>
> Jan Hanselaer wrote:
>> Woops ... sent before done writing ... sorry
>>
>> Hi
>>
>> I'm writing an application that reads all kinds of text files.
>> I'm not really familiar with the filetypes.
>> For the moment I read them with a BufferedFile.
>> I read the lines with readLine()
>>
>> Stream br = new BufferedFile(fileName);
>> char[] line = br.readLine();
>>
>> But that causes a lot of trouble. I managed to figure out how to read a
>> file's BOM, so I presume it will also be possible to convert files to an
>> encoding (UTF-8, for example) that I always use. (I'll check that later.)
>> But for a lot of files, checking the BOM gives the result -1 (meaning the
>> type is not known).
>> http://www.digitalmars.com/d/phobos/std_stream.html
>> The only known BOM types are listed there (UTF8,UTF16,UTF32 LE or BE)
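For reference, the BOM check described above can be sketched by hand on the raw bytes of a file. This is only an illustration, not what std.stream does internally: the names Bom, detectBom and hasPrefix are made up for this example, and it is written against a current D compiler rather than the 2007 Phobos. A file saved as ANSI has none of these prefixes, which is why the check comes back as "unknown".

```d
/// Hypothetical result type for this sketch (not the std.stream BOM enum).
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

private static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
private static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
private static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
private static immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
private static immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];

/// True if `data` begins with the bytes in `prefix`.
private bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Returns the BOM found at the start of `data`, or Bom.none if there is none.
Bom detectBom(const(ubyte)[] data)
{
    // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
    if (hasPrefix(data, bomUtf32le)) return Bom.utf32le;
    if (hasPrefix(data, bomUtf32be)) return Bom.utf32be;
    if (hasPrefix(data, bomUtf8))    return Bom.utf8;
    if (hasPrefix(data, bomUtf16le)) return Bom.utf16le;
    if (hasPrefix(data, bomUtf16be)) return Bom.utf16be;
    return Bom.none;
}
```

Note that Bom.none does not say what the encoding is; it only says the file did not announce one.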
>>
>> For a lot of text files on my system (Windows) the type is ANSI, and
>> there's no problem reading them with BufferedFile as long as they contain
>> no special characters.
>> But if there's an accent or something (for example '?'), then it's an
>> invalid UTF sequence. I cannot convert the text because the BOM for these
>> files is also unknown.
>>
>> Does anyone have an idea of how to catch these kinds of files (and
>> convert them)? Or is there a stream that takes the file type into account
>> by itself? That would be very handy ...
>>
>> It's an application I wrote in Java that I'm now trying in D. In Java I
>> used a BufferedReader on a FileReader and everything went well there.
>> Sometimes files are not read correctly, but there are no errors like this
>> invalid UTF sequence in D.
>>
>> If someone understands my problem out of all this confusing talk (that's
>> because I'm rather confused myself) ... I'd be glad :p
>>
>> Thanks!
>
> Basically, the problem is that if the file is in something other than
> ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out
> what it's supposed to be.
>
> There are various methods for autodetecting the codepage of a piece of
> text, but none of them are foolproof.  Hence why there isn't a stream to
> do this for you; it's a nasty, horrible problem that no one wants to
> solve... at least, I know I don't. :)
>
> Incidentally, notice that the example accent you provided didn't show
> up; presumably because your mail reader doesn't know how to use
> codepages properly. :3
>
> (Checks headers) Outlook Express... why am I not surprised :P
>
> Anyway, if you need to open files that aren't in a usable encoding,
> there are a few things you can do:
>
> 1. Read the text as ASCII, and discard all characters that lie outside
> of the 7-bit range.
> 2. Add an option somewhere, or perhaps a tag to the file, to indicate
> what the code page is.
> 3. Find and use one of those auto-detection algorithms.
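Option 1 is the simplest of the three to try. A minimal sketch, assuming the file contents have already been read into a byte array (stripNonAscii is a name invented for this example):

```d
/// Keeps only bytes in the 7-bit ASCII range; every other byte is dropped.
/// The result is always valid UTF-8, because ASCII is a subset of UTF-8.
string stripNonAscii(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result.idup;
}
```

The obvious cost is that accented letters simply disappear, so a word ending in an accented 'e' loses its last letter.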
>
> In any case, you'll need a library for converting between codepages.  I
> *think* that either Tango or Mango has one, but I'm not sure.
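For the special case of ISO-8859-1 (Latin-1), the conversion can be done by hand without a library, because every Latin-1 byte value equals its Unicode code point. The sketch below assumes exactly that encoding and is written against a current D compiler; latin1ToUtf8 and decodeText are hypothetical names. Windows "ANSI" files are usually Windows-1252, which differs from Latin-1 in the range 0x80 to 0x9F, so those bytes would need a lookup table that is not shown here.

```d
import std.utf : validate, UTFException;

/// Re-encodes ISO-8859-1 bytes as UTF-8 (assumption: input really is Latin-1).
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;               // ASCII is copied unchanged
        }
        else
        {
            // Code points 0x80 .. 0xFF take two bytes in UTF-8.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

/// Uses the bytes as UTF-8 if they validate, otherwise falls back to Latin-1.
string decodeText(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text);                           // throws on an invalid UTF-8 sequence
        return text.idup;
    }
    catch (UTFException e)
    {
        return latin1ToUtf8(raw);
    }
}
```

The fallback is a guess, not a detection: a file in some other code page will also fail validation and will then be decoded as Latin-1 with the wrong characters.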
>
> <shameless-plug>
> Also, if you need more clarification on how text in D works, you can
> give this a read: 
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
> </shameless-plug>
>
> Hope this has been of at least some help.

Yes, thanks a lot. At least now I understand it better. But it's a pity there 
isn't a stream that does all the work.
It's going to be very difficult to read the different files properly.

>
> -- Daniel
>
> -- 
> int getRandomNumber()
> {
>    return 4; // chosen by fair dice roll.
>              // guaranteed to be random.
> }
>
> http://xkcd.com/
>
> v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
> i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/ 
