read files ... continued

Sun May 13 06:17:11 PDT 2007

Jan Hanselaer wrote:

> 
> "Daniel Keep" <daniel.keep.lists at gmail.com> wrote in message
> news:f26qq0$1d16$1 at digitalmars.com...
>>
>>
>> Jan Hanselaer wrote:
>>> Woops ... sent before done writing ... sorry
>>>
>>> Hi
>>>
>>> I'm writing an application that reads all kinds of text files.
>>> I'm not really familiar with the filetypes.
>>> For the moment I read them with a BufferedFile.
>>> I read the lines with readLine()
>>>
>>> Stream br = new BufferedFile(fileName);
>>> char[] line = br.readLine();
>>>
>>> But that causes a lot of trouble. I managed to figure out how to read a
>>> file's BOM, so I presume it will also be possible to convert files to
>>> one encoding (UTF-8, for example) that I always use. (I'll check that
>>> later.)
>>> But for a lot of files when I check the BOM I get result -1 (meaning the
>>> type is not known).
>>> http://www.digitalmars.com/d/phobos/std_stream.html
>>> The only known BOM types are listed there (UTF-8, UTF-16 and UTF-32, LE or BE)
>>>
>>> For a lot of text files on my system (Windows) the type is ANSI, and
>>> there's no problem reading them with BufferedFile as long as they
>>> contain no special characters.
>>> But if there's an accent or something (for example '?'), then it's an
>>> invalid UTF sequence. I cannot convert the text because the BOM for
>>> these files is also unknown.
>>>
>>> Does anyone have an idea how to catch this sort of file (and convert
>>> it)? Or is there a stream that takes the file type into account by
>>> itself? That would be very handy ...
>>>
>>> It's an application I wrote in Java that I'm now trying in D. In Java I
>>> used a BufferedReader on a FileReader, and there all goes well.
>>> Sometimes files are not read correctly, but there are no errors like
>>> this invalid UTF sequence in D.
>>>
>>> If someone understands my problem out of all this confusing talk (that's
>>> because I'm rather confused myself) ... I'd be glad :p
>>>
>>> Thanks!
>>
>> Basically, the problem is that if the file is in something other than
>> ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out
>> what it's supposed to be.
>>
>> There are various methods for autodetecting the codepage of a piece of
>> text, but none of them are foolproof.  Hence why there isn't a stream to
>> do this for you; it's a nasty, horrible problem that no one wants to
>> solve... at least, I know I don't. :)
>>
>> Incidentally, notice that the example accent you provided didn't show
>> up; presumably because your mail reader doesn't know how to use
>> codepages properly. :3
>>
>> (Checks headers) Outlook Express... why am I not surprised :P
>>
>> Anyway, if you need to open files that aren't in a usable encoding,
>> there are a few things you can do:
>>
>> 1. Read the text as ASCII, and discard all characters that lie outside
>> of the 7-bit range.
>> 2. Add an option somewhere, or perhaps a tag to the file, to indicate
>> what the code page is.
>> 3. Find and use one of those auto-detection algorithms.
>>
>> In any case, you'll need a library for converting between codepages.  I
>> *think* that either Tango or Mango has one, but I'm not sure.
>>
>> <shameless-plug>
>> Also, if you need more clarification on how text in D works, you can
>> give this a read:
>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>> </shameless-plug>
>>
>> Hope this has been of at least some help.
> 
> Yes, thanks a lot. At least now I understand it more. But it's a pity
> there isn't a stream doing all the work.
> It's going to be very difficult to read the different files in a proper
> way.

Currently Mango has bindings for IBM's ICU library, which may be the most
comprehensive solution for this type of text handling.
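
If pulling in ICU is more than you need, the "guess, then fall back" idea
from Daniel's list can be sketched with Phobos alone. This is only a rough
illustration and rests on several assumptions: it uses std.encoding and
std.utf from current (D2) Phobos rather than the std.stream API discussed
above, it assumes that anything which is not valid UTF-8 is Windows-1252
(the usual meaning of "ANSI" on Western European Windows), it ignores
UTF-16 and UTF-32 entirely, and the function name toUtf8 is made up for
this example. As Daniel says, such a guess is not foolproof: some
Windows-1252 byte sequences also happen to be valid UTF-8.

```d
import std.encoding : transcode, Windows1252String;
import std.utf : validate, UTFException;

/// Hypothetical helper: turn raw file bytes into a UTF-8 string.
/// Assumption: input that is not valid UTF-8 is Windows-1252.
string toUtf8(immutable(ubyte)[] raw)
{
    // Skip a UTF-8 byte order mark if the file starts with one.
    if (raw.length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF)
        raw = raw[3 .. $];

    auto text = cast(string) raw;
    try
    {
        // Throws UTFException on an invalid UTF-8 sequence.
        validate(text);
        return text;
    }
    catch (UTFException)
    {
        // Fallback: reinterpret the bytes as Windows-1252 and convert.
        string converted;
        transcode(cast(Windows1252String) raw, converted);
        return converted;
    }
}

void main()
{
    import std.stdio : writeln;

    // "cafe" with an accented e, stored as Windows-1252 (single byte 0xE9).
    immutable(ubyte)[] ansi = [0x63, 0x61, 0x66, 0xE9];
    writeln(toUtf8(ansi));
}
```

To use it on a real file, the bytes could come from std.file.read, for
example toUtf8(cast(immutable(ubyte)[]) read(fileName)). A user-supplied
option naming the code page (Daniel's second suggestion) would be more
reliable than the hard-coded Windows-1252 fallback.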

--- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango