invalid utf-8 sequence

Wed Jan 7 13:44:48 PST 2009

james wrote:
> Jarrett Billingsley Wrote:
> 
>> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4 at gmail.com> wrote:
>>> I'm writing an indexer, but I'm having a problem: on some files, reading gives this error:
>>>
>>> Error 4: invalid UTF-8 sequence
>>>
>>> Is there a way to fix it?
>>
>> You're probably reading a file that's encoded in some non-Unicode
>> encoding, like Latin-1.  You could read in the file data as byte[]
>> instead of as char[], but that still doesn't deal with the problem
>> that you have characters in your file that are outside the ASCII
>> range.  If you know what encoding your file uses, you could do some
>> transformations on it to turn it into valid Unicode, or you could just
>> ignore characters outside the ASCII range :P
> 
> Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

You mean one that tries to work out what character set a file is in and
then translates it?

(What is the current state of the art of character set detection 
heuristics?)
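
Short of real detection, the two options Jarrett mentions (transcode from a
known encoding, or drop everything outside ASCII) can be sketched roughly as
below. This is Python rather than D, purely for illustration; the function
names to_text and ascii_only are made up for this example, and the choice of
Latin-1 as the fallback is an assumption, not something detected from the
file. Latin-1 maps every byte to a code point, so that fallback never fails,
but it will produce wrong characters if the file was really in some other
encoding.

```python
def to_text(data, fallback="latin-1"):
    """Decode raw bytes, trying strict UTF-8 first.

    Returns (text, encoding_used). The fallback encoding is an assumption
    supplied by the caller; nothing here detects the real charset.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: reinterpret the bytes using the assumed encoding.
        return data.decode(fallback), fallback


def ascii_only(data):
    """The blunt option: silently drop every byte outside the ASCII range."""
    return data.decode("ascii", errors="ignore")


if __name__ == "__main__":
    raw = b"caf\xe9"  # "cafe" with an e-acute, encoded as Latin-1
    text, used = to_text(raw)
    print(used, text.encode("utf-8"))  # re-encode as valid UTF-8 for the indexer
    print(ascii_only(raw))
```

The point of reading the file as raw bytes first is that nothing is validated
until you pick an encoding, so the indexer can decide per file what to do
instead of failing on the first bad sequence.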

Stewart.