print non-ASCII/UTF-8 string

Pragma ericanderton at yahoo.removeme.com
Fri Dec 22 07:07:19 PST 2006


Egor Starostin wrote:
> Let's say that file q.txt contains some characters bigger than 0x7f (for
> example, from windows-1252 encoding).
> In such case the following snippet:
> ***
> import std.stream;
> import std.stdio;
> void main() {
>   Stream f = new BufferedFile("q.txt");
>   foreach (char[] l; f) {
>     writefln(l);
>   }
> }
> ***
> will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
> UTF-8, right?
> 
> My question is: is there any way to print out non-UTF-8 data exactly in the
> same encoding (which may be unknown) as in original file?

It's funny that you should bring this up now.  I had a thread over in 
d.D.learn regarding this very thing.  The following should help you get 
started:

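// Convert Latin-1 (ISO-8859-1) bytes to UTF-8. Each Latin-1 byte value equals
// its Unicode code point, so bytes >= 0x80 become a two-byte sequence.
char[] Latin1ToUTF8(char[] value){
     char[] result;
     for(size_t i=0; i<value.length; i++){
         char ch = value[i];
         if(ch < 0x80){
             // ASCII passes through unchanged
             result ~= ch;
         }
         else{
             // lead byte 110000xx carries the top two bits
             result ~= cast(char)(0xC0 | (ch >> 6));
             // continuation byte 10xxxxxx carries the low six bits
             result ~= cast(char)(0x80 | (ch & 0x3F));
         }
     }
     return result;
}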

(this could be optimized to use fewer concatenations, but I think it 
gets the point across)
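
As a rough sketch of what "fewer concatenations" could look like: count the bytes that will expand, allocate once, then fill the buffer. This assumes a current D compiler (the `const(ubyte)[]` syntax is D2), and the name latin1ToUTF8Prealloc is made up for illustration.

```d
// Sketch only: same Latin-1 -> UTF-8 transform, with a single allocation.
// Takes raw bytes, since Latin-1 input is not valid UTF-8 text.
char[] latin1ToUTF8Prealloc(const(ubyte)[] value)
{
    // First pass: every byte >= 0x80 grows into two UTF-8 bytes.
    size_t extra = 0;
    foreach (b; value)
        if (b >= 0x80)
            extra++;

    // Second pass: write into a buffer of exactly the right size.
    char[] result = new char[value.length + extra];
    size_t j = 0;
    foreach (b; value)
    {
        if (b < 0x80)
        {
            result[j++] = cast(char) b;
        }
        else
        {
            result[j++] = cast(char)(0xC0 | (b >> 6));
            result[j++] = cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}
```

The two passes trade a little extra reading for not reallocating as the result grows.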

I have no clue how to work from other code pages, as I gather the 
transform would be far less straightforward than it is for Latin-1.
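
For what it's worth, windows-1252 (the encoding in the original question) differs from Latin-1 only in bytes 0x80..0x9F, so one possible approach is a 32-entry lookup table for that range. The sketch below assumes a current D compiler and std.utf.encode from Phobos; the names cp1252High and windows1252ToUTF8 are invented here. The five bytes the code page leaves undefined are mapped to the control character with the same value, which is only one of several conventions.

```d
import std.utf : encode;

// Code points for Windows-1252 bytes 0x80..0x9F.
// Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the C1 control
// with the same value; that choice is an assumption of this sketch.
immutable dchar[32] cp1252High = [
    '\u20AC', '\u0081', '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
    '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\u008D', '\u017D', '\u008F',
    '\u0090', '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
    '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\u009D', '\u017E', '\u0178',
];

char[] windows1252ToUTF8(const(ubyte)[] value)
{
    char[] result;
    foreach (b; value)
    {
        // Outside 0x80..0x9F the byte value is the code point, as in Latin-1.
        dchar c = cast(dchar) b;
        if (b >= 0x80 && b <= 0x9F)
            c = cp1252High[b - 0x80];
        // Append the UTF-8 encoding of c to result.
        encode(result, c);
    }
    return result;
}
```

Other single-byte code pages could follow the same pattern with a different table; multi-byte encodings would need more than this.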

Also, I have no idea how to *detect* what code page is being used 
based on the input set.  I don't even know if that's possible.  Like 
you, I'd love to hear about it should someone else know of an algorithm.
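
One partial heuristic, offered as a sketch rather than an answer: bytes that happen to be valid UTF-8 are probably UTF-8, and anything else has to be treated as some legacy code page. It cannot tell one single-byte code page from another. The example assumes a current D compiler and std.utf.validate from Phobos; looksLikeUTF8 is a made-up name.

```d
import std.utf : validate, UTFException;

// Returns true if the bytes form well-formed UTF-8 (plain ASCII included).
// A false result only says "not UTF-8"; it does not identify the code page.
bool looksLikeUTF8(const(ubyte)[] data)
{
    try
    {
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A caller could fall back to a Latin-1 style conversion when this returns false, keeping in mind that this is a guess about the input rather than a detection.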

-- 
- EricAnderton at yahoo


