print non-ASCII/UTF-8 string
Pragma
ericanderton at yahoo.removeme.com
Fri Dec 22 07:07:19 PST 2006
Egor Starostin wrote:
> Let's say that file q.txt contains some characters bigger than 0x7f (for
> example, from windows-1252 encoding).
> In such case the following snippet:
> ***
> import std.stream;
> import std.stdio;
> void main() {
>     Stream f = new BufferedFile("q.txt");
>     foreach (char[] l; f) {
>         writefln(l);
>     }
> }
> ***
> will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
> UTF-8, right?
>
> My question is: is there any way to print out non-UTF-8 data exactly in the
> same encoding (which may be unknown) as in original file?
It's funny that you should bring this up now. I had a thread over in
d.D.learn regarding this very thing. The following should help you get
started:
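char[] Latin1ToUTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            // ASCII bytes map to themselves.
            result ~= ch;
        }
        else{
            // Bytes 0x80..0xFF become a two-byte UTF-8 sequence.
            result ~= cast(char)(0xC0 | (ch >> 6));
            result ~= cast(char)(0x80 | (ch & 0x3F));
        }
    }
    return result;
}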
(this could be optimized to use fewer concatenations, but I think it
gets the point across)
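For what it's worth, here is a rough sketch of the fewer-concatenations idea: allocate the worst-case size once, fill it by index, and slice off the unused tail. The name latin1ToUtf8Prealloc and the const(ubyte)[] parameter are my own choices for illustration, and the sketch assumes a current D compiler rather than the one I used above.

```d
// Sketch only: latin1ToUtf8Prealloc is a hypothetical name.
char[] latin1ToUtf8Prealloc(const(ubyte)[] value)
{
    // Worst case: every input byte expands to two UTF-8 bytes.
    auto result = new char[](value.length * 2);
    size_t n = 0;
    foreach (b; value)
    {
        if (b < 0x80)
        {
            // ASCII bytes map to themselves.
            result[n++] = cast(char) b;
        }
        else
        {
            // Bytes 0x80..0xFF become a two-byte UTF-8 sequence.
            result[n++] = cast(char)(0xC0 | (b >> 6));
            result[n++] = cast(char)(0x80 | (b & 0x3F));
        }
    }
    // Return only the part that was actually written.
    return result[0 .. n];
}
```

The trade-off is a single allocation of up to twice the input size in place of repeated appends.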
I have no clue how to work from other code pages, as I gather the
transform would be far less straightforward than it is for Latin-1.
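That said, windows-1252 happens to differ from Latin-1 only in the bytes 0x80..0x9F, so a small lookup table might be enough for that one case. The following is a sketch under assumptions, not a tested converter: cp1252High and cp1252ToUtf8 are hypothetical names, mapping the five undefined bytes to U+FFFD is my own policy choice, and it relies on std.utf.encode from a current Phobos.

```d
import std.utf : encode;

// Code points for Windows-1252 bytes 0x80..0x9F. The five bytes that the
// code page leaves undefined are mapped to U+FFFD (an assumed policy).
immutable dchar[32] cp1252High = [
    '\u20AC', '\uFFFD', '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
    '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\uFFFD', '\u017D', '\uFFFD',
    '\uFFFD', '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
    '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\uFFFD', '\u017E', '\u0178',
];

// Hypothetical helper: converts Windows-1252 bytes to UTF-8.
char[] cp1252ToUtf8(const(ubyte)[] value)
{
    char[] result;
    foreach (b; value)
    {
        // Outside 0x80..0x9F the byte value equals the code point.
        dchar c = b;
        if (b >= 0x80 && b <= 0x9F)
            c = cp1252High[b - 0x80];
        // Append the UTF-8 encoding of c to result.
        encode(result, c);
    }
    return result;
}
```

Other single-byte code pages would need their own tables, and multi-byte ones would need more than a table.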
Also, I have no idea how to *detect* which code page is being used
based on the input. I don't even know if that's possible; like you,
I'd love to hear about it should someone else know of an algorithm.
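The one check I can imagine is much weaker than real detection: test whether the bytes are already valid UTF-8, and treat them as some single-byte code page only if they are not. This is a sketch under that assumption. It does not say *which* code page the data is in, looksLikeUtf8 is a hypothetical name, and it relies on std.utf.validate from a current Phobos.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if data is well-formed UTF-8 (pure ASCII
// counts as valid). It cannot identify any particular code page.
bool looksLikeUtf8(const(ubyte)[] data)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A caller could pass the data through unchanged when this returns true and fall back to a Latin-1 style conversion otherwise, with the fallback encoding still being a guess.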
--
- EricAnderton at yahoo