regarding Latin1 to UTF8 encoding

Adam D. Ruppe destructionator at gmail.com
Sun Dec 8 19:19:51 PST 2013


On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
> Is there a way to detect the encoding prior to 
> typecasting/loading the file?

UTF-8 can be detected fairly reliably, but not much luck for 
other encodings. A Windows-1258 and a Latin1 file, for example, 
are usually fairly indistinguishable from a binary perspective - 
they use the same numbers, just for different things.

(It is possible to distinguish them if you use some context and 
grammar check kind of things, but that's not easy.)
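
To make the "same numbers, different things" point concrete, here is a small sketch using std.encoding. It assumes Windows-1252 as a stand-in for Windows-1258, since Windows1252String is the type I'm sure Phobos provides; the helper names fromLatin1 and fromWindows1252 are made up for illustration.

```d
import std.encoding : Latin1String, Windows1252String, transcode;

// Hypothetical helper: interpret the bytes as Latin1 and convert to UTF-8.
string fromLatin1(immutable(ubyte)[] raw) {
    string result;
    transcode(cast(Latin1String) raw, result);
    return result;
}

// Hypothetical helper: interpret the same bytes as Windows-1252 instead.
string fromWindows1252(immutable(ubyte)[] raw) {
    string result;
    transcode(cast(Windows1252String) raw, result);
    return result;
}

void main() {
    // One byte, two meanings: nothing in the data says which one is right.
    immutable(ubyte)[] raw = [0x80];
    assert(fromLatin1(raw) == "\u0080");      // a C1 control character
    assert(fromWindows1252(raw) == "\u20AC"); // the euro sign
}
```

Both conversions succeed, which is exactly why you can't pick between them by looking at the bytes alone.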


But UTF-8 has a neat feature: any non-ASCII bytes have to form 
valid multi-byte sequences, and it is unlikely that random data 
(or text in some other 8-bit encoding) would validate correctly.

std.utf.validate can do that (though it throws an exception if it 
fails, ugh!)
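
If the exception bothers you, one option is to hide it behind a bool. This is only a sketch: looksLikeUtf8 is a name I made up, not something in Phobos, and it assumes the data is available as a ubyte slice.

```d
import std.utf : validate, UTFException;

// Hypothetical wrapper: true if the bytes are well-formed UTF-8.
bool looksLikeUtf8(const(ubyte)[] raw) {
    try {
        validate(cast(const(char)[]) raw);
        return true;
    } catch (UTFException) {
        return false;
    }
}

void main() {
    // "hé" in UTF-8: 0xC3 0xA9 is a valid two-byte sequence.
    assert(looksLikeUtf8([0x68, 0xC3, 0xA9]));
    // "hél" in Latin1: 0xE9 starts a multi-byte sequence that never completes.
    assert(!looksLikeUtf8([0x68, 0xE9, 0x6C]));
    // Plain ASCII always validates, so it is treated as UTF-8 too.
    assert(looksLikeUtf8([0x68, 0x69]));
}
```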

So here's how I did it in my own characterencodings.d:

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138


         import std.encoding; // transcode, Latin1String
         import std.utf;      // validate, UTFException

         // rawdata is assumed here to hold the file's raw bytes
         string utf8string;
         try {
                 validate!string(cast(string) rawdata);
                 // validation passed, assume it is UTF-8 and use it
                 utf8string = cast(string) rawdata;
         } catch(UTFException t) {
                 // not utf-8, try latin1
                 transcode(cast(Latin1String) rawdata, utf8string);
         }

         // now go ahead and use utf8 string, it should be set
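
For something you can compile on its own, the same idea can be wrapped in a function. Again just a sketch: toUtf8 is a hypothetical name, and it assumes the input is immutable(ubyte)[] that is either UTF-8 or Latin1.

```d
import std.encoding : Latin1String, transcode;
import std.utf : validate, UTFException;

// Hypothetical helper: return UTF-8 text from bytes that are UTF-8 or Latin1.
string toUtf8(immutable(ubyte)[] raw) {
    try {
        validate(cast(string) raw);
        return cast(string) raw; // already UTF-8, reuse the bytes as-is
    } catch (UTFException) {
        // not UTF-8, so treat every byte as a Latin1 character
        string converted;
        transcode(cast(Latin1String) raw, converted);
        return converted;
    }
}

void main() {
    // "café" encoded as Latin1 and as UTF-8
    immutable(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    immutable(ubyte)[] utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    assert(toUtf8(latin1) == "caf\u00E9");
    assert(toUtf8(utf8) == "caf\u00E9");
}
```

Keep in mind the Latin1 branch never fails, so anything that isn't valid UTF-8 comes out as Latin1 whether or not that was the real encoding.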

