regarding Latin1 to UTF8 encoding

Hugo Florentino hugo at acdam.cu
Sun Dec 8 19:33:36 PST 2013


On Mon, 09 Dec 2013 04:19:51 +0100, Adam D. Ruppe wrote:
> On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
>> Is there a way to detect the encoding prior to typecasting/loading 
>> the file?
>
> UTF-8 can be detected fairly reliably, but not much luck for other
> encodings. A Windows-1258 and a Latin1 file, for example, are usually
> fairly indistinguishable from a binary perspective - they use the same
> numbers, just for different things.
>
> (It is possible to distinguish them if you use some context and
> grammar check kind of things, but that's not easy.)
>
>
> But utf-8 has a neat feature: any non-ascii stuff needs to validate,
> and it is unlikely that random data would correctly validate.
>
> std.utf.validate can do that (though it throws an exception if it
> fails, ugh!)
>
> So here's how I did it in my own characterencodings.d:
>
> https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138
>
>
>         import std.encoding; // transcode, Latin1String
>         import std.utf;      // validate, UTFException
>         string utf8string;
>         try {
>                 validate!string(cast(string) rawdata);
>                 // validation passed, assume it is UTF-8 and use it
>                 utf8string = cast(string) rawdata;
>         } catch(UTFException t) {
>                 // not utf-8, try latin1
>                 transcode(cast(Latin1String) rawdata, utf8string);
>         }
>
>         // now go ahead and use utf8 string, it should be set

Clever solution, thanks.
Could this work using scope instead of try/catch?

P.S. Nice unit, by the way.
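
For reference, here is a minimal sketch of the quoted approach wrapped in a function, so the fallback can be tried on sample bytes. The name toUtf8 and the immutable(ubyte)[] parameter type are assumptions made for illustration and are not taken from characterencodings.d. It keeps try/catch because, as far as I understand, scope(failure) only runs a statement while the exception unwinds and does not catch it, so the UTFException would still propagate to the caller.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper (not from characterencodings.d): returns the bytes
/// as a UTF-8 string, falling back to Latin-1 when UTF-8 validation fails.
string toUtf8(immutable(ubyte)[] rawdata)
{
    try
    {
        // validate throws UTFException on malformed UTF-8
        validate(cast(string) rawdata);
        // validation passed, assume it is UTF-8 and use it as-is
        return cast(string) rawdata;
    }
    catch (UTFException e)
    {
        // not UTF-8: reinterpret the bytes as Latin-1 and transcode
        string result;
        transcode(cast(Latin1String) rawdata, result);
        return result;
    }
}

void main()
{
    // "café" as UTF-8 (é = C3 A9) and as Latin-1 (é = E9)
    immutable(ubyte)[] utf8Bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    immutable(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];

    assert(toUtf8(utf8Bytes) == "café");
    assert(toUtf8(latin1Bytes) == "café");
}
```

Note that pure ASCII input validates as UTF-8 and is returned unchanged, which is also correct for Latin-1, since the two encodings agree on that range.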


More information about the Digitalmars-d-learn mailing list