regarding Latin1 to UTF8 encoding
Hugo Florentino
hugo at acdam.cu
Sun Dec 8 19:33:36 PST 2013
On Mon, 09 Dec 2013 04:19:51 +0100, Adam D. Ruppe wrote:
> On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
>> Is there a way to detect the encoding prior to typecasting/loading
>> the file?
>
> UTF-8 can be detected fairly reliably, but not much luck for other
> encodings. A Windows-1258 and a Latin1 file, for example, are usually
> fairly indistinguishable from a binary perspective - they use the same
> numbers, just for different things.
>
> (It is possible to distinguish them if you use some context and
> grammar check kind of things, but that's not easy.)
>
>
> But utf-8 has a neat feature: any non-ascii stuff needs to validate,
> and it is unlikely that random data would correctly validate.
>
> std.utf.validate can do that (though it throws an exception if it
> fails, ugh!)
>
> So here's how I did it in my own characterencodings.d:
>
>
> https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138
>
>
> string utf8string;
> import std.utf;      // validate, UTFException
> import std.encoding; // transcode, Latin1String
> try {
>     validate!string(cast(string) rawdata);
>     // validation passed, assume it is UTF-8 and use it
>     utf8string = cast(string) rawdata;
> } catch(UTFException t) {
>     // not utf-8, try latin1
>     transcode(cast(Latin1String) rawdata, utf8string);
> }
>
> // now go ahead and use utf8 string, it should be set
Clever solution, thanks.
Could this work using scope instead of try/catch?
P.S. Nice unit, by the way.
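
For reference, the validate-then-fall-back idea quoted above can be wrapped in
a small function. This is only a minimal sketch: the name toUtf8 and the
immutable(ubyte)[] parameter are assumptions made for illustration, not part
of characterencodings.d, and anything that fails UTF-8 validation is simply
assumed to be Latin-1.

```d
import std.encoding : Latin1String, transcode;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns the bytes as a UTF-8 string.
/// Input that is not valid UTF-8 is assumed to be Latin-1.
string toUtf8(immutable(ubyte)[] rawdata)
{
    try
    {
        auto candidate = cast(string) rawdata;
        validate(candidate); // throws UTFException on invalid UTF-8
        return candidate;    // validation passed, use the bytes as-is
    }
    catch (UTFException)
    {
        // Not UTF-8: reinterpret the bytes as Latin-1 and convert.
        string result;
        transcode(cast(Latin1String) rawdata, result);
        return result;
    }
}
```

On the scope question: a scope(failure) block runs when the exception leaves
the scope, but the exception keeps propagating afterwards, so by itself it
would not supply the Latin-1 fallback; the sketch therefore keeps try/catch.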