Transparent ANSI to UTF-8 conversion

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Feb 27 09:32:51 PST 2013


On 27-Feb-2013 16:20, monarch_dodra wrote:
> On Wednesday, 27 February 2013 at 10:56:16 UTC, Lubos Pintes wrote:
>> Hi,
>> I would like to transparently convert from ANSI to UTF-8 when dealing
>> with text files. For example here in Slovakia, virtually every text
>> file is in Windows-1250.
>> Someone who opens a text file expects it to work properly, so I
>> suppose it is not feasible to tell users "if you want to use my
>> program, please convert every text to UTF-8".
>>
>> Obtaining the mapping from ANSI to Unicode for a particular code page
>> is trivial. Maybe even MultiByteToWideChar could help with this.
>>
>> However, I need to know how to do it the "D way". Could I define
>> something like a TextReader class? Or perhaps some support already
>> exists somewhere?
>> Thanks
>
> I'd say the D way would be to simply exploit the fact that UTF is built
> into the language, and as such, not worry about encoding, and use raw
> code points.
>
> You get your "code page to Unicode *code point*" table, and then you
> simply map each character to a dchar. From there, D will itself convert
> your raw Unicode (aka UTF-32) to UTF-8 on the fly, when you need it. For
> example, writing to a file will automatically convert input to UTF-8.
> You can also simply use std.conv.to!string to convert any UTF scheme to
> UTF-8 (or any other UTF too for that matter).
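
The approach quoted above could be sketched roughly as follows. This is
only an illustration: the names cp1250ToDchar and cp1250ToUtf8 are
hypothetical, and the table covers just a handful of Windows-1250 bytes.
A real program would fill in all 128 high entries from the code page
definition. The sketch uses std.array.appender, which encodes each dchar
as UTF-8 when appending to a string.

```d
import std.array : appender;

/// Maps one Windows-1250 byte to a Unicode code point.
/// Assumption: only a few high bytes are listed here for illustration;
/// everything else in the high range falls back to U+FFFD.
dchar cp1250ToDchar(ubyte b)
{
    if (b < 0x80)
        return cast(dchar) b; // the ASCII range is identical in both encodings

    switch (b)
    {
        case 0x8A: return cast(dchar) 0x0160; // Š
        case 0x9A: return cast(dchar) 0x0161; // š
        case 0xC8: return cast(dchar) 0x010C; // Č
        case 0xE8: return cast(dchar) 0x010D; // č
        case 0xE1: return cast(dchar) 0x00E1; // á
        default:   return cast(dchar) 0xFFFD; // replacement character
    }
}

/// Converts a buffer of Windows-1250 bytes to a UTF-8 string.
string cp1250ToUtf8(const(ubyte)[] raw)
{
    auto result = appender!string();
    foreach (b; raw)
        result.put(cp1250ToDchar(b)); // the appender encodes the dchar as UTF-8
    return result.data;
}

unittest
{
    immutable ubyte[] raw = [0xC8, 0x61, 0x9A]; // "Čaš" in Windows-1250
    assert(cp1250ToUtf8(raw) == "Čaš");
}
```

The only code-page-specific part is the byte to dchar mapping; the UTF-8
encoding itself is left to the library.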

A table that translates ANSI directly to UTF-8 can be trivially
constructed with CTFE from the static one that maps ANSI -> dchar.
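
A minimal sketch of that idea, under stated assumptions: the names
makeCodePointTable, makeUtf8Table, utf8Table and transcodeWithTable are
made up for illustration, and the ANSI -> dchar table is only partially
filled in (a few Windows-1250 entries, the rest mapped to U+FFFD). Both
tables are initializers of module-level immutable data, so the compiler
evaluates the functions at compile time.

```d
import std.array : appender;
import std.utf : encode;

/// Builds the ANSI -> dchar table.
/// Assumption: only a few Windows-1250 entries are filled in here.
dchar[256] makeCodePointTable()
{
    dchar[256] t;
    foreach (i; 0 .. 256)
        t[i] = i < 0x80 ? cast(dchar) i : cast(dchar) 0xFFFD;
    t[0x8A] = cast(dchar) 0x0160; // Š
    t[0x9A] = cast(dchar) 0x0161; // š
    t[0xC8] = cast(dchar) 0x010C; // Č
    t[0xE8] = cast(dchar) 0x010D; // č
    return t;
}

/// Derives the ANSI -> UTF-8 table from the ANSI -> dchar one.
string[256] makeUtf8Table(const dchar[256] codePoints)
{
    string[256] t;
    foreach (i, cp; codePoints)
    {
        char[4] buf;
        immutable len = encode(buf, cp); // UTF-8 encode one code point
        t[i] = buf[0 .. len].idup;
    }
    return t;
}

// Module-level initializers are evaluated at compile time (CTFE).
immutable dchar[256] codePointTable = makeCodePointTable();
immutable string[256] utf8Table = makeUtf8Table(codePointTable);

// Checked by the compiler, so the table really exists at compile time.
static assert(utf8Table[0x41] == "A");
static assert(utf8Table[0x8A] == "Š");

/// Transcodes a byte buffer with one table lookup per input byte.
string transcodeWithTable(const(ubyte)[] raw)
{
    auto result = appender!string();
    foreach (b; raw)
    {
        string encoded = utf8Table[b];
        result.put(encoded);
    }
    return result.data;
}
```

With this layout the run-time loop does no encoding work at all; it only
copies the precomputed one to three byte sequences.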
>
> This may not be as efficient as a "true" "codepage to UTF8 table" but:
> 1) Given you'll most probably be IO bound anyways, who cares?

With in-memory transcoding you won't be. Text editors are typically all 
in-memory or mmap-ed.

> 2) Scalability. D does everything but the code page to code point
> mapping. Why bother doing any more than that?


-- 
Dmitry Olshansky


More information about the Digitalmars-d-learn mailing list