Transparent ANSI to UTF-8 conversion

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Feb 27 12:45:34 PST 2013


28-Feb-2013 00:35, Lubos Pintes wrote:
> I don't understand the CTFE usage in this context. I thought about
> something like
> dchar[] windows_1250=[...];
> Isn't this enough?
> Thanks

It's fine. What I meant is that if all you want to do is convert ANSI ->
UTF-8, there is no need to convert to dchar first and then to UTF-8 chars.

so the table becomes more like:

char[][] windows_1250_to_UTF8 = [...];

Or rather (far better memory footprint):

char[2][] windows_1250_to_UTF8 = [ ... ];

I think 2 UTF-8 chars should be enough for the letters in your codepage,
though punctuation above U+07FF, such as the euro sign (U+20AC), takes 3.

Then CTFE is just a tool to create one table from another:
char[2][] windows_1250UTF = createUTF8Table(windows_1250);

The point is that inside createUTF8Table you create the array using
new and simple loops + std.utf.encode, just like in normal code, but
it'll be CTFE-ed.
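
A minimal sketch of what such a createUTF8Table could look like, under some assumptions of mine rather than anything from the thread: the Utf8Entry struct, the toUTF8 helper and the tiny demoCodePoints array are hypothetical (the demo array is NOT the real Windows-1250 table). The entry holds up to 3 code units plus a length instead of a bare char[2], because code points above U+07FF need three UTF-8 code units.

```d
import std.utf : encode;

// Hypothetical entry type: up to 3 UTF-8 code units plus how many are used.
struct Utf8Entry
{
    char[3] units = '\0';
    ubyte length;
}

// Builds a byte -> UTF-8 table from a byte -> code point table.
// Ordinary code: new, simple loops and std.utf.encode.
Utf8Entry[] createUTF8Table(const(dchar)[] codePoints) pure
{
    auto table = new Utf8Entry[](codePoints.length);
    foreach (i, dchar c; codePoints)
    {
        char[4] buf;
        immutable n = encode(buf, c); // number of code units written
        assert(n <= 3, "code point needs more than 3 UTF-8 code units");
        foreach (j; 0 .. n)
            table[i].units[j] = buf[j];
        table[i].length = cast(ubyte) n;
    }
    return table;
}

// Demo input only, not a real code page: index -> code point.
static immutable dchar[] demoCodePoints = ['A', '\u00E1', '\u010D', '\u20AC'];

// A static immutable initializer forces the call to run at compile time (CTFE).
static immutable Utf8Entry[] demoUTF8 = createUTF8Table(demoCodePoints);

// Hypothetical helper: transcodes bytes by concatenating table entries.
string toUTF8(const(ubyte)[] ansi, const(Utf8Entry)[] table) pure
{
    string result;
    foreach (b; ansi)
        result ~= table[b].units[0 .. table[b].length];
    return result;
}
```

With a full 256-entry code point table in place of the demo array, toUTF8 would index the table directly with each input byte.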

The same goes for going backwards: you can treat char[2] as ushort and
build the tables. Though now the table may have gaps, because the encoding
is not linear but rather has some stride with a certain period.
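
One possible way to handle the backwards direction, sketched under my own assumptions: instead of a directly indexed table with gaps, this packs the UTF-8 code units into an integer key (a uint rather than ushort, so 3-unit sequences fit too) and keeps a sorted table searched by binary search. That is a swapped-in technique, not necessarily the layout meant above. ReverseEntry, packUTF8, createReverseTable, lookupAnsi and the demo array are all hypothetical names, and the demo array is not a real code page.

```d
import std.utf : encode;

// Packs the UTF-8 code units of one code point into an integer key.
uint packUTF8(dchar c) pure
{
    char[4] buf;
    immutable n = encode(buf, c);
    uint key = 0;
    foreach (j; 0 .. n)
        key = (key << 8) | buf[j];
    return key;
}

// Hypothetical entry: packed UTF-8 sequence and the ANSI byte it maps to.
struct ReverseEntry
{
    uint key;
    ubyte ansi;
}

// Builds the reverse table and sorts it by key (insertion sort keeps it simple).
ReverseEntry[] createReverseTable(const(dchar)[] codePoints) pure
{
    auto table = new ReverseEntry[](codePoints.length);
    foreach (i, dchar c; codePoints)
        table[i] = ReverseEntry(packUTF8(c), cast(ubyte) i);
    foreach (i; 1 .. table.length)
    {
        auto e = table[i];
        size_t j = i;
        while (j > 0 && table[j - 1].key > e.key)
        {
            table[j] = table[j - 1];
            --j;
        }
        table[j] = e;
    }
    return table;
}

// Binary search; returns the ANSI byte, or -1 if the sequence is not in the table.
int lookupAnsi(const(ReverseEntry)[] table, uint key) pure
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (table[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < table.length && table[lo].key == key) ? table[lo].ansi : -1;
}

// Demo input only, not a real code page: index -> code point.
static immutable dchar[] demoReverseInput = ['A', '\u00E1', '\u010D', '\u20AC'];

// Evaluated at compile time, like the forward table.
static immutable ReverseEntry[] demoReverse = createReverseTable(demoReverseInput);
```

The sorted table trades a few comparisons per lookup for not having to store the gaps at all.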

>
> On 27. 2. 2013 18:32 Dmitry Olshansky wrote:
>> 27-Feb-2013 16:20, monarch_dodra wrote:
>>> On Wednesday, 27 February 2013 at 10:56:16 UTC, Lubos Pintes wrote:
>>>> Hi,
>>>> I would like to transparently convert from ANSI to UTF-8 when dealing
>>>> with text files. For example here in Slovakia, virtually every text
>>>> file is in Windows-1250.
>>>> If someone opens a text file, he or she expects that it will work
>>>> properly. So I suppose, that it is not feasible to tell someone "if
>>>> you want to use my program, please convert every text to UTF-8".
>>>>
>>>> Obtaining the mapping from ANSI to Unicode for a particular code page
>>>> is trivial. Maybe even MultiByteToWideChar could help with this.
>>>>
>>>> I however need to know how to do it "D-way". Could I define something
>>>> like TextReader class? Or perhaps some support already exists
>>>> somewhere?
>>>> Thanks
>>>
>>> I'd say the D way would be to simply exploit the fact that UTF is built
>>> into the language, and as such, not worry about encoding, and use raw
>>> code points.
>>>
>>> You get your "Codepage to unicode *codepoint*" table, and then you simply
>>> map each character to a dchar. From there, D will itself convert your
>>> raw unicode (aka UTF-32) to UTF8 on the fly, when you need it. For
>>> example, writing to a file will automatically convert input to UTF-8.
>>> You can also simply use std.conv.to!string to convert any UTF scheme to
>>> UTF-8 (or any other UTF too for that matter).
>>
>> Making a table that translates ANSI to UTF8 is trivially constructible
>> using CTFE from the static one that does ANSI -> dchar.
>>>
>>> This may not be as efficient as a "true" "codepage to UTF8 table" but:
>>> 1) Given you'll most probably be IO bound anyways, who cares?
>>
>> With in-memory transcoding you won't be. Text editors are typically all
>> in-memory or mmap-ed.
>>
>>> 2) Scalability. D does everything but the code page to code point
>>> mapping. Why bother doing any more than that?
>>
>>
>


-- 
Dmitry Olshansky


More information about the Digitalmars-d-learn mailing list