Transparent ANSI to UTF-8 conversion
Dmitry Olshansky
dmitry.olsh at gmail.com
Wed Feb 27 12:45:34 PST 2013
28-Feb-2013 00:35, Lubos Pintes wrote:
> I don't understand the CTFE usage in this context. I thought about
> something like
> dchar[] windows_1250=[...];
> Isn't this enough?
> Thanks
It's fine. What I meant is that if all you want to do is convert ANSI ->
UTF-8, there is no need to convert to dchar and then to UTF-8 chars,
so the table becomes more like:
char[][] windows_1250_to_UTF8 = [...];
Or rather (far better memory footprint):
char[2][] windows_1250_to_UTF8 = [ ... ];
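I think 2 UTF-8 chars should be enough for most of your codepage, but
check the table first: anything above U+07FF (the euro sign U+20AC, for
one) needs 3, so the element may have to be char[3].
Then CTFE is just a tool to create one table from another: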
char[2][] windows_1250UTF = createUTF8Table(windows_1250);
The point is that inside createUTF8Table you create an array using
new and simple loops + std.utf.encode, just like in normal code, but
it'll be CTFE-ed.
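A minimal sketch of what such a function could look like. The width
parameter N, the name sampleTable and its four entries are assumptions
made for illustration; a real table would hold one dchar per byte value
of the codepage.

```d
import std.utf : encode;

// Placeholder excerpt, NOT the real Windows-1250 table:
// 'A', U+00E1, U+010D, U+0161.
immutable dchar[] sampleTable = ['A', '\u00E1', '\u010D', '\u0161'];

// Build a fixed-width UTF-8 table from a code point table.
// N is the number of bytes reserved per entry; shorter sequences
// are padded with '\0'.
char[N][] createUTF8Table(size_t N)(const(dchar)[] codepoints)
{
    auto result = new char[N][](codepoints.length);
    foreach (i, cp; codepoints)
    {
        char[4] buf;
        immutable len = encode(buf, cp);   // UTF-8 length of cp
        assert(len <= N, "code point needs more than N UTF-8 bytes");
        result[i][] = '\0';                // char.init is 0xFF, so clear it
        result[i][0 .. len] = buf[0 .. len];
    }
    return result;
}

// A static initializer forces the call to run at compile time (CTFE).
static immutable char[2][] sampleUTF8 = createUTF8Table!2(sampleTable);
```

If an entry does not fit in N bytes, the assert fires during compilation,
which is a cheap way to find out whether char[2] really is enough.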
Same goes for going backwards: you can treat char[2] as ushort and build
the tables. Though now they may have gaps, since the encoding is not linear
but rather has some stride with a certain period.
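A rough sketch of the backwards direction, under assumptions: it swaps
the gapped array for an associative array (so the gaps cost nothing), it
is built at run time rather than via CTFE, and the names packKey,
buildReverseTable, firstCode and the two sample entries are made up for
illustration.

```d
// Placeholder data, NOT real codepage positions:
// the UTF-8 bytes of U+00E1 and U+010D.
immutable char[2][] sample = [['\xC3', '\xA1'], ['\xC4', '\x8D']];

// Pack a 2-byte UTF-8 sequence into one integer key.
ushort packKey(const char[2] seq)
{
    return cast(ushort)((seq[0] << 8) | seq[1]);
}

// Map packed UTF-8 keys back to ANSI byte values.
// firstCode is the byte value that table[0] stands for.
ubyte[ushort] buildReverseTable(const char[2][] table, ubyte firstCode)
{
    ubyte[ushort] reverse;
    foreach (i, seq; table)
        reverse[packKey(seq)] = cast(ubyte)(firstCode + i);
    return reverse;
}
```

A lookup is then `key in reverse`, which yields null for sequences the
codepage cannot represent.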
>
> On 27. 2. 2013 18:32 Dmitry Olshansky wrote:
>> 27-Feb-2013 16:20, monarch_dodra wrote:
>>> On Wednesday, 27 February 2013 at 10:56:16 UTC, Lubos Pintes wrote:
>>>> Hi,
>>>> I would like to transparently convert from ANSI to UTF-8 when dealing
>>>> with text files. For example here in Slovakia, virtually every text
>>>> file is in Windows-1250.
>>>> If someone opens a text file, he or she expects that it will work
>>>> properly. So I suppose, that it is not feasible to tell someone "if
>>>> you want to use my program, please convert every text to UTF-8".
>>>>
>>>> To obtain the mapping from ANSI to Unicode for particular code page is
>>>> trivial. Maybe even MultibyteToWidechar could help with this.
>>>>
>>>> I however need to know how to do it "D-way". Could I define something
>>>> like TextReader class? Or perhaps some support already exists
>>>> somewhere?
>>>> Thanks
>>>
>>> I'd say the D way would be to simply exploit the fact that UTF is built
>>> into the language, and as such, not worry about encoding, and use raw
>>> code points.
>>>
>>> You get your "Codepage to unicode *codepoint*" table, and then you simply
>>> map each character to a dchar. From there, D will itself convert your
>>> raw unicode (aka UTF-32) to UTF8 on the fly, when you need it. For
>>> example, writing to a file will automatically convert input to UTF-8.
>>> You can also simply use std.conv.to!string to convert any UTF scheme to
>>> UTF-8 (or any other UTF too for that matter).
>>
>> Making a table that translates ANSI to UTF8 is trivially constructible
>> using CTFE from the static one that does ANSI -> dchar.
>>>
>>> This may not be as efficient as a "true" "codepage to UTF8 table" but:
>>> 1) Given you'll most probably be IO bound anyways, who cares?
>>
>> With in-memory transcoding you won't be. Text editors are typically all
>> in-memory or mmap-ed.
>>
>>> 2) Scalability. D does everything but the code page to code point
>>> mapping. Why bother doing any more than that?
>>
>>
>
--
Dmitry Olshansky