Working with utf
Frits van Bommel
fvbommel at REMwOVExCAPSs.nl
Thu Jun 14 12:08:57 PDT 2007
Derek Parnell wrote:
> On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:
>
>> (The first 256 code points of Unicode are identical to Latin-1)
>
> I was not aware of that. So if one needs to convert from Latin-1 to utf8
> ...
>
> import std.utf;
>
> dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
> {
> dchar[] utf;
>
> utf.length = pLatin1Text.length;
> foreach(i, b; pLatin1Text)
> utf[i] = b;
> return utf;
> }
>
> char[] Latin1toUTF8(ubyte[] pLatin1Text)
> {
> return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
> }
That'd work, but will allocate more memory than required (5 to 6 times
the length of the Latin-1 text worth of allocation - 4 times for the
utf-32, plus 1 to 2 times for the utf-8). How about this:
---
import std.utf;

char[] Latin1toUTF8(ubyte[] lat1) {
    char[] utf8;
    // preallocate
    utf8.length = lat1.length;
    /* optionally preallocate up to 2 * lat1.length characters
       instead (you'll never need more than that).
    */
    // reset the length so encode() appends from the start
    utf8.length = 0;
    foreach (latchar; lat1) {
        // appends 1 char for ASCII, 2 chars for 0x80..0xFF
        utf8.encode(latchar);
    }
    return utf8;
}
---
This should allocate 1 to 3 times the length of the Latin-1 text: 1 time
the length as initial allocation, plus a doubling on reallocation if
there are any non-ascii characters. (If I remember the allocation policy
correctly)
It'll allocate 2 times the Latin-1 length if you preallocate that beforehand.
All memory allocation sizes calculated above exclude whatever extra
memory the allocator adds to get a nice round bin-size of course, so
this is more of an estimate; it'll likely be a bit more.
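If the reallocation matters, one possible variation is to count the non-ASCII bytes first and allocate exactly once. The following is only a sketch under that assumption: the function name latin1ToUtf8Exact is made up for illustration, and it writes the two-byte UTF-8 form by hand instead of calling std.utf.encode. It relies on the fact that every Latin-1 byte below 0x80 maps to one UTF-8 byte and every byte from 0x80 to 0xFF maps to exactly two.

```d
// Hypothetical helper (not part of Phobos): converts Latin-1 to UTF-8
// with a single allocation of exactly the required size.
char[] latin1ToUtf8Exact(ubyte[] lat1)
{
    // First pass: each byte >= 0x80 needs one extra UTF-8 byte.
    size_t extra = 0;
    foreach (b; lat1)
    {
        if (b >= 0x80)
            ++extra;
    }

    // Single allocation: one char per input byte plus the extras.
    char[] utf8 = new char[lat1.length + extra];

    // Second pass: write the encoded bytes in place.
    size_t i = 0;
    foreach (b; lat1)
    {
        if (b < 0x80)
        {
            // ASCII is identical in Latin-1 and UTF-8.
            utf8[i++] = cast(char) b;
        }
        else
        {
            // Code points 0x80..0xFF encode as 110000xx 10xxxxxx.
            utf8[i++] = cast(char) (0xC0 | (b >> 6));
            utf8[i++] = cast(char) (0x80 | (b & 0x3F));
        }
    }
    return utf8;
}
```

For the input bytes 0x41 0xE9 this yields the three bytes 0x41 0xC3 0xA9. The cost is a second pass over the input, so whether it beats the append-based version depends on how expensive a reallocation is compared with that extra scan.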