Working with utf

Thu Jun 14 12:08:57 PDT 2007

Derek Parnell wrote:
> On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:
> 
>> (The first 256 code points of Unicode are identical to Latin-1)
> 
> I was not aware of that. So if one needs to convert from Latin-1 to utf8
> ...
> 
>   import std.utf;
> 
>    dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
>    {
>        dchar[] utf;
> 
>        utf.length = pLatin1Text.length;
>        foreach(i, b; pLatin1Text)
>               utf[i] = b;
>        return utf;
>    }
> 
>    char[] Latin1toUTF8(ubyte[] pLatin1Text)
>    {
>        return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
>    }

That'd work, but it will allocate more memory than required (5 to 6 times 
the length of the Latin-1 text worth of allocation: 4 times for the 
UTF-32 copy, plus 1 to 2 times for the UTF-8 result). How about this:
---
import std.utf;

char[] Latin1toUTF8(ubyte[] lat1) {
     char[] utf8;
     // preallocate
     utf8.length = lat1.length;
     /* optionally preallocate up to 2 * lat1.length characters
        instead (you'll never need more than that).
      */
     utf8.length = 0;
     foreach (latchar; lat1) {
         utf8.encode(latchar);
     }
     return utf8;
}
---
This should allocate 1 to 3 times the length of the Latin-1 text: once 
the length for the initial allocation, plus a doubling on reallocation if 
there are any non-ASCII characters (if I remember the allocation policy 
correctly).
It'll be 2 times the Latin-1 length if you preallocate that beforehand.

All memory allocation sizes calculated above exclude whatever extra 
memory the allocator adds to get a nice round bin-size of course, so 
this is more of an estimate; it'll likely be a bit more.
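
For illustration, the "preallocate 2 times the length" variant could also be 
written without std.utf.encode, since every Latin-1 byte becomes either one 
UTF-8 code unit (below 0x80) or exactly two (0x80 and above). This is only a 
sketch under that assumption; the name latin1ToUtf8Prealloc is made up for 
this example and is not part of std.utf.

```d
// Hypothetical helper, not from std.utf: converts Latin-1 bytes to UTF-8
// using a single worst-case allocation of 2 * lat1.length code units.
char[] latin1ToUtf8Prealloc(ubyte[] lat1)
{
    // Worst case: every input byte is >= 0x80 and needs two code units.
    char[] buf = new char[2 * lat1.length];
    size_t n = 0; // number of code units written so far

    foreach (b; lat1)
    {
        if (b < 0x80)
        {
            // ASCII range: copied through unchanged.
            buf[n++] = cast(char) b;
        }
        else
        {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx.
            buf[n++] = cast(char)(0xC0 | (b >> 6));
            buf[n++] = cast(char)(0x80 | (b & 0x3F));
        }
    }

    // Slice down to what was actually written; no reallocation happens.
    return buf[0 .. n];
}
```

The trade-off is the one described above: a single allocation of a known 
size, at the cost of wasting up to half of it when the text is mostly ASCII.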