Encodings

Sun Apr 8 15:03:20 PDT 2012

On Sunday, April 08, 2012 23:36:23 Nathan M. Swan wrote:
> For most of the string processing I do, I read/write text in
> UTF-8 and convert it to UTF-32 for processing (with std.utf), so
> I don't have to worry about encoding. Is this a good or bad
> paradigm? Is there a better way to do this? What method do all of
> you use?
> 
> Just curious, NMS

It depends on what you're doing. Depending on the functions that you use and 
your memory requirements, either UTF-8 or UTF-32 may be faster. A UTF-32 string 
has the advantage of being a random-access range, so it works with a number of 
functions that a UTF-8 string won't work with. But UTF-32 also takes 
considerably more memory (up to four times as much if most of your characters 
are ASCII), which can be a problem.
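
To make the trade-off concrete, here is a minimal sketch (the sample text and 
variable names are just made up for illustration) that checks the range 
category of each string type and compares how much memory the same text needs:

```d
import std.conv : to;
import std.range : isBidirectionalRange, isRandomAccessRange;

void main()
{
    // Illustrative sample text: 12 code points, one of them non-ASCII.
    string  s8  = "hello, w\u00F6rld";   // UTF-8
    dstring s32 = s8.to!dstring;         // UTF-32

    // A dstring is a random-access range; a string is only bidirectional,
    // because its code points have variable width.
    static assert( isRandomAccessRange!dstring);
    static assert(!isRandomAccessRange!string);
    static assert( isBidirectionalRange!string);

    // .length counts code units, not characters.
    assert(s8.length  == 13);            // the non-ASCII letter takes 2 bytes
    assert(s32.length == 12);            // one code unit per code point

    // Memory used by the character data.
    assert(s8.length  * char.sizeof  == 13);
    assert(s32.length * dchar.sizeof == 48);

    // Indexing a dstring gives you the n-th code point directly.
    assert(s32[8] == '\u00F6');
}
```

For text that is mostly ASCII, the UTF-32 copy ends up close to four times the 
size of the UTF-8 original.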

I think that the most common approach is to just operate on UTF-8 unless 
another encoding is needed (e.g. UTF-32 because you need random access). In 
plenty of cases, you end up operating on generic ranges anyway if you use 
range-based functions on strings and don't call std.array.array on the results.
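
As a rough sketch of what that looks like (the sample text and the choice of 
filter and map are only for illustration), range-based functions walk a UTF-8 
string one code point at a time, and nothing is converted to UTF-32 in memory 
until you ask for an array:

```d
import std.algorithm : equal, filter, map;
import std.array : array;
import std.uni : isAlpha, toUpper;

void main()
{
    string text = "utf-8 stays utf-8";   // illustrative input

    // Lazy range of dchar over the UTF-8 string; no new string is allocated.
    auto letters = text.filter!(c => isAlpha(c))
                       .map!(c => toUpper(c));
    assert(letters.equal("UTFSTAYSUTF"));

    // std.array.array is the step that materializes the result,
    // and what you get back is a dchar[] (i.e. UTF-32).
    auto copy = letters.array;
    static assert(is(typeof(copy) == dchar[]));
    assert(copy.length == 11);
}
```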

You're going to have to profile your code to see whether processing your 
strings primarily as UTF-8 or as UTF-32 is more efficient.
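
A crude way to get a first impression, assuming a reasonably recent Phobos 
that has std.datetime.stopwatch, is to time the same operation on both 
encodings. The workload below (counting a letter in repeated sample text) is 
entirely made up, so substitute your own processing, and treat this as a 
sketch rather than a proper benchmark:

```d
import std.algorithm : count;
import std.array : replicate;
import std.conv : to;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

void main()
{
    // Hypothetical workload: mostly-ASCII text in both encodings.
    string  s8  = "mostly ASCII text ".replicate(10_000);
    dstring s32 = s8.to!dstring;

    size_t n8, n32;

    // Time the UTF-8 version (decodes code points as it goes).
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. 100)
        n8 = s8.count('t');
    auto t8 = sw.peek;

    // Time the UTF-32 version (no decoding, but four times the memory).
    sw.reset();
    foreach (i; 0 .. 100)
        n32 = s32.count('t');
    auto t32 = sw.peek;

    assert(n8 == n32);   // both encodings must give the same answer
    writefln("UTF-8: %s   UTF-32: %s", t8, t32);
}
```

The numbers will vary with the compiler, the optimization flags, and what the 
text actually contains, so measure with your real data.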

- Jonathan M Davis