UTF-8 Everywhere

Charles Hixson via Digitalmars-d digitalmars-d at puremagic.com
Mon Jun 20 11:34:01 PDT 2016


On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:
> On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
>> To me it seems that a lot of the time processing is more efficient with
>> UCS-4 (what I call utf-32).  Storage is clearly more efficient with utf-8,
>> but access is more direct with UCS-4.  I agree that utf-8 is generally to
>> be preferred where it can be efficiently used, but that's not everywhere.
>> The problem is efficient bi-directional conversion...which D appears to
>> handle fairly well already with text() and dtext().  (I don't see any
>> utility for utf-16.  To me that seems like a first attempt that should
>> have been deprecated.)
>
> That seemed to me to be true, too, until I wrote a text processing 
> program using UCS-4. It was rather slow. Turns out, 4x memory 
> consumption has a huge performance cost.
The approach I took (which worked well for my purposes) was to process 
the text a line at a time, and for that the memory overhead was 
trivial. ... If I'd needed to go back and forth this wouldn't have been 
desirable, but there was one dtext conversion, then the processing, and 
then several text conversions (of small portions), and it was quite 
efficient.  Clearly this approach can't be taken in all circumstances, 
but for this purpose it was significantly more efficient than any other 
approach I've tried.  It's also true that most of the text I handled was 
actually ASCII, which would have made the most common conversions 
simpler.
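
A minimal sketch of the line-at-a-time pattern described above, assuming 
the per-line work is something like "take the first n characters".  The 
names firstCodePoints and processText are hypothetical, made up for 
illustration; only dtext, text and lineSplitter come from Phobos.  It is 
not the original program, just one way the shape could look:

```d
import std.conv : dtext, text;
import std.string : lineSplitter;

// Hypothetical per-line task: return the first `n` code points of a line.
string firstCodePoints(string line, size_t n)
{
    dstring wide = line.dtext;      // one UTF-8 -> UTF-32 conversion per line
    if (n > wide.length)
        n = wide.length;
    // On a dstring, slicing is by code point, so no decoding is needed here.
    return wide[0 .. n].text;       // convert only the small portion back to UTF-8
}

// Hypothetical driver: only one line is held in UTF-32 form at a time.
string[] processText(string input, size_t n)
{
    string[] result;
    foreach (line; input.lineSplitter)
        result ~= firstCodePoints(line, n);
    return result;
}

unittest
{
    // \u00E9 is e-acute, \u00F6 is o-umlaut; each takes two bytes in UTF-8.
    assert(processText("h\u00E9llo w\u00F6rld\nabc", 4) == ["h\u00E9ll", "abc"]);
}
```

Because only the current line is widened, the 4x cost applies to one 
line's worth of memory rather than to the whole text.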

To me it appears that both cases need to be handled.  The problem is 
documenting the tradeoffs in efficiency.  D already seems to work quite 
well with arrays of dchars, so there may well be no need for 
development in that area.  Indexing utf-8 arrays by character, however, 
is much more complicated, since a character can take from one to four 
bytes, and I doubt it can ever be as efficient.  Memory allocation is a 
separate, though not independent, complexity.  If you can work in small 
chunks then it becomes less important.
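
To illustrate the indexing difference, here is a small sketch.  The 
helper codePointAt is hypothetical; decode and toUTFindex are from 
std.utf.  Indexing a string gives a code unit (a byte), while indexing a 
dstring gives a whole code point, so finding the n-th character of a 
utf-8 string means walking it from the start:

```d
import std.conv : dtext;
import std.utf : decode, toUTFindex;

// Hypothetical helper: the n-th code point (counted from 0) of a UTF-8 string.
// toUTFindex scans from the beginning of the string, so the cost grows with n.
dchar codePointAt(string s, size_t n)
{
    size_t i = s.toUTFindex(n);   // code point index -> code unit (byte) index
    return decode(s, i);          // decode the sequence that starts at byte i
}

unittest
{
    string narrow = "h\u00E9llo";     // \u00E9 (e-acute) is two bytes in UTF-8
    dstring wide = narrow.dtext;

    assert(narrow.length == 6);       // length in code units
    assert(wide.length == 5);         // length in code points
    assert(narrow[1] == 0xC3);        // a lead byte, not a whole character
    assert(wide[2] == 'l');           // direct, constant-time access
    assert(codePointAt(narrow, 2) == 'l');   // same answer, but found by scanning
}
```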


