Why UTF-8/16 character encodings?

Peter Alexander peter.alexander.au at gmail.com
Fri May 24 10:43:00 PDT 2013


On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
> This triggered a long-standing bugbear of mine: why are we 
> using these variable-length encodings at all?

Simple: backwards compatibility with all ASCII APIs (e.g. most C 
libraries), and because I don't want my strings to consume 
multiple bytes per character when I don't need them.
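
To illustrate the compatibility point, here is a minimal Python sketch (not from the original post; the sample strings are arbitrary). It checks that pure-ASCII text has the same bytes under ASCII and UTF-8, and that only the non-ASCII character costs extra bytes.

```python
# Pure-ASCII text encodes to identical bytes under ASCII and UTF-8,
# so an API that expects ASCII can consume the UTF-8 buffer unchanged.
text = "hello, world"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# With one non-ASCII character, UTF-8 spends extra bytes only on that
# character; a fixed 4-byte encoding pays for every character.
mixed = "na\u00efve"                          # "naive" with a diaeresis on the i
utf8_len = len(mixed.encode("utf-8"))         # 4 ASCII bytes + 2 bytes for U+00EF
utf32_len = len(mixed.encode("utf-32-le"))    # 5 characters * 4 bytes each

print(same_bytes, utf8_len, utf32_len)        # True 6 20
```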

Your language header idea is no good for at least three reasons:

1. What happens if I want to take a substring slice of your 
string? I'll need to allocate a new string to add the header in.

2. What if I have a long string with the ASCII header and want to 
append a non-ASCII character to the end? I'll need to reallocate 
the whole string and widen it to match the new header.

3. Even if my string is 99% ASCII, I have to pay extra bytes for 
every character just because 1% wasn't ASCII. With UTF-8, I only 
pay the extra bytes when needed.
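
The three points above can be sketched in Python. This is only a rough model: `header_encode`, its 1-byte header values, and the 2-byte wide mode are assumptions made up for illustration, not the scheme actually proposed in the thread.

```python
# Hypothetical model of the "language header" idea: a 1-byte header selects
# a fixed width for every character. The header values and the 2-byte wide
# mode are assumptions for illustration only (non-BMP characters ignored).
def header_encode(text: str) -> bytes:
    if all(ord(ch) < 128 for ch in text):
        return b"\x01" + text.encode("ascii")    # narrow: 1 byte per character
    return b"\x02" + text.encode("utf-16-le")    # wide: 2 bytes per character

ascii_text = "a" * 99
mixed_text = ascii_text + "\u00e9"               # 99% ASCII, 1% not

# Point 1: a UTF-8 substring is just a byte range of the original buffer.
# (Byte and character indices coincide here because the prefix is ASCII.)
utf8 = mixed_text.encode("utf-8")
utf8_slice = memoryview(utf8)[10:20]             # no copy, no header to prepend
slice_matches = bytes(utf8_slice) == mixed_text[10:20].encode("utf-8")

# Point 2: appending a non-ASCII character leaves the existing UTF-8 bytes
# untouched, while the header model rewrites the string from byte 0.
utf8_keeps_prefix = utf8.startswith(ascii_text.encode("utf-8"))
header_keeps_prefix = header_encode(mixed_text).startswith(header_encode(ascii_text))

# Point 3: total size in bytes for the 99% ASCII string.
utf8_size = len(utf8)
header_size = len(header_encode(mixed_text))

print(slice_matches, utf8_keeps_prefix, header_keeps_prefix, utf8_size, header_size)
# True True False 101 201
```

In this model a slice of the header string would also need its own header byte in front of the payload, which is where the extra allocation in point 1 comes from.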

