Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Sat May 25 13:50:55 PDT 2013


On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
> >If I take you right, you propose to define a string as a header
> >that denotes a set of windows in code space? I still fail to see
> >how that would scale; see below.
>
> Something like that.  For a multi-language string encoding, the
> header would contain a single byte for every language used in the
> string, along with multiple index bytes to signify the start and
> finish of every run of single-language characters in the string.
> So, a list of languages and a list of pure single-language
> substrings.  This is just off the top of my head; I'm not
> suggesting it is definitive.
[...]
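
To make sure I'm reading that right, here's roughly the layout I think
you're describing -- a purely illustrative D sketch; the field names
and sizes are my own guesses, not something you specified:

struct LangRun
{
    ubyte langIndex;   // index into langs below
    size_t start;      // byte offset where this single-language run begins
    size_t end;        // byte offset just past the end of the run
}

struct MultiLangString
{
    ubyte[] langs;     // one byte per language used in the string
    LangRun[] runs;    // start/finish of every single-language run
    ubyte[] payload;   // the single-byte-encoded character data itself
}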

And just how exactly does that help with slicing? If anything, it makes
slicing far hairier and more error-prone than UTF-8. In fact, this one
point alone wipes out any performance gains you might have had from a
single-byte encoding. Now you can't do *any* slicing at all without
convoluted algorithms to determine which encoding is in effect at the
endpoints of your slice, and the resulting slice needs new headers to
mark the start/end of every different-language substring. By the time
you're done with all that, you're going way slower than you would be
just processing UTF-8.
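
For contrast, here's a minimal D sketch of why slicing UTF-8 stays
cheap: snapping an arbitrary byte index to a code point boundary is a
purely local operation (at most three backward steps), and the slice
itself carries no metadata at all. The particular string and index are
just made-up examples, of course:

import std.stdio;

void main()
{
    string s = "naïve café"; // D strings are UTF-8 encoded
    size_t i = 3;            // arbitrary byte index, possibly mid-code-point

    // Continuation bytes have the form 0b10xxxxxx; back up until we hit
    // a lead byte. At most 3 steps, regardless of string length.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;

    // The slice is just a pointer/length pair into the original data --
    // no headers to rebuild, no per-language bookkeeping to patch up.
    string tail = s[i .. $];
    writeln(tail); // prints "ïve café"
}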

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here
is far worse.


T

-- 
The best compiler is between your ears. -- Michael Abrash

