The Case Against Autodecode

Joakim via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 13:22:32 PDT 2016


On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d 
> wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated.  It 
>> forces all languages other than English to be twice as long, 
>> for no good reason; have fun with that when you're downloading 
>> text on a 2G connection in the developing world.  It is 
>> unnecessarily inefficient, which is precisely why 
>> auto-decoding is a problem.  It is only a matter of time till 
>> UTF-8 is ditched.
>
> Considering that *nix land uses UTF-8 almost exclusively, and 
> many C libraries do even on Windows, I very much doubt that 
> UTF-8 is going anywhere anytime soon - if ever. The Win32 API 
> does use UTF-16, as do Java and C#, but the vast sea of code 
> that is C or C++ generally uses UTF-8, as do plenty of other 
> programming languages.

I agree that both UTF encodings are somewhat popular now.

> And even aside from English, most European languages are going 
> to be more efficient with UTF-8, because they're still 
> primarily ASCII even if they contain characters that are not. 
> Stuff like Chinese is definitely worse in UTF-8 than it would 
> be in UTF-16, but there are a lot of languages other than 
> English which are going to encode better with UTF-8 than UTF-16 
> - let alone UTF-32.

And there are a lot more languages that will be twice as long as 
English, i.e. ASCII.
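
For example, every Cyrillic letter is a two-byte sequence in 
UTF-8, so the byte count doubles relative to the character count. 
A quick check in D (walkLength counts characters here precisely 
because Phobos auto-decodes strings into ranges of dchar):

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string s = "привет";   // Russian "hello", six Cyrillic letters
    writeln(s.length);     // 12 -- UTF-8 code units (bytes)
    writeln(s.walkLength); //  6 -- characters, via auto-decoding
}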

> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too 
> much uses it for it to be going anywhere, and most folks have 
> no problem with that. Any attempt to get rid of it would be a 
> huge, uphill battle.

I disagree; it is inevitable.  Any tech so complex and 
inefficient cannot last long.

> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even 
> without involving the standard library - so anyone who wants to 
> avoid UTF-8 is free to do so.

Yes, but not by using UTF-16/32, which use too much memory.  I've 
suggested a single-byte encoding for most languages instead, both 
in my last post and the earlier thread.
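
For reference, D's three native string types differ only in 
code-unit width, which is exactly where the memory cost comes 
from:

string  s = "hello";   // immutable(char)[],  UTF-8,  1 byte per code unit
wstring w = "hello"w;  // immutable(wchar)[], UTF-16, 2 bytes per code unit
dstring d = "hello"d;  // immutable(dchar)[], UTF-32, 4 bytes per code unit

For ASCII-range text, UTF-16 doubles the storage and UTF-32 
quadruples it.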

D could use this new encoding internally, while keeping its 
current UTF-8/16 strings around for any outside UTF-8/16 data 
passed in.  Any of that data run through algorithms that don't 
require decoding could be kept in UTF-8, but the moment any 
decoding is required, D would translate UTF-8 to the new 
encoding, which would be much easier for programmers to 
understand and manipulate. If UTF-8 output is needed, you'd have 
to encode back again.
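
As a rough sketch of that boundary, here is the same round trip 
using Latin-1 from std.encoding as a stand-in for the proposed 
single-byte encoding (Latin-1 is only for illustration; the real 
proposal would pick a code page per language, which Phobos 
doesn't have):

import std.encoding : Latin1String, transcode;

void main()
{
    string utf8 = "café";     // 5 bytes: 'é' takes two in UTF-8
    assert(utf8.length == 5);

    // Decode boundary: transcode once into a single-byte
    // encoding, where every character is exactly one byte.
    Latin1String sb;
    transcode(utf8, sb);
    assert(sb.length == 4);   // O(1) indexing, no per-access decoding

    // Output boundary: encode back to UTF-8 only when needed.
    string back;
    transcode(sb, back);
    assert(back == utf8);
}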

Yes, this translation layer would be a bit of a pain, but the new 
encoding would be so much more efficient and understandable that 
it would be worth it, and you're already decoding and encoding 
back to UTF-8 for those algorithms now.  All that would change is 
that the default decoded form would be the new encoding rather 
than dchar.  If it succeeds for D, it could then be sold more 
widely as a replacement for UTF-8/16.

I think this would be the right path forward, not navigating this 
UTF-8/16 mess further.

