The Case Against Autodecode
Joakim via Digitalmars-d
digitalmars-d at puremagic.com
Tue May 31 13:22:32 PDT 2016
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d
> wrote:
> UTF-8 is an antiquated hack that needs to be eradicated. It
> forces all languages other than English to be twice as long,
> for no good reason; have fun with that when you're downloading
> text on a 2G connection in the developing world. It is
>> unnecessarily inefficient, which is precisely why
>> auto-decoding is a problem. It is only a matter of time till
>> UTF-8 is ditched.
>
> Considering that *nix land uses UTF-8 almost exclusively, and
> many C libraries do even on Windows, I very much doubt that
> UTF-8 is going anywhere anytime soon - if ever. The Win32 API
> does use UTF-16, as do Java and C#, but the vast sea of code
> that is C or C++ generally uses UTF-8, as do plenty of other
> programming languages.
I agree that both UTF encodings are somewhat popular now.
> And even aside from English, most European languages are going
> to be more efficient with UTF-8, because they're still
> primarily ASCII even if they contain characters that are not.
> Stuff like Chinese is definitely worse in UTF-8 than it would
> be in UTF-16, but there are a lot of languages other than
> English which are going to encode better with UTF-8 than UTF-16
> - let alone UTF-32.
And there are a lot more languages that will be twice as long
as English, i.e. ASCII.
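To put concrete numbers on both points, here's a quick D
comparison; the sample strings are just illustrative:

import std.stdio;

void main()
{
    // English: 1 byte per character in UTF-8, 2 in UTF-16.
    string  en8  = "hello";
    wstring en16 = "hello"w;

    // Chinese: 3 bytes per character in UTF-8, 2 in UTF-16.
    string  zh8  = "你好世界";
    wstring zh16 = "你好世界"w;

    // .length counts code units; multiply by the unit size for bytes.
    writefln("English: %s bytes UTF-8, %s bytes UTF-16",
             en8.length, en16.length * wchar.sizeof);  // 5 vs 10
    writefln("Chinese: %s bytes UTF-8, %s bytes UTF-16",
             zh8.length, zh16.length * wchar.sizeof);  // 12 vs 8
}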
> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too
> much uses it for it to be going anywhere, and most folks have
> no problem with that. Any attempt to get rid of it would be a
> huge, uphill battle.
I disagree; it is inevitable. Any tech so complex and
inefficient cannot last long.
> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even
> without involving the standard library - so anyone who wants to
> avoid UTF-8 is free to do so.
Yes, but not by using UTF-16/32, which use too much memory. I've
suggested a single-byte encoding for most languages instead, both
in my last post and in the earlier thread.
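For reference, those three native types in code:

import std.stdio;

void main()
{
    string  s8  = "hello";   // immutable(char)[],  UTF-8 code units
    wstring s16 = "hello"w;  // immutable(wchar)[], UTF-16 code units
    dstring s32 = "hello"d;  // immutable(dchar)[], UTF-32 code units

    // Same five characters, three different memory footprints.
    writeln(s8.length  * char.sizeof);   // 5 bytes
    writeln(s16.length * wchar.sizeof);  // 10 bytes
    writeln(s32.length * dchar.sizeof);  // 20 bytes
}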
D could use this new encoding internally, while keeping its
current UTF-8/16 strings around for any outside UTF-8/16 data
passed in. Any of that data run through algorithms that don't
require decoding could be kept in UTF-8, but the moment any
decoding is required, D would translate UTF-8 to the new
encoding, which would be much easier for programmers to
understand and manipulate. If UTF-8 output is needed, you'd have
to encode back again.
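Roughly, the boundary could look something like the toy sketch
below. The proposed encoding hasn't been specified anywhere, so
this stand-in just uses a single Latin-1-style page, and the
toSingleByte/toUtf8 names are made up for illustration:

import std.exception : enforce;
import std.stdio;

// Toy stand-in: one byte per character, valid only for code points
// that fit one "page" (here, Latin-1). The real proposal would add
// a language header selecting among many such pages.
ubyte[] toSingleByte(string utf8)
{
    ubyte[] result;
    foreach (dchar c; utf8)  // foreach over a string decodes to dchar
    {
        enforce(c < 256, "code point outside this toy page");
        result ~= cast(ubyte) c;
    }
    return result;
}

string toUtf8(const(ubyte)[] sb)
{
    string result;
    foreach (b; sb)
        result ~= cast(dchar) b;  // appending a dchar re-encodes as UTF-8
    return result;
}

void main()
{
    auto sb = toSingleByte("café");  // 4 characters in 4 bytes
    writeln(sb.length);              // 4 (the UTF-8 original is 5 bytes)
    writeln(toUtf8(sb));             // café
}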
Yes, this translation layer would be a bit of a pain, but the new
encoding would be so much more efficient and understandable that
it would be worth it; after all, you're already decoding and
encoding back to UTF-8 for those algorithms now. All that would
change is that the default decoded form would be this new
encoding rather than dchar. If it succeeds in D, it could then be
sold more widely as a replacement for UTF-8/16.
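That status quo is easy to see; Phobos's range primitives already
decode strings behind the scenes:

import std.range : front, walkLength;
import std.stdio;

void main()
{
    string s = "héllo";

    // front on a UTF-8 string yields a fully decoded dchar, not a
    // char, and walkLength counts decoded code points.
    static assert(is(typeof(s.front) == dchar));
    writeln(s.length);      // 6 code units (bytes)
    writeln(s.walkLength);  // 5 decoded characters
}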
I think this would be the right path forward, rather than
navigating this UTF-8/16 mess further.