The Case Against Autodecode

Joakim via Digitalmars-d digitalmars-d at puremagic.com
Wed Jun 1 06:57:27 PDT 2016


On Wednesday, 1 June 2016 at 10:04:42 UTC, Marc Schütz wrote:
> On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated.  It 
>> forces all languages other than English to be twice as long, 
>> for no good reason.  Have fun with that when you're downloading 
>> text on a 2G connection in the developing world.
>
> I assume you're talking about the web here. In this case, plain 
> text makes up only a minor part of the entire traffic, the 
> majority of which is images (binary data), javascript and 
> stylesheets (almost pure ASCII), and HTML markup (ditto). It's 
> likely not significant even without taking compression into 
> account, which is ubiquitous.

No, I explicitly said in a subsequent post that I was not talking 
about the web.  The ignorance here of what 2G speeds are like is 
mind-boggling.
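
For a rough sense of the sizes under discussion, here is a minimal sketch that measures how many bytes UTF-8 spends per code point on a few short words. The sample words are illustrative assumptions, not taken from the thread, and the numbers cover raw uncompressed text only. Cyrillic and Greek letters take two bytes each in UTF-8, while Devanagari and CJK characters take three.

```python
# Sample words are illustrative assumptions, not data from the thread.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Greek": "γεια",
    "Hindi": "नमस्ते",
    "Chinese": "你好",
}


def utf8_bytes_per_code_point(text: str) -> float:
    """Average number of UTF-8 bytes used per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


ratios = {name: utf8_bytes_per_code_point(word) for name, word in samples.items()}

for name, ratio in ratios.items():
    print(f"{name:8s} {ratio:.1f} bytes per code point")
```

This says nothing about how the same text behaves after compression, which is the other half of the disagreement above.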

>> It is unnecessarily inefficient, which is precisely why 
>> auto-decoding is a problem.
>
> No, inefficiency is the least of the problems with 
> auto-decoding.

Right... that's why this 200-post thread was spawned with that as 
the main reason.

>> It is only a matter of time till UTF-8 is ditched.
>
> This is ridiculous, even if your other claims were true.

The UTF-8 encoding is what's ridiculous.

>>
>> D devs should lead the way in getting rid of the UTF-8 
>> encoding, not bickering about how to make it more palatable.  
>> I suggested a single-byte encoding for most languages, with 
>> double-byte for the ones which wouldn't fit in a byte.  Use 
>> some kind of header or other metadata to combine strings of 
>> different languages, _rather than encoding the language into 
>> every character!_
>
> I think I remember that post, and - sorry to be so blunt - it 
> was one of the worst things I've ever seen proposed regarding 
> text encoding.

Well, when you _like_ a ludicrous encoding like UTF-8, I'm not 
sure your opinion matters.
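
To make the trade-off in this exchange concrete, here is a hedged sketch of one way "single-byte encoding plus metadata" could look. The thread gives no concrete format, so the run-list representation and the helper names `encode_runs` and `decode_runs` are hypothetical. The only real APIs used are Python's built-in `ascii` and `cp1251` codecs, the latter being an existing single-byte Cyrillic code page.

```python
# Hypothetical representation: a string is a list of (codec_name, text) runs,
# and each run is stored in its own single-byte code page.
# This is an illustration only, not the format proposed in the thread.


def encode_runs(runs):
    """Encode each (codec_name, text) run with its own codec."""
    return [(codec, text.encode(codec)) for codec, text in runs]


def decode_runs(encoded):
    """Decode every run with its tagged codec and join the results."""
    return "".join(raw.decode(codec) for codec, raw in encoded)


mixed = [("ascii", "Say "), ("cp1251", "привет"), ("ascii", " to everyone")]
encoded = encode_runs(mixed)

# Payload bytes only; the per-run codec tags (the "header") are not counted.
payload_size = sum(len(raw) for _, raw in encoded)
utf8_size = len("".join(text for _, text in mixed).encode("utf-8"))

print(decode_runs(encoded))
print("tagged-run payload:", payload_size, "bytes")
print("UTF-8:", utf8_size, "bytes")
```

The payload is smaller than the UTF-8 form, but the tags have a cost of their own, and any code that slices or concatenates such a string has to keep track of which code page each run uses.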

>>
>> The common string-handling use case, by far, is strings with 
>> only one language, with a distant second some substrings in a 
>> second language, yet here we are putting the overhead into 
>> every character to allow inserting characters from an 
>> arbitrary language!  This is madness.
>
> No. The common string-handling use case is code that is unaware 
> which script (not language, btw) your text is in.

Lol, this may be the dumbest argument put forth yet.

I don't think anyone here even understands what a good encoding 
is and what it's for, which is why there's no point in debating 
this.

