[OT] Effect of UTF-8 on 2G connections

Wyatt via Digitalmars-d digitalmars-d at puremagic.com
Wed Jun 1 11:30:25 PDT 2016


On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
> On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
>> It's not hard.  I think a lot of us remember when a 14.4 modem 
>> was cutting-edge.
>
> Well, then apparently you're unaware of how bloated web pages 
> are nowadays.  It used to take me minutes to download popular 
> web pages _back then_ at _top speed_, and those pages were a 
> _lot_ smaller.

It's telling that you think the encoding of the text is anything 
but the tiniest fraction of the problem.  You should look at 
where the actual weight of a "modern" web page comes from.
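
To put rough numbers on it (all figures assumed for illustration, sketched in D since we're on the D list): even a text-heavy page's prose is a rounding error next to its scripts and images, so halving the text encoding barely moves the total.

import std.stdio;

void main()
{
    // Assumed, not measured: ~50k characters of Cyrillic prose
    // on a page whose total transfer is ~2.3 MB.
    enum chars = 50_000;
    enum utf8Bytes = chars * 2;   // Cyrillic is 2 bytes/char in UTF-8
    enum codePage = chars * 1;    // ideal single-byte code page
    enum pageWeight = 2_300_000;

    writefln("UTF-8 text:     %d bytes", utf8Bytes);
    writefln("code-page text: %d bytes", codePage);
    writefln("savings:        %d bytes (%.1f%% of the page)",
             utf8Bytes - codePage,
             100.0 * (utf8Bytes - codePage) / pageWeight);
}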

>> Codepages and incompatible encodings were terrible then, too.
>>
>> Never again.
>
> This only shows you probably don't know the difference between 
> an encoding and a code page,

"I suggested a single-byte encoding for most languages, with 
double-byte for the ones which wouldn't fit in a byte. Use some 
kind of header or other metadata to combine strings of different 
languages, _rather than encoding the language into every 
character!_"

Yeah, that?  That's codepages.  And your exact proposal to put 
encodings in the header was ALSO tried around the time that 
Unicode was getting hashed out.  It sucked.  A lot.  (Not as bad 
as storing it in the directory metadata, though.)
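
In case it's not obvious why it sucked, here's a minimal sketch (types hypothetical) of where the header scheme runs aground: one language tag per string can't describe text that mixes scripts, so you end up putting language switches back into the byte stream, which is ISO 2022 all over again.

import std.stdio;

// Hypothetical shape of the proposal: a language tag out front,
// single-byte code page payload behind it.
struct TaggedString
{
    string lang;   // e.g. "el" (Greek) or "ru" (Russian)
    ubyte[] data;  // payload bytes in that language's code page
}

void main()
{
    // One tag describes one code page, so mixed-script text like
    // this can't be a single TaggedString without in-band escapes:
    string mixed = "Ελληνικά и русский";

    // Whereas UTF-8 carries it with no metadata at all.
    writeln(mixed.length, " UTF-8 bytes, zero headers");
}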

>>> Well, when you _like_ a ludicrous encoding like UTF-8, not 
>>> sure your opinion matters.
>>
>> It _is_ kind of ludicrous, isn't it?  But it really is the 
>> least-bad option for the most text.  Sorry, bub.
>
> I think we can do a lot better.

Maybe.  But no one's done it yet.
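
For scale, here's what "least-bad" looks like in D, where string is UTF-8 by definition and .length counts code units (bytes):

import std.stdio;

void main()
{
    // string is immutable(char)[]: UTF-8 code units, so .length
    // is the encoded size in bytes, not the character count.
    writeln("hello".length);    // 5  -- ASCII: 1 byte/char
    writeln("hällo".length);    // 6  -- Latin accents: 2 bytes
    writeln("Ελληνικά".length); // 16 -- Greek: 2 bytes/char
    writeln("日本語".length);    // 9  -- CJK: 3 bytes/char
}

ASCII stays at one byte and everything else pays one or two extra; that's the whole "ludicrous" overhead.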

> The vast majority of software is written for _one_ language, 
> the local one.  You may think otherwise because the software 
> that sells the most and makes the most money is 
> internationalized software like Windows or iOS, because it can 
> be resold into many markets.  But as a percentage of lines of 
> code written, such international code is almost nothing.

I'm surprised you think this even matters after talking about web 
pages.  The browser is your most common string processing 
situation.  Nothing else even comes close.

> largely ignoring the possibilities of the header scheme I 
> suggested.

"Possibilities" that were considered and discarded decades ago by 
people with way better credentials.  The era of single-byte 
encodings is gone, it won't come back, and good riddance to bad 
rubbish.

> I could call that "trolling" by all of you, :) but I'll instead 
> call it what it likely is, reactionary thinking, and move on.

It's not trolling to call you out for clearly not doing your 
homework.

> I don't think you understand: _you_ are the special case.

Oh, I understand perfectly.  _We_ (whoever "we" are) can handle 
any sequence of glyphs and combining characters (correctly-formed 
or not) in any language at any time, so we're the special case...?

Yeah, it sounds funny to me, too.
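
A concrete instance of "correctly-formed or not", sketched in D: the same visible character can arrive precomposed or as a base letter plus a combining mark, and Unicode-aware code has to cope with both.

import std.stdio;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";  // "é" as a single code point
    string combining   = "e\u0301"; // "e" + combining acute accent

    writeln(precomposed == combining);                // false: bytes differ
    writeln(normalize!NFC(combining) == precomposed); // true after NFC
}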

> The 5 billion people outside the US and EU are _not the special 
> case_.

Fortunately, it works for them too.

> The problem is all the rest, and those just below who cannot 
> afford it at all, in part because the tech is not as efficient 
> as it could be yet.  Ditching UTF-8 will be one way to make it 
> more efficient.

All right, now you've found the special case: the one where the 
generic, unambiguous encoding may need to be lowered to something 
else, because for those people it's suboptimal under _current_ 
network constraints.
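
To size that constraint (figures assumed, not measured): on a 
2G-class link, the per-character overhead is real but bounded.

import std.stdio;

void main()
{
    // Assumed: 20k characters of non-Latin prose over ~10 KB/s
    // of usable 2G throughput.
    enum chars = 20_000;
    enum bytesPerSec = 10_000.0;

    writefln("UTF-8 (2 B/char):     %.1f s", chars * 2 / bytesPerSec);
    writefln("code page (1 B/char): %.1f s", chars * 1 / bytesPerSec);
}

A couple of seconds per page of text, which matters today and 
won't tomorrow.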

I fully acknowledge it's a couple billion people and that's 
nothing to sneeze at, but I also see that it's a situation that 
will become less relevant over time.

-Wyatt

