[OT] Effect of UTF-8 on 2G connections

Joakim via Digitalmars-d digitalmars-d at puremagic.com
Wed Jun 1 23:46:39 PDT 2016


On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:
> On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
>> On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
>>> It's not hard.  I think a lot of us remember when a 14.4 
>>> modem was cutting-edge.
>>
>> Well, then apparently you're unaware of how bloated web pages 
>> are nowadays.  It used to take me minutes to download popular 
>> web pages _back then_ at _top speed_, and those pages were a 
>> _lot_ smaller.
>
> It's telling that you think the encoding of the text is 
> anything but the tiniest fraction of the problem.  You should 
> look at where the actual weight of a "modern" web page comes 
> from.

I'm well aware that text is a small part of it.  My point is that 
they're not downloading those web pages, they're using mobile 
instead, as I explicitly said in a prior post.  My only point in 
mentioning the web bloat to you is that _your perception_ is off 
because you seem to think they're downloading _current_ web pages 
over 2G connections, and comparing it to your downloads of _past_ 
web pages with modems.  Not only did it take minutes for us back 
then, it takes _even longer_ now.

I know the text encoding won't help much with that.  Where it 
will help is the mobile apps they're actually using, not the 
bloated websites they don't use.

>>> Codepages and incompatible encodings were terrible then, too.
>>>
>>> Never again.
>>
>> This only shows you probably don't know the difference between 
>> an encoding and a code page,
>
> "I suggested a single-byte encoding for most languages, with 
> double-byte for the ones which wouldn't fit in a byte. Use some 
> kind of header or other metadata to combine strings of 
> different languages, _rather than encoding the language into 
> every character!_"
>
> Yeah, that?  That's codepages.  And your exact proposal to put 
> encodings in the header was ALSO tried around the time that 
> Unicode was getting hashed out.  It sucked.  A lot.  (Not as 
> bad as storing it in the directory metadata, though.)

You know what's also codepages?  Unicode.  The UCS is a 
standardized set of code pages for each language, often merely 
picking the most popular code page at that time.
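That lineage is visible in the encoding itself: the first 128 Unicode code points match ASCII, and the first 256 were laid out to match ISO-8859-1 (Latin-1), so decoding a Latin-1 byte yields a code point equal to the byte value. A quick check in Python:

```python
# The first 256 Unicode code points are identical to ISO-8859-1
# (Latin-1), one-to-one: decoding any Latin-1 byte gives back a code
# point whose value equals the byte.
for b in range(256):
    ch = bytes([b]).decode("latin-1")
    assert ord(ch) == b

print("U+0000..U+00FF is Latin-1, byte-for-byte")
```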

I don't doubt that everything I'm saying has been tried in some 
form before.  The question is whether that alternate form would 
be better if designed and implemented properly, not whether a 
botched design/implementation has ever been attempted.
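To make the header scheme concrete, here is a minimal Python sketch of what such an encoding might look like. The tag values and window bases are purely illustrative assumptions, not any existing standard:

```python
# Hypothetical sketch of the header scheme argued for above: each
# string carries a one-byte language tag, then one byte per character
# as an offset into that language's 256-code-point window.  The tag
# numbers and window bases are illustrative assumptions, not a
# standard.
WINDOWS = {
    0x00: 0x0000,  # Basic Latin / Latin-1
    0x01: 0x0400,  # Cyrillic block
    0x02: 0x0900,  # Devanagari block
}

def encode(tag: int, text: str) -> bytes:
    base = WINDOWS[tag]
    return bytes([tag]) + bytes(ord(c) - base for c in text)

def decode(data: bytes) -> str:
    base = WINDOWS[data[0]]
    return "".join(chr(base + b) for b in data[1:])

msg = "привет"
packed = encode(0x01, msg)
assert decode(packed) == msg
# 7 bytes (1 tag + 6 chars) versus 12 bytes as UTF-8:
print(len(packed), len(msg.encode("utf-8")))
```

A string mixing several languages would need some escape or segmenting mechanism on top of this, which is exactly the complexity the thread is debating.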

>>>> Well, when you _like_ a ludicrous encoding like UTF-8, not 
>>>> sure your opinion matters.
>>>
>>> It _is_ kind of ludicrous, isn't it?  But it really is the 
>>> least-bad option for the most text.  Sorry, bub.
>>
>> I think we can do a lot better.
>
> Maybe.  But no one's done it yet.

That's what people said about mobile devices for a long time, 
until about a decade ago.  It's time we got this right.

>> The vast majority of software is written for _one_ language, 
>> the local one.  You may think otherwise because the software 
>> that sells the most and makes the most money is 
>> internationalized software like Windows or iOS, because it can 
>> be resold into many markets.  But as a percentage of lines of 
>> code written, such international code is almost nothing.
>
> I'm surprised you think this even matters after talking about 
> web pages.  The browser is your most common string processing 
> situation.  Nothing else even comes close.

No, it's certainly popular software, but at the scale we're 
talking about, ie all string processing in all software, it's 
fairly small.  And the vast majority of webapps that handle 
strings passed from a browser are written to only handle one 
language, the local one.

>> largely ignoring the possibilities of the header scheme I 
>> suggested.
>
> "Possibilities" that were considered and discarded decades ago 
> by people with way better credentials.  The era of single-byte 
> encodings is gone, it won't come back, and good riddance to bad 
> rubbish.

Lol, credentials. :D If you think that matters at all in the face 
of the blatant stupidity embodied by UTF-8, I don't know what to 
tell you.

>> I could call that "trolling" by all of you, :) but I'll 
>> instead call it what it likely is, reactionary thinking, and 
>> move on.
>
> It's not trolling to call you out for clearly not doing your 
> homework.

That's funny, because it's precisely you and others who haven't 
done your homework.  So are you all trolling me?  By your 
definition of trolling, which btw is not the standard one, _you_ 
are the one doing it.

>> I don't think you understand: _you_ are the special case.
>
> Oh, I understand perfectly.  _We_ (whoever "we" are) can handle 
> any sequence of glyphs and combining characters 
> (correctly-formed or not) in any language at any time, so we're 
> the special case...?

And you're doing so by mostly using a single-byte encoding for 
_your own_ Euro-centric languages, ie ASCII, while imposing 
unnecessary double-byte and triple-byte encodings on everyone 
else, despite their outnumbering you 10 to 1.  That is the very 
definition of a special case.
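The byte counts behind that claim are easy to verify: UTF-8 stores ASCII in one byte per character, but Cyrillic takes two and Devanagari three. A quick Python comparison:

```python
# UTF-8 byte cost per character varies by script: ASCII text is 1 byte
# per code point, Cyrillic 2 bytes, Devanagari 3 bytes.
samples = {
    "English (ASCII)": "hello",
    "Russian (Cyrillic)": "привет",
    "Hindi (Devanagari)": "नमस्ते",
}
for label, text in samples.items():
    raw = text.encode("utf-8")
    print(f"{label}: {len(text)} code points -> {len(raw)} bytes")
```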

> Yeah, it sounds funny to me, too.

I'm happy to hear you find your privilege "funny," but I'm sorry 
to tell you, it won't last.

>> The 5 billion people outside the US and EU are _not the 
>> special case_.
>
> Fortunately, it works for them too.

At a higher and unnecessary cost, which is why it won't last.

>> The problem is all the rest, and those just below who cannot 
>> afford it at all, in part because the tech is not as efficient 
>> as it could be yet.  Ditching UTF-8 will be one way to make it 
>> more efficient.
>
> All right, now you've found the special case; the case where 
> the generic, unambiguous encoding may need to be lowered to 
> something else: people for whom that encoding is suboptimal 
> because of _current_ network constraints.
>
> I fully acknowledge it's a couple billion people and that's 
> nothing to sneeze at, but I also see that it's a situation that 
> will become less relevant over time.

I continue to marvel at your calling a couple billion people "the 
special case," presumably thinking ~700 million people in the US 
and EU primarily using the single-byte encoding of ASCII are the 
general case.

As for the continued relevance of such constrained use, I suggest 
you read the link Marco provided above.  The vast majority of the 
worldwide literate population doesn't have a smartphone or use a 
cellular data plan, whereas the opposite is true if you include 
featurephones, largely because those can be used only for voice.  
As that article notes, costs for smartphones and 2G data plans 
will have to come down for them to spread more widely.  That will 
take decades to roll out, though the basic tech design is mostly 
settled now.

The costs will go down by making the tech more efficient, and 
ditching UTF-8 will be one of the ways the tech will be made more 
efficient.
