Update #1 on new std.uni

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Jan 16 12:16:46 PST 2013


16-Jan-2013 23:35, Walter Bright wrote:
> On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
>> I've spent some hours to get an easy, useful and correct (as far as it
>> gets)
>> terminology throughout the module.
>
> Thank you. Looking at the Terminology section (the reference to it at
> the beginning should be a hyperlink):
>
> "Not all code points are assigned to encoded characters.":
>
>      ?? I thought that was the whole point?
>

Obviously it's not. The simple truth is that only 110K+ code points are
currently assigned; everything else is either reserved for specific use
(internal, surrogates etc.) or for future additions.
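
A minimal sketch of the distinction above, assuming Python's standard
unicodedata module (the figures it reports depend on the Unicode version
bundled with the interpreter, so no exact count is claimed here). The
helper names classify and count_assigned are hypothetical, chosen only
for this illustration:

```python
import unicodedata

def classify(cp):
    """Return a coarse status for a code point, based on General_Category."""
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    if category == "Cn":
        # Covers both noncharacters and code points reserved for the future.
        return "unassigned or noncharacter"
    return "assigned (" + category + ")"

def count_assigned():
    """Count code points that are neither Cn, Cs nor Co in this database."""
    return sum(
        1
        for cp in range(0x110000)
        if unicodedata.category(chr(cp)) not in ("Cn", "Cs", "Co")
    )

if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xE000, 0xFFFF):
        print("U+%04X: %s" % (cp, classify(cp)))
    print("assigned:", count_assigned(), "of", 0x110000, "code points")
```

The codespace has 0x110000 values; only a fraction of them come back as
assigned, which is the sense in which not every code point is an encoded
character.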

Think of it this way: previously they thought a unit of symbolic info
was 16 bits. Then a lot of problems cropped up (including adapting 8-bit
only/Latin-1/ASCII systems). Now they treat an encoding as a form of data
storage, and everything in the Unicode standard is defined as operating
on _values_ of the codespace (that is, code points).

This doesn't mean that all these values are real characters.

> "Note that UTF-32 code unit (dchar) holds the actual code point value."
> => "Note that in UTF-32, a code unit is a code point and is represented
> by the D dchar type."

Yeah, that's simpler.
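
To make the code unit vs. code point distinction concrete, here is a
small sketch in Python (used purely for illustration; the helper name
code_units and the sample string are assumptions of this example). It
counts code units per encoding and checks that each UTF-32 code unit is
numerically equal to the code point it encodes:

```python
import struct

# 'A', U+00E9 (e with acute) and U+1F600, which lies outside the BMP.
text = "A\u00e9\U0001F600"

def code_units(s, encoding, unit_size):
    """Number of code units of unit_size bytes needed to encode s."""
    return len(s.encode(encoding)) // unit_size

utf8_units = code_units(text, "utf-8", 1)
utf16_units = code_units(text, "utf-16-le", 2)
utf32_units = code_units(text, "utf-32-le", 4)

# Read the UTF-32 bytes back as plain 32-bit little-endian integers.
raw = text.encode("utf-32-le")
utf32_values = list(struct.unpack("<%dI" % (len(raw) // 4), raw))
code_points = [ord(ch) for ch in text]

if __name__ == "__main__":
    print("code points :", len(code_points))
    print("UTF-8 units :", utf8_units)
    print("UTF-16 units:", utf16_units)
    print("UTF-32 units:", utf32_units)
    print("UTF-32 units equal code points:", utf32_values == code_points)
```

Only in UTF-32 does the number of code units match the number of code
points for every string; UTF-8 and UTF-16 need several units for some
code points.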

>
> What happened to "octet", which I thought was the official term?

Octet means 8 bits as a unit of data in just about any Internet protocol
description. What do you want it here for?

>
> "Also known as simply character."
>
>      No, please no, at least not in this document. I suspect you need to
> ban the word "character" from this page. It is so overloaded in meaning
> that it is useless.
>

It's not me. It's just the way things are, and people have to be aware
of this particular meaning of character.

Trust me, I first converted the whole thing to code point(s) (note
the space). Then I read it and it was like "meh" even to me, who wrote
the stuff in the first place. Then I looked at all the documents by the
Unicode Consortium: they use 'character' throughout to mean either
encoded character or abstract character depending on the time of day.

Utility beats pedantry, and thus 'character' is 'encoded character'
everywhere it's used. I'll probably remove this statement so that the
uninitiated don't fixate on it too much.

> "An abstract character does not necessarily correspond to what a user
> thinks of as a “character” and should not be confused with a Grapheme."
>
>      This just makes me cry. Who knows what a user thinks of as a
> character? "not necessarily" means what? Is "Grapheme" a Unicode term?
>

Yes and yes. The user thinks a character is what he usually writes on a
piece of paper, obviously. It's a warning that what a typical user thinks
of as a character doesn't always match a character in Unicode.

Grapheme is a Unicode term that has a direct counterpart in this library,
hence the hyperlink. A grapheme is _visually_ a single entity. In fact
it could be a sequence consisting of a base character (the Unicode
standard says 'character') + a sequence of combining marks. This is
called a combining character sequence, but there are other kinds of
graphemes; see e.g. Hangul syllables and more weird stuff.
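
A rough sketch of the base + combining marks case only, in Python and
under loud assumptions: the function combining_sequences is hypothetical,
it groups marks by a nonzero canonical combining class, and it is not the
full grapheme segmentation from the Unicode standard (it knows nothing
about Hangul syllables or the other kinds mentioned above), nor is it how
the new std.uni does it:

```python
import unicodedata

def combining_sequences(s):
    """Group each base code point with the combining marks that follow it.

    Simplification: a code point counts as a combining mark here only if
    its canonical combining class is nonzero.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new sequence
    return clusters

# 'e' + acute, 'a' + diaeresis + dot below, then a plain 'x'.
text = "e\u0301a\u0308\u0323x"
clusters = combining_sequences(text)

if __name__ == "__main__":
    print("code points:", len(text))
    print("sequences  :", len(clusters))
    for c in clusters:
        print(["U+%04X" % ord(ch) for ch in c])
```

Six code points come out as three visually distinct units, which is the
gap between 'character' in the Unicode sense and what a user would call
a character.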

BTW the text is taken as is from the Unicode standard definitions.
I can litter a couple more pages with this crap, but instead I've taken
the most important ones and merged a couple of definitions to make it 
simpler (e.g. see glyph).

> Why can't there be precise definitions of these terms?

'cause they are muddy ;)
'Abstract character' is one of the ugliest actually.

> I wonder if even
> the Unicode standard people have no idea exactly what they are.
>
> Sorry for the rant, but unicode terms always make me mad.

Nothing I can cure here. And they know what they're doing but have a
boatload of legacy crap that has to crawl on. Just recall the time when
everybody thought that 16-bit codes should be enough for everybody. When
they gradually took the whole world of writing systems into account,
things got dirty, awfully so.

After learning a lot about Unicode I'd say they managed to do the job 
just fine given the constraints. I, for one, wouldn't even dare to try 
to reinvent *these* wheels. A lot of stuff could have been smoothed and 
simplified but overall - it's complex because the writing systems are 
utterly irregular plus the backwards compatibility has taken its toll.

-- 
Dmitry Olshansky

