Unicode's proper level of abstraction? [was: Re: VLERange:...]

spir denis.spir at gmail.com
Thu Jan 13 01:49:31 PST 2011


On 01/13/2011 01:45 AM, Michel Fortin wrote:
> On 2011-01-12 14:57:58 -0500, spir <denis.spir at gmail.com> said:
>
>> On 01/12/2011 08:28 PM, Don wrote:
>>> I think the only problem that we really have is that "char[]" and
>>> "dchar[]" imply that code points are always the appropriate level of
>>> abstraction.
>>
>> I'd like to know when it happens that codepoint is the appropriate
>> level of abstraction.
>
> I agree with you. I don't see many uses for code points.
>
> One of these uses is writing a parser for a format defined in terms of
> code points (XML, for instance). But beyond that, I don't see one.

Actually, I once had a real use case where codepoint was the proper
level of abstraction: a linguistic app, one function of which counts
occurrences of "scripting marks" like 'a' & '¨' in "ä". I hope you see
what I mean.
Once the text is properly NFD-decomposed, each of those marks is coded
as a codepoint. (But if it is not decomposed, then most of those marks
are probably hidden by precomposed codes representing characters like
"ä".) So even such an app benefits from a higher-level type basically
operating on normalised (NFD) characters.
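Such a mark-counting function is easy to sketch outside D. In Python, for
instance, the standard unicodedata module exposes both NFD decomposition and
the combining class of a code point (the function name here is mine, for
illustration only):

```python
import unicodedata

def count_marks(text):
    # Decompose to NFD so every combining mark becomes its own code point.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks, 0 otherwise.
    return sum(1 for cp in decomposed if unicodedata.combining(cp))

print(count_marks("ä"))    # 1 -- the diaeresis, even though "ä" is one code point
print(count_marks("abc"))  # 0
```

Note that without the NFD step, the precomposed "ä" (U+00E4) would hide its
diaeresis and the count would be 0.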

>> * If pieces of text are not manipulated, meaning just used in the
>> application, or just transferred via the application as is (from file
>> / input / literal to any kind of output), then any kind of encoding
>> just works. One can even concatenate, provided all pieces use the same
>> encoding. --> _lower_ level than codepoint is OK.
>> * But any manipulation (indexing, slicing, compare, search, count,
>> replace, not to speak about regex/parsing) requires operating at the
>> _higher_ level of characters (in the common sense). Just like with
>> historic character sets in which codes used to represent characters
>> (not lower-level thingies as in UCS). Else, one reads, compares,
>> changes meaningless bits of text.
>
> Very true. In the same way that code points can span multiple code
> units, user-perceived characters (graphemes) can span multiple code
> points.
>
> A funny exercise to make a fool of an algorithm working only with code
> points would be to replace the word "fortune" in a text containing the
> word "fortuné". If the last "é" is expressed as two code points, as "e"
> followed by a combining acute accent (this: é), replacing occurrences of
> "fortune" by "expose" would also replace "fortuné" with "exposé" because
> the combining acute accent remains as the code point following the word.
> Quite amusing, but it doesn't really make sense that it works like that.
>
> In the case of "é", we're lucky enough to also have a pre-combined
> character to encode it as a single code point, so encountering "é"
> written as two code points is quite rare. But not all combinations of
> marks and characters can be represented as a single code point. The
> correct thing to do is to treat "é" (single code point) and "é" ("e" +
> combining acute accent) as equivalent.
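That "fortuné" pitfall is easy to reproduce. A minimal sketch in Python
(the grapheme-aware guard `replace_word` is my own illustration, not part of
any library): naive code-point replacement drags the stray accent onto the new
word, while refusing matches followed by a combining mark avoids splitting the
grapheme.

```python
import unicodedata

# "fortuné" in decomposed form: "fortune" + U+0301 combining acute accent
text = "his fortune\u0301 grew"

# Naive code-point-level replacement: the accent stays behind and
# attaches to the replacement, yielding "his exposé grew".
assert text.replace("fortune", "expose") == "his expose\u0301 grew"

def replace_word(s, old, new):
    """Replace occurrences of `old`, but skip matches that would split a
    grapheme, i.e. matches immediately followed by a combining mark."""
    out, i = [], 0
    while True:
        j = s.find(old, i)
        if j < 0:
            out.append(s[i:])
            return "".join(out)
        end = j + len(old)
        if end < len(s) and unicodedata.combining(s[end]):
            out.append(s[i:end])      # partial grapheme: leave untouched
        else:
            out.append(s[i:j] + new)  # whole grapheme(s): safe to replace
        i = end

# "fortuné" survives; a plain "fortune" is still replaced.
assert replace_word(text, "fortune", "expose") == text
assert replace_word("a fortune teller", "fortune", "expose") == "a expose teller"
```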

You'll find another example in the introduction of the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction

About your last remark: this is precisely one of the two abstractions my
Text type provides. It groups together into "piles" the codes that belong
to the same "true" character (grapheme), like "é", so that the resulting
text representation is a sequence of "piles", each representing one
character. Consequence: indexing, slicing, etc. work sensibly (and even
other operations are faster, for they do not need to perform that
"piling" again & again).
In addition to that, the string is first NFD-normalised, thus each
character has one and only one representation. Consequence: search,
count, replace, etc., and compare (*) work as expected. In your case:
     // 2 forms of "é"
     assert(Text("\u00E9") == Text("\u0065\u0301"));
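The same equivalence can be demonstrated with Python's unicodedata module,
which implements the same normalisation forms: the two encodings of "é"
differ as raw code-point sequences but compare equal once both are brought to
NFD (or, symmetrically, NFC).

```python
import unicodedata

precomposed = "\u00E9"   # "é" as a single code point
decomposed  = "e\u0301"  # "e" + combining acute accent

assert precomposed != decomposed  # raw code-point sequences differ

nfd = lambda s: unicodedata.normalize("NFD", s)
assert nfd(precomposed) == nfd(decomposed)  # equal after normalisation
```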

Denis

(*) According to UCS coding, not language-specific idiosyncrasies.
More generally, Text abstracts away lower-level issues _introduced_ by 
UCS, Unicode's character set. It does not cope with script-, language-, 
culture-, domain-, or app-specific needs such as custom text-sorting 
rules. Some base routines for such operations are provided by Text's 
brother lib DUnicode (access to some code properties, safe concat, 
casefolded compare, NF* normalisation).
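A casefolded compare of that kind can be sketched in Python (the helper name
is mine; Unicode defines canonical caseless matching roughly as
NFD(toCasefold(NFD(X))), which str.casefold and unicodedata.normalize
approximate):

```python
import unicodedata

def canonical_caseless_equal(a, b):
    # Canonical caseless matching, roughly: NFD(casefold(NFD(x))).
    nfd = lambda s: unicodedata.normalize("NFD", s)
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

# "É" (precomposed) matches decomposed lowercase "é".
assert canonical_caseless_equal("\u00C9", "e\u0301")
assert not canonical_caseless_equal("e", "f")
```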
_________________
vita es estrany
spir.wikidot.com



More information about the Digitalmars-d mailing list