std.algorithm.remove and principle of least astonishment

Mon Nov 22 05:57:39 PST 2010

On Mon, 22 Nov 2010 07:34:15 -0500
Michel Fortin <michel.fortin at michelf.com> wrote:

> Just to add to the complexity: graphemes aren't always equivalent to 
> user-perceived characters either. Ligatures can contain more than one 
> user-perceived character. If you're looking for the substring 
> "flourish" in a string, should it fail to match when it encounters 
> "ﬂourish" just because of the "ﬂ" (fl) ligature? On most Mac 
> applications it matches both thanks to sensible defaults in NSString's 
> search and comparison algorithms.

That's true. I guess you're thinking of the distinction between the NFD/NFC "canonical" normalization forms and the NFKD/NFKC "compatibility" ones.
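
To make that distinction concrete, here is a minimal sketch in Python (chosen only because its standard `unicodedata` module exposes all four forms; it says nothing about how D or NSString do it). The helper `contains` and the variable names are made up for illustration.

```python
import unicodedata

def contains(haystack, needle, form):
    """Substring test after normalizing both strings to the given form."""
    normalized_haystack = unicodedata.normalize(form, haystack)
    normalized_needle = unicodedata.normalize(form, needle)
    return normalized_needle in normalized_haystack

ligature_word = "\ufb02ourish"  # begins with U+FB02 LATIN SMALL LIGATURE FL
plain_word = "flourish"

# Canonical forms leave the ligature alone, so the search fails.
canonical_match = contains(ligature_word, plain_word, "NFC")

# Compatibility forms replace the ligature with "f" + "l", so the search succeeds.
compat_match = contains(ligature_word, plain_word, "NFKC")

# Canonical equivalence: precomposed "é" versus "e" + combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"
canonically_equal = (unicodedata.normalize("NFC", precomposed)
                     == unicodedata.normalize("NFC", decomposed))

print(canonical_match, compat_match, canonically_equal)  # False True True
```

So matching "flourish" against the ligature spelling requires one of the compatibility forms; the canonical forms only unify different encodings of the same character.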

> So perhaps we need yet another layer over graphemes to represent 
> user-perceived characters.

In my view, this is not the responsibility of a general-purpose tool. I may be wrong, but I think we are entering the field of application logic and semantics here. For me these are _not_ general-purpose concerns (although builtin types & libraries often offer clearly non-general routines, such as ones dealing with casing or, even less general, the set of ASCII letters). These issues would have to be dealt with either by apps or by domain-specific libraries.
I find it wrong that Unicode even provides standard normal forms for them (but fortunately common libs do not implement them, AFAIK).

denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com