The Case Against Autodecode
Marco Leise via Digitalmars-d
digitalmars-d at puremagic.com
Mon May 30 09:26:47 PDT 2016
On Thu, 26 May 2016 16:23:16 -0700,
"H. S. Teoh via Digitalmars-d"
<digitalmars-d at puremagic.com> wrote:
> On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> [...]
> > s.walkLength
> > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > s.count!(c => c >= 32) // non-control characters
>
> Question: what should count return, given a string containing (1)
> combining diacritics, (2) Korean text, or (3) zero-width spaces?
>
>
> > Currently the standard library operates at code point level even
> > though inside it may choose to use code units when admissible. Leaving
> > such a decision to the library seems like a wise thing to do.
>
> The problem is that such decisions can often only be made by the user,
> because they depend on what the user wants to accomplish. What should
> count return, given some Unicode string? If the user wants to determine
> the size of a buffer (e.g., to store the string with some characters
> stripped out), then count should return the byte count. If the user
> wants to count the number of matching visual characters, then count
> should return the number of graphemes. If the user wants to determine
> the visual width of the (filtered) string, then count should not be
> used at all; that calls for a font-metrics algorithm. (I can't think
> of a practical use case where you'd actually need to count code
> points(!).)
Hey, I was about to give exactly the same answer. It reminds me
that a few years ago I proposed making string iteration explicit
by code unit, code point, or grapheme in "Rust", and there was
virtually no debate about doing it: to write correct code, people
need to understand a bit of Unicode and pick the right primitive,
and if you don't know which one to pick, you look it up.
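To make the difference concrete, here is a rough D sketch (untested, and
it assumes today's Phobos names std.utf.byCodeUnit, std.uni.byGrapheme
and std.range.walkLength) of what each primitive reports for the
combining-diacritic and zero-width-space cases asked about above:

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3); // code units: 3 UTF-8 bytes
    assert(s.walkLength == 2);            // code points: today's autodecoded default
    assert(s.byGrapheme.walkLength == 1); // graphemes: what the user perceives

    // A zero-width space is still a grapheme of its own, even though it
    // has no visual width; visual width needs font metrics, not counting.
    string z = "a\u200Bb";
    assert(z.byGrapheme.walkLength == 3);
}

Korean text behaves the same way: a precomposed syllable is one code
point, while the same syllable written as conjoining jamo is several
code points but still a single grapheme.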
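And applied to Andrei's non-punctuation example, the three levels give
three different answers for the same filtered string; only one of them
is the buffer size, and only one is what the user sees. Again a rough
sketch under the same assumptions:

import std.algorithm : canFind, count, filter;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "naïve" spelled with 'i' + U+0308 COMBINING DIAERESIS, so the
    // levels disagree: 16 code units, 15 code points, 14 graphemes.
    string s = "nai\u0308ve, really?";
    enum punct = "!()-;:,.?";

    // Bytes needed for a buffer holding the string with punctuation stripped.
    assert(s.byCodeUnit.count!(c => !punct.canFind(c)) == 14);

    // What the snippet counts today: non-punctuation code points.
    assert(s.count!(c => !punct.canFind(c)) == 13);

    // User-perceived characters left after the same filter.
    assert(s.byGrapheme
            .filter!(g => !(g.length == 1 && punct.canFind(g[0])))
            .walkLength == 12);
}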
--
Marco