Creeping Bloat in Phobos

John Colvin via Digitalmars-d digitalmars-d at puremagic.com
Sun Sep 28 10:03:29 PDT 2014


On Sunday, 28 September 2014 at 14:38:57 UTC, H. S. Teoh via 
Digitalmars-d wrote:
> On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via 
> Digitalmars-d wrote:
>> On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei 
>> Alexandrescu wrote:
>> >On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>> >>If we can get Andrei on board, I'm all for killing off 
>> >>autodecoding.
>> >
>> >That's rather vague; it's unclear what would replace it. -- 
>> >Andrei
>> 
>> I believe that removing autodecoding will make things even 
>> worse. As far as I understand, if we remove it from the front() 
>> function that operates on narrow strings, then it will return 
>> just a byte of a char. I believe that processing a narrow 
>> string by `user perceived chars` (graphemes) is the more common 
>> use case.
> [...]
>
> Unfortunately, this is not what autodecoding does today. Today's 
> autodecoding only segments strings into code *points*, which are 
> not the same thing as graphemes. For example, combining 
> diacritics are normally not considered separate characters from 
> the user's POV, but they *are* separate code points from their 
> base character. The only reason today's autodecoding is even 
> remotely considered "correct" from an intuitive POV is that most 
> Western character sets happen to use only precomposed characters 
> rather than combining diacritic sequences. If you were 
> processing, say, Korean text, the present autodecoding .front 
> would *not* give you what you might imagine is a "single 
> character"; it would give you only halves of Korean graphemes. 
> From the user's POV, this suffers from the same issues as 
> dealing with individual bytes in a UTF-8 stream -- any mistake 
> on the program's part in handling these half-units will 
> "corrupt" the text (not corruption in the same sense as an 
> improperly segmented UTF-8 byte stream, but in the sense that 
> the wrong glyphs will be displayed on the screen -- from the 
> user's POV these two are basically the same thing).
>
> You might then be tempted to say: well, let's make .front return 
> graphemes instead. That would solve the "single intuitive 
> character" issue, but the performance would be FAR worse than 
> what it is today.
>
> So basically, what we have today is neither efficient nor 
> complete, but a halfway solution that mostly works for Western 
> character sets and is incomplete for others. We're paying in 
> efficiency for only a partial benefit. Is it worth the cost?
>
> I think the correct solution is not for Phobos to decide for the 
> application at what level of abstraction a string ought to be 
> processed. Rather, let the user decide. If they're just dealing 
> with opaque blocks of text, decoding or segmenting by grapheme 
> is completely unnecessary -- they should operate on byte ranges 
> as opaque data, using byCodeUnit. If they need to work with 
> Unicode code points, let them use byCodePoint. If they need to 
> work with individual user-perceived characters (i.e., 
> graphemes), let them use byGrapheme.
>
> This is why I proposed the deprecation path of making it illegal 
> to pass raw strings to Phobos algorithms -- the caller should 
> specify what level of abstraction they want to work with: 
> byCodeUnit, byCodePoint, or byGrapheme. The standard library's 
> job is to empower the D programmer by giving him the choice, not 
> to shove a predetermined solution down his throat.
>
>
> T

I totally agree with all of that.

It's one of those cases where correct by default (that would have 
to be graphemes) is far too slow, but fast by default is far too 
broken. Better to force an explicit choice.

There is no magic bullet for Unicode in a systems language such 
as D. The programmer must be aware of it and make choices about 
how to treat it.
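
As a concrete sketch of what that explicit choice looks like 
(assuming current Phobos, where byCodeUnit lives in std.utf and 
byGrapheme in std.uni; the code-unit count assumes a UTF-8 
string):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "noël" spelled with 'e' followed by combining diaeresis U+0308
    string s = "noe\u0308l";

    assert(s.byCodeUnit.walkLength == 6); // UTF-8 code units (bytes)
    assert(s.walkLength == 5);            // code points -- today's autodecoding
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```

Each call names its level of abstraction, so the cost of decoding 
or grapheme segmentation is paid only where the caller actually 
asked for it.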
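
For instance, a minimal sketch (against current Phobos; the 
combining sequence is illustrative) of where an unaware program 
goes wrong: autodecoding's .front stops at a code point boundary 
in the middle of a single user-perceived character:

```d
import std.range.primitives : front;
import std.uni : byGrapheme;

void main()
{
    // 'e' plus combining acute accent: one user-perceived character
    string s = "e\u0301";

    assert(s.front == 'e');                 // autodecoding yields only the base code point
    assert(s.byGrapheme.front.length == 2); // the grapheme spans two code points
}
```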

