Fix Phobos dependencies on autodecoding

Argolis argolis at gmail.com
Thu Aug 15 11:02:54 UTC 2019


On Wednesday, 14 August 2019 at 17:12:00 UTC, H. S. Teoh wrote:

> - Taking substrings: does not need grapheme segmentation; you 
> just slice the string.

What is the use case for slicing a grapheme that is encoded as 
multiple code units in the middle?
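
To make the question concrete, here is a minimal sketch of what 
plain slicing does to a combining sequence. The helper name 
isValidUtf8 and the sample string are only illustrative 
assumptions; std.utf.validate is the Phobos call doing the work.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS: "noël" as 6 code units.
    string s = "noe\u0308l";
    assert(s.length == 6);

    // Slicing between 'e' and the combining mark gives valid UTF-8,
    // but the grapheme has been split: the diaeresis is gone.
    assert(s[0 .. 3] == "noe");
    assert(isValidUtf8(s[0 .. 3]));

    // Slicing inside the two-code-unit encoding of U+0308 gives
    // a string that is not even valid UTF-8.
    assert(!isValidUtf8(s[0 .. 4]));
}
```

Both slices compile and run without complaint; only the caller 
knows whether the indices fall on a meaningful boundary.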

> - Copying one string to another: does not need grapheme 
> segmentation, - you just use memcpy (or equivalent).
> - Concatenating n strings: does not need grapheme segmentation, 
> you just use memcpy (or equivalent).  In D, you just use array 
> append,  or  std.array.appender if you get fancy.

That use case is not string processing, but general memory 
handling of an opaque type.

> - Comparing one string to another: does not need grapheme 
> segmentation; you either use strcmp/memcmp

That use case is not string processing, but general memory 
comparison of an opaque type.

>, or if you need more delicate semantics,
> call one of the standard Unicode string collation algorithms 
> (std.uni, meaning, your code does not need to worry about 
> grapheme segmentation, and besides, Unicode collation 
> algorithms operate at the code point  level, not at the 
> grapheme level).

So this use case needs proper handling of the encoded code 
units, and cannot be satisfied simply by removing autodecoding.

> - Matching a substring: does not need grapheme segmentation; 
> most applications just need subarray matching, i.e., treat the 
> substring as an opaque blob of bytes, and match it against the 
> target.

That use case is not string processing, but general memory 
comparison of an opaque type.

> If  you need more delicate semantics, there are standard 
> Unicode  algorithms for
> substring matching (i.e., user code does not need to worry 
> about the low-level details -- the inputs are basically opaque 
> Unicode strings whose internal structure is unimportant).

Again, removing auto decoding does not change anything for that.

> You really only need grapheme segmentation when:
> - Implementing a text layout algorithm where you need to render 
> glyphs
> to some canvas.
> - Measuring the size of some piece of text for output alignment
>   purposes: in this case, grapheme segmentation isn't enough; 
> you need font size information and other such details (like 
> kerning, spacing parameters, etc.).

What about all the examples earlier in the thread showing how 
autodecoding gives wrong results right now?

retro, correct substring slicing, correct indexing, et cetera.
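
As a reminder of the retro case, a small sketch: reversing by 
code point (what autodecoding does today) moves the combining 
mark onto the wrong letter, while reversing by grapheme keeps it 
attached. The function name reverseByGrapheme is a hypothetical 
helper; the pipeline inside it follows the byCodePoint example 
in the std.uni documentation.

```d
import std.algorithm.comparison : equal;
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse a string grapheme by grapheme.
string reverseByGrapheme(string s)
{
    return s.byGrapheme   // segment into graphemes
            .array        // retro needs a bidirectional range
            .retro        // reverse the order of the graphemes
            .byCodePoint  // flatten back to code points
            .text;        // re-encode as a UTF-8 string
}

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    // Code point reversal: U+0308 now follows 'l' instead of 'e'.
    assert(s.retro.equal("l\u0308eon"d));

    // Grapheme reversal: the diaeresis stays on the 'e'.
    assert(reverseByGrapheme(s) == "le\u0308on");
}
```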

> Ultimately, the whole point behind removing autodecoding is to 
> put the onus on the user code to decide what kind of iteration 
> it wants: code units, code points, or graphemes. (Or just use 
> one of the standard algorithms and don't reinvent the square 
> wheel.)

There will always be a default way to iterate; see below.
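
For reference, a minimal sketch of the three iteration levels 
being discussed, each selected explicitly. The three function 
names are illustrative assumptions; byCodeUnit, byDchar and 
byGrapheme are the Phobos ranges.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers: count the elements seen at each level.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t codePoints(string s) { return s.byDchar.walkLength; }
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    assert(codeUnits(s) == 6);  // UTF-8 code units
    assert(codePoints(s) == 5); // n, o, e, U+0308, l
    assert(graphemes(s) == 4);  // n, o, ë, l
}
```

The same string has three different lengths depending on the 
level chosen, so whichever one a plain foreach or range 
algorithm picks is the default in practice.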

>> Are they really SO common that the correct default is go for 
>> code points?
>
> The whole point behind removing autodecoding is so that we do 
> NOT default to code points, which is currently the default.  We 
> want to put the choice in the user's hand, not silently default 
> to iteration by code point under the illusion of correctness, 
> which is actually incorrect for non-trivial inputs.

The illusion of correctness should be turned into correctness, 
then.

>> Is it not better to have as a default the grapheme 
>> segmentation, the correct way of handling a string, instead?
>
> Grapheme segmentation is very complex, and therefore, very 
> slow.  Most string processing doesn't actually need grapheme 
> segmentation.

Can you provide an example of string processing that doesn't 
need grapheme segmentation?
The examples listed above are not string processing examples.

> Setting that as the default would mean D string processing will 
> be excruciatingly slow by default, and furthermore all that 
> extra work will be mostly for nothing because most of the time 
> we don't need it anyway.

From the examples above, most of the time you simply need opaque 
memory management, that is, decaying the string/dstring/wstring 
to a binary blob, but that is not string processing.

My (refined) point still stands: can you provide examples of 
(text processing) algorithms and use cases that don't need 
grapheme segmentation?



More information about the Digitalmars-d mailing list