dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Patrick Schluter Patrick.Schluter at bbox.fr
Sat Nov 6 08:33:07 UTC 2021


On Saturday, 6 November 2021 at 06:17:55 UTC, Alexey wrote:
> On Saturday, 6 November 2021 at 05:36:07 UTC, H. S. Teoh wrote:
>>
>> Unfortunately, codepoint != grapheme. This was the fundamental 
>> error with autodecoding that made it so bad. It costs us a 
>> performance hit but doesn't even produce the right results in 
>> return.
>>
>> And even more unfortunately, grapheme segmentation is an 
>> extremely convoluted (i.e. slow) operation that you normally 
>> would *not* want to do unless your code absolutely has to.
>>
>>
>> T
>
> ```D
> struct graphstring
> {
>     grapheme[] grapheme_elements;
> }
>
> struct grapheme
> {
>     dchar[] codepoints;
> }
>
> ```
> Would this really be _that_ slow? Also, there is no need to do 
> error checks on every action the user may perform on 
> graphstrings: no need to check on concatenation or slicing, 
> for instance. Checks are only needed on conversions from other 
> string/ubyte[] types and to those types.

This is 1 grapheme A̶͙̜͚̫̬̻ͅ


(U+0041 U+0336 U+0359 U+0345 U+031c U+035a U+032b U+032c U+033b), 
but 9 code points: 9 dchar, 9 wchar, and 17 char (0x41 0xcc 0xb6 
0xcd 0x99 0xcd 0x85 0xcc 0x9c 0xcd 0x9a 0xcc 0xab 0xcc 0xac 0xcc 0xbb).
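
For anyone who wants to check those numbers, here is a minimal sketch. It assumes Phobos' `std.uni.byGrapheme` and `std.utf.byWchar`/`byDchar`; the `Counts` struct and the `countAll` helper are names made up for this example, not anything from the proposal quoted above.

```D
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byDchar, byWchar;

// Hypothetical helper type: the size of one piece of text at each level.
struct Counts
{
    size_t chars;     // UTF-8 code units
    size_t wchars;    // UTF-16 code units
    size_t dchars;    // code points (UTF-32 code units)
    size_t graphemes; // user-perceived characters
}

// Hypothetical helper: walks the string once per encoding level.
Counts countAll(string s)
{
    return Counts(
        s.length,                // a D string is an array of UTF-8 code units
        s.byWchar.walkLength,    // re-encode lazily as UTF-16 and count
        s.byDchar.walkLength,    // decode lazily to code points and count
        s.byGrapheme.walkLength  // grapheme segmentation, the expensive step
    );
}

void main()
{
    // Same sequence as listed above, written with escapes so that it
    // survives editors and mail clients unchanged.
    string s = "A\u0336\u0359\u0345\u031c\u035a\u032b\u032c\u033b";
    writeln(countAll(s));
}
```

The four fields should come out as 17, 9, 9 and 1 for this input. A `grapheme[]` built up front would make indexing cheap afterwards, but it still has to pay for the `byGrapheme` pass once on every conversion from `string`.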

