Of possible interest: fast UTF8 validation

Fri May 18 05:31:49 UTC 2018

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
> Digitalmars-d wrote: [...]
>> [...]
>
> Yes.  Imagine if we standardized on a header-based string 
> encoding, and we wanted to implement a substring function over 
> a string that contains multiple segments of different 
> languages. Instead of a cheap slicing over the string, you'd 
> need to scan the string or otherwise keep track of which 
> segment the start/end of the substring lies in, allocate memory 
> to insert headers so that the segments are properly 
> interpreted, etc. It would be an implementational nightmare, 
> and an unavoidable performance hit (you'd have to copy data 
> every time you take a substring), and the @nogc guys would be 
> up in arms.
>
> [...]
That's essentially what RTF with code pages was. I'm happy that 
we got rid of it and that it was replaced by XML. Even if 
Microsoft's document XML is a bloated, ridiculous mess, it's 
still an order of magnitude less problematic than RTF (I mean at 
the text encoding level).
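
To make the substring cost from the quoted paragraph concrete, here is a rough Python sketch. The header-based format is purely hypothetical (a list of (codec, payload) segments restricted to single-byte code pages), and substring() and decode() are names made up for the illustration; this is not RTF's actual layout or any real library's API.

```python
# Hypothetical header-based string: a list of (codec, payload) segments.
# The codec name plays the role of the header; every payload is assumed
# to use a single-byte code page, so one byte is one character.
Segment = tuple[str, bytes]


def substring(segments: list[Segment], start: int, end: int) -> list[Segment]:
    """Return characters [start, end) as a new header-based string."""
    out: list[Segment] = []
    pos = 0  # character offset at which the current segment begins
    for codec, payload in segments:
        seg_end = pos + len(payload)
        lo, hi = max(start, pos), min(end, seg_end)
        if lo < hi:
            # Each surviving piece needs its header again, so a new
            # segment is allocated and the bytes are copied.
            out.append((codec, payload[lo - pos:hi - pos]))
        pos = seg_end
    return out


def decode(segments: list[Segment]) -> str:
    """Decode every segment with its own code page and join the results."""
    return "".join(payload.decode(codec) for codec, payload in segments)


mixed = [
    ("latin_1", "Straße ".encode("latin_1")),
    ("iso8859_7", "λογος".encode("iso8859_7")),
]
part = substring(mixed, 4, 9)  # scans the segments and builds a new list
print(decode(part))  # ße λο

# With UTF-8 the same text is one flat buffer, and a substring is just a
# pair of byte offsets into it (offsets assumed to fall on code point
# boundaries). The memoryview slice refers to the original bytes.
data = "Straße λογος".encode("utf-8")
utf8 = memoryview(data)
view = utf8[4:12]
print(bytes(view).decode("utf-8"))  # ße λο
```

The header-based version has to walk the segment list and rebuild headers for every substring, while the UTF-8 version only narrows a view over the existing buffer, which is the cheap slicing the quote refers to.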