Challenge: write a really really small front() for UTF8

w0rp devw0rp at gmail.com
Mon Mar 24 05:51:04 PDT 2014


On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
> wrote:
>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>
>> Andrei
>
> Before we roll this out, could we discuss a strategy/guideline 
> in regards to detecting and handling invalid UTF sequences?
>
> Having a fast "front" is fine and all, but if it means your 
> program asserting in release (or worse, silently corrupting 
> memory) just because the client was trying to read a bad text 
> file, I'm unsure this is acceptable.

I would strongly advise at least offering an option, possibly via 
a template parameter, for turning error handling on or off, 
similar to how Python handles decoding. Examples below in 
Python 3.

b"\255".decode("utf-8", errors="strict")   # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace")  # replacement character used
b"\255".decode("utf-8", errors="ignore")   # empty string, invalid sequence removed

All three strategies are useful from time to time. I mainly reach 
for option three when I'm trying to get some text data out of 
some old broken databases or similar.
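
For anyone who wants to try the three modes side by side, here is a small self-contained sketch. The helper name decode_with and the sample bytes are made up for illustration; only bytes.decode and its errors argument come from Python itself. Note that "\255" in a bytes literal is an octal escape for the single byte 0xAD, which is not valid UTF-8 on its own.

```python
# Sample input: valid ASCII around one invalid byte (0xAD, written in octal).
data = b"ab\255cd"


def decode_with(raw: bytes, errors: str):
    """Decode raw as UTF-8 with the given error policy.

    Returns the decoded string, or None if the 'strict' policy rejects it.
    (Hypothetical helper, only here to make the comparison easy to run.)
    """
    try:
        return raw.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None


strict_result = decode_with(data, "strict")    # rejected outright
replace_result = decode_with(data, "replace")  # bad byte becomes U+FFFD
ignore_result = decode_with(data, "ignore")    # bad byte is dropped

print(repr(strict_result), repr(replace_result), repr(ignore_result))
```

The "ignore" result keeps the surrounding text and simply loses the bad byte, which is why it is handy for salvaging data.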

We could leave the error checking on in -release for 'strict' 
decoding, but throw an Error instead of an Exception so that the 
function can still be nothrow. This would prevent memory 
corruption in release builds. Whether to assert or throw an Error 
is up for debate.
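
To make the idea of a selectable policy a bit more concrete, here is a rough sketch of a front-style decoder with an error-policy argument. It is written in Python to match the examples above, not D, and it is not the baseline from the thread: the names front, _bad and REPLACEMENT are hypothetical, a keyword argument stands in for what would be a template parameter in D, and only "strict" and "replace" are handled. On bad input it consumes a single byte, which is a simplification.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def _bad(errors: str, consumed: int):
    """Apply the chosen policy to an invalid sequence (hypothetical helper)."""
    if errors == "replace":
        return REPLACEMENT, consumed
    if errors == "strict":
        raise UnicodeError("invalid UTF-8 sequence")
    raise ValueError("unknown error policy: " + errors)


def front(data: bytes, errors: str = "strict"):
    """Decode the first code point; return (code_point, bytes_consumed)."""
    if not data:
        raise ValueError("front() on empty input")
    b0 = data[0]
    if b0 < 0x80:                      # ASCII fast path
        return b0, 1
    if 0xC2 <= b0 <= 0xDF:             # 2-byte lead (0xC0/0xC1 are overlong)
        need, cp, min_cp = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:           # 3-byte lead
        need, cp, min_cp = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:           # 4-byte lead
        need, cp, min_cp = 3, b0 & 0x07, 0x10000
    else:                              # stray continuation or invalid lead
        return _bad(errors, 1)
    tail = data[1:1 + need]
    # Every trailing byte must exist and look like 10xxxxxx.
    if len(tail) < need or any(b & 0xC0 != 0x80 for b in tail):
        return _bad(errors, 1)
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong forms, surrogates, and values past U+10FFFF.
    if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return _bad(errors, 1)
    return cp, 1 + need
```

With this shape the checks always run, and the policy only decides what happens once a check fails, so no mode reads past the end of the input.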

