Challenge: write a really really small front() for UTF8

dennis luehring dl.soluz at gmx.net
Mon Mar 24 05:53:47 PDT 2014


On 24.03.2014 13:51, w0rp wrote:
> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
>> wrote:
>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>
>>> Andrei
>>
>> Before we roll this out, could we discuss a strategy/guideline
>> in regards to detecting and handling invalid UTF sequences?
>>
>> Having a fast "front" is fine and all, but if it means your
>> program asserting in release (or worse, silently corrupting
>> memory) just because the client was trying to read a bad text
>> file, I'm unsure this is acceptable.
>
> I would strongly advise at least offering an option, possibly via
> a template parameter, for turning error handling on or off,
> similar to how Python handles decoding. Examples below in Python
> 3.
>
> b"\255".decode("utf-8", errors="strict")  # raises UnicodeDecodeError
> b"\255".decode("utf-8", errors="replace") # replacement character used
> b"\255".decode("utf-8", errors="ignore")  # empty string, invalid sequence removed
>
> All three strategies are useful from time to time. I mainly reach
> for option three when I'm trying to get some text data out of
> some old broken databases or similar.
>
> We may consider leaving the error checking on in -release for the
> 'strict' decoding, but throwing an Error instead of an Exception
> so the function can be nothrow. This would prevent memory
> corruption in release code. assert vs throw Error is up for
> debate.
>

+1
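
To make the three policies concrete, here is a rough sketch in Python 3 of
what a policy-parameterized front() could look like. It is not the D
implementation under discussion: the names front() and decode_all() and the
"skip exactly one byte on error" rule are assumptions made for illustration,
and it leans on Python's built-in decoder for validation rather than
decoding by hand.

```python
# Hypothetical sketch: front() and decode_all() are illustrative names,
# not part of Phobos or of Python's standard library.

REPLACEMENT_CHAR = "\ufffd"


def front(data: bytes, errors: str = "strict") -> tuple:
    """Decode the first code point of data.

    Returns (text, consumed): the decoded text (possibly empty) and the
    number of bytes to skip to reach the next code point.
    """
    if not data:
        raise ValueError("front() called on empty input")

    # UTF-8 is prefix-free and a sequence is at most 4 bytes long, so the
    # shortest prefix that decodes cleanly is exactly the first code point.
    for length in range(1, min(4, len(data)) + 1):
        try:
            return data[:length].decode("utf-8"), length
        except UnicodeDecodeError:
            continue

    # No prefix decoded: the input starts with an invalid sequence.
    # Assumption: every policy skips exactly one byte on error.
    if errors == "strict":
        raise UnicodeDecodeError("utf-8", data, 0, 1, "invalid UTF-8 sequence")
    if errors == "replace":
        return REPLACEMENT_CHAR, 1
    if errors == "ignore":
        return "", 1
    raise ValueError(f"unknown error policy: {errors!r}")


def decode_all(data: bytes, errors: str = "strict") -> str:
    """Decode a whole buffer by repeatedly taking front() and advancing."""
    pieces = []
    while data:
        text, consumed = front(data, errors)
        pieces.append(text)
        data = data[consumed:]
    return "".join(pieces)
```

Because the sketch skips one byte per error, a truncated multi-byte sequence
can yield more replacement characters than Python's own errors="replace"
handler would. Which resynchronization rule to use is exactly the kind of
guideline that would need to be agreed on first.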

