Challenge: write a really really small front() for UTF8

dennis luehring dl.soluz at gmx.net
Mon Mar 24 23:39:59 PDT 2014


On 24.03.2014 17:44, Andrei Alexandrescu wrote:
> On 3/24/14, 5:51 AM, w0rp wrote:
>> On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:
>>> On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:
>>>> Here's a baseline: http://goo.gl/91vIGc. Destroy!
>>>>
>>>> Andrei
>>>
>>> Before we roll this out, could we discuss a strategy/guideline in
>>> regards to detecting and handling invalid UTF sequences?
>>>
>>> Having a fast "front" is fine and all, but if it means your program
>>> asserting in release (or worse, silently corrupting memory) just
>>> because the client was trying to read a bad text file, I'm unsure this
>>> is acceptable.
>>
>> I would strongly advise to at least offer an option
>
> Options are fine for functions etc. But front would need to find an
> all-around good compromise between speed and correctness.
>
> Andrei
>

b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid sequence removed.

I think there should be a base range for UTF-8 iteration with
policy-based error handling (like in Python), plus some variants that
derive from this base range with different error behavior. One of these
becomes the Phobos standard, i.e. the default parameter, so it is still
switchable.
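
To make the idea concrete, here is a minimal sketch in Python (not D, and not a proposal for the actual Phobos API). The names OnError and utf8_chars are hypothetical, and choosing "replace" as the default policy is only an assumption for illustration. Resynchronization is deliberately simple: after a bad sequence it skips one byte, so the number of replacement characters can differ from Python's own decoder on truncated multi-byte sequences.

```python
from enum import Enum
from typing import Iterator


class OnError(Enum):
    """Hypothetical error policies, mirroring Python's errors= argument."""
    STRICT = "strict"    # raise on invalid input
    REPLACE = "replace"  # yield U+FFFD for each invalid byte
    IGNORE = "ignore"    # silently drop invalid bytes


def _sequence_length(lead: int) -> int:
    """Expected sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or otherwise invalid lead byte


def utf8_chars(data: bytes, on_error: OnError = OnError.REPLACE) -> Iterator[str]:
    """Iterate code points of UTF-8 data; the default policy is an assumption."""
    i = 0
    while i < len(data):
        n = _sequence_length(data[i])
        ch = None
        if n:
            try:
                # The strict decoder also rejects truncated, overlong
                # and surrogate encodings for us.
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is not None:
            yield ch
            i += n
            continue
        # Invalid sequence: apply the selected policy.
        if on_error is OnError.STRICT:
            raise ValueError(f"invalid UTF-8 sequence at byte offset {i}")
        if on_error is OnError.REPLACE:
            yield "\ufffd"
        i += 1  # skip one byte and try to resynchronize
```

Callers who do nothing get the default behavior, while "".join(utf8_chars(data, OnError.STRICT)) opts into raising on bad input. This corresponds to the "default parameter, still switchable" point above.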



