[Issue 12113] New: A nothrow std.utf.decode with substitution on bad encoding

Sat Feb 8 14:26:48 PST 2014

https://d.puremagic.com/issues/show_bug.cgi?id=12113

           Summary: A nothrow std.utf.decode with substitution on bad
                    encoding
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody at puremagic.com
        ReportedBy: dmitry.olsh at gmail.com

--- Comment #0 from Dmitry Olshansky <dmitry.olsh at gmail.com> 2014-02-08 14:26:46 PST ---
Change the behaviour of decode to follow the Unicode standard's recommendation:
substitute U+FFFD for badly encoded input instead of throwing. As a bonus,
dealing with partly broken encoding in text becomes palatable, which is hardly
possible with the current behaviour.
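
For illustration only, here is a minimal sketch of what a caller faces today.
It relies on std.utf.decode throwing UTFException on malformed input; the
helper name countCodePoints is hypothetical and not part of Phobos.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: counts code points in `s`, giving up at the
/// first malformed sequence. Returns true if all of `s` decoded cleanly;
/// otherwise `count` holds the number of code points decoded before the error.
bool countCodePoints(const(char)[] s, out size_t count)
{
    size_t i = 0;
    try
    {
        while (i < s.length)
        {
            decode(s, i);   // throws UTFException on malformed input
            ++count;
        }
        return true;
    }
    catch (UTFException e)
    {
        // The caller now has to decide how far to skip ahead before
        // resuming, which is the awkward part for partly broken text.
        return false;
    }
}
```

Because of the throw, such a function cannot be marked nothrow without
wrapping every call, and recovery policy is left to each caller.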

The relevant section of the standard:

5.22 Best Practice for U+FFFD Substitution

When converting text from one character encoding to another, a conversion
algorithm may encounter unconvertible code units. This is most commonly caused
by some sort of corruption of the source data, so that it does not correctly
follow the specification for that character encoding. Examples include dropping
a byte in a multibyte encoding such as Shift-JIS, improper concatenation of
strings, a mismatch between an encoding declaration and actual encoding of
text, use of non-shortest form for UTF-8, and so on.

...

Whenever an unconvertible offset is reached during conversion of a code unit
sequence:
1. The maximal subpart at that offset should be replaced by a single U+FFFD.
2. The conversion should proceed at the offset immediately after the maximal
   subpart.
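
The two steps above could be sketched for UTF-8 roughly as follows. This is
an assumption-laden illustration, not a proposed Phobos implementation: the
names replacementChar, decodeLossy and decodeAllLossy are hypothetical, and
"maximal subpart" is taken to mean the longest prefix of a well-formed UTF-8
sequence starting at the offset, or a single code unit if there is none.

```d
/// U+FFFD REPLACEMENT CHARACTER (hypothetical name for this sketch).
enum dchar replacementChar = '\uFFFD';

/// Decodes one code point from `s` starting at `i` and advances `i`.
/// On malformed input, returns U+FFFD and leaves `i` just past the
/// maximal subpart, so the next call resumes at the right offset.
dchar decodeLossy(const(char)[] s, ref size_t i) pure nothrow @nogc @safe
{
    assert(i < s.length, "index past the end of input");
    immutable uint b0 = s[i++];
    if (b0 < 0x80)
        return cast(dchar) b0;          // ASCII fast path

    size_t need;                        // continuation bytes still expected
    uint lo = 0x80, hi = 0xBF;          // allowed range for the next byte
    uint c;                             // code point accumulated so far

    if (b0 >= 0xC2 && b0 <= 0xDF)
    {
        need = 1; c = b0 & 0x1F;
    }
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        need = 2; c = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;      // rejects non-shortest forms
        else if (b0 == 0xED) hi = 0x9F; // rejects surrogates
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        need = 3; c = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;      // rejects non-shortest forms
        else if (b0 == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
    }
    else
        return replacementChar;         // stray continuation, C0/C1, F5..FF

    while (need > 0)
    {
        if (i >= s.length)
            return replacementChar;     // input ended mid-sequence
        immutable uint b = s[i];
        if (b < lo || b > hi)
            return replacementChar;     // step 2: do not consume `b`
        c = (c << 6) | (b & 0x3F);
        ++i;
        --need;
        lo = 0x80; hi = 0xBF;           // only the second byte is restricted
    }
    return cast(dchar) c;
}

/// Decodes all of `s`, substituting U+FFFD for each maximal subpart.
dstring decodeAllLossy(const(char)[] s) pure nothrow @safe
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
        result ~= decodeLossy(s, i);
    return result;
}
```

With this sketch, the bytes 61 E2 82 62 (an "a", a euro sign missing its
last byte, then a "b") decode to U+0061, U+FFFD, U+0062: the truncated
two-byte prefix is replaced by one U+FFFD and decoding resumes at the "b".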
---
