[Issue 3465] New: isIdeographic can be wrong in std.xml

d-bugmail at puremagic.com d-bugmail at puremagic.com
Sun Nov 1 21:51:26 PST 2009


http://d.puremagic.com/issues/show_bug.cgi?id=3465

           Summary: isIdeographic can be wrong in std.xml
           Product: D
           Version: 2.035
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Phobos
        AssignedTo: nobody at puremagic.com
        ReportedBy: y0uf00bar at gmail.com


--- Comment #0 from hed010gy <y0uf00bar at gmail.com> 2009-11-01 21:51:25 PST ---
The std.xml function isIdeographic caused my parser to fail one of the XML
conformance tests, for the character 0x4E00.

// As implemented in the XML Piece Parser Project, http://source.miryn.org/,
// but the code was taken from std.xml.

// WRONG in std.xml:
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

// RIGHT, because the lookup function needs
// the range pairs in the table to be in ascending order!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];
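
To illustrate why the order matters, here is a minimal sketch of a binary search over range pairs. It assumes the lookup works roughly this way; the names lookupPairs, unsortedTable and sortedTable are made up for this example and are not the std.xml identifiers.

```d
// Binary search over a flat table of inclusive [lo, hi] pairs:
// [lo0, hi0, lo1, hi1, ...]. It is only correct when the pairs are
// sorted in ascending order and do not overlap.
bool lookupPairs(const(dchar)[] table, dchar c)
{
    while (table.length != 0)
    {
        // Index of the low bound of the middle pair (always even).
        immutable m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];      // continue in the pairs before m
        else if (c > table[m + 1])
            table = table[m + 2 .. $];  // continue in the pairs after m
        else
            return true;                // table[m] <= c <= table[m + 1]
    }
    return false;
}

// Hypothetical tables: the same three ranges, unordered and ordered.
immutable dchar[] unsortedTable = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
immutable dchar[] sortedTable   = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

unittest
{
    // With the unordered table the search moves away from the 0x4E00 pair.
    assert(!lookupPairs(unsortedTable, 0x4E00));
    // With the ordered table the same character is found.
    assert(lookupPairs(sortedTable, 0x4E00));
}
```

With the unordered table the middle pair is 0x3007-0x3007, so the search for 0x4E00 continues in the upper half, which only holds 0x3021-0x3029, and the character is never found.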

// PERFORMANCE SUGGESTION
// A lookup function pays off for larger tables.
// For a smaller table like this one, or for a single character,
// a hard-coded search will surely be faster.

// It does not generate much more code, and it avoids both the
// function call to lookup and the array slices.

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3021 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}
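
A hard-coded test is easy to get subtly wrong, so it may be worth checking it against the table over every code point. The following is only a sketch: inAnyRange, handCoded, tooWide, firstMismatch and ideographicPairs are hypothetical names introduced for this example.

```d
// The ranges from the table above, as inclusive [lo, hi] pairs.
immutable dchar[] ideographicPairs = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

// Reference check: linear scan over the pairs, so pair order does not matter.
bool inAnyRange(const(dchar)[] pairs, dchar c)
{
    for (size_t i = 0; i + 1 < pairs.length; i += 2)
        if (c >= pairs[i] && c <= pairs[i + 1])
            return true;
    return false;
}

// Hand-coded predicate that matches the table.
bool handCoded(dchar c)
{
    return c == 0x3007
        || (c >= 0x3021 && c <= 0x3029)
        || (c >= 0x4E00 && c <= 0x9FA5);
}

// Deliberately wrong predicate: its first range is too wide.
bool tooWide(dchar c)
{
    return (c >= 0x3007 && c <= 0x3029)
        || (c >= 0x4E00 && c <= 0x9FA5);
}

// Returns the first code point where pred and the table disagree,
// or -1 if they agree on every code point up to 0x10FFFF.
int firstMismatch(bool function(dchar) pred, const(dchar)[] pairs)
{
    foreach (int u; 0 .. 0x110000)
    {
        immutable c = cast(dchar) u;
        if (pred(c) != inAnyRange(pairs, c))
            return u;
    }
    return -1;
}

unittest
{
    assert(firstMismatch(&handCoded, ideographicPairs) == -1);
    assert(firstMismatch(&tooWide, ideographicPairs) == 0x3008);
}
```

The exhaustive loop covers about 1.1 million code points, which is cheap enough to run as a unit test.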

// Only a suggestion here.
// isChar has to be called for every single character in the document, so
//    it must be worth a bit of optimisation,
//    especially for the common cases.

/**
 * Returns true if the character is a character according to the XML standard.
 * Character references must refer to one of these.
 * Any Unicode character, excluding the surrogate blocks, FFFE, and FFFF:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * This version also returns false for the ranges to avoid:
 * [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF]
 * The common printable ASCII case takes at most 3 value comparisons.
 *
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * Params:
 *    c = the character to be tested
 */
bool isChar(dchar c) 
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* Disabled, because some conformance tests contain 0x10FFFF:
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
    if (c <= 0x0039 && c >= 0x0030)
        return true;
    else
        return lookup(DigitTable,c);
}
