[Issue 3465] New: isIdeographic can be wrong in std.xml
d-bugmail at puremagic.com
Sun Nov 1 21:51:26 PST 2009
http://d.puremagic.com/issues/show_bug.cgi?id=3465
Summary: isIdeographic can be wrong in std.xml
Product: D
Version: 2.035
Platform: Other
OS/Version: All
Status: NEW
Severity: minor
Priority: P2
Component: Phobos
AssignedTo: nobody at puremagic.com
ReportedBy: y0uf00bar at gmail.com
--- Comment #0 from hed010gy <y0uf00bar at gmail.com> 2009-11-01 21:51:25 PST ---
The std.xml function isIdeographic caused my parser to fail one of the XML
conformance tests, for the character 0x4E00.
// As implemented in the XML Piece Parser Project, http://source.miryn.org/
// (the code was originally taken from std.xml).
// WRONG in std.xml:
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];
// RIGHT: the lookup function requires the range pairs
// in the table to be in ascending order.
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];
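
To illustrate why the order of the pairs matters, here is a minimal sketch of a binary search over a flat table of inclusive [low, high] pairs. It is an assumption about how such a lookup typically works, not a copy of the std.xml source; the names rangeLookup, unsortedTable and sortedTable are hypothetical.

```d
// Table in the order the report marks as WRONG.
immutable dchar[] unsortedTable = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
// Same pairs in ascending order, as the report suggests.
immutable dchar[] sortedTable = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

// Hypothetical helper: binary search over a table laid out as
// [low0, high0, low1, high1, ...]. Only correct if the pairs ascend.
bool rangeLookup(const(dchar)[] table, dchar c)
{
    while (table.length != 0)
    {
        // Index of the low end of the middle pair (always even).
        immutable m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];       // continue in the pairs before m
        else if (c > table[m + 1])
            table = table[m + 2 .. $];   // continue in the pairs after m
        else
            return true;                 // table[m] <= c <= table[m + 1]
    }
    return false;
}
```

With unsortedTable, a search for 0x4E00 starts at the middle pair (0x3007, 0x3007), moves right to (0x3021, 0x3029), and then runs out of table, so the character is never found. With sortedTable the same search ends on the pair (0x4E00, 0x9FA5).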
// PERFORMANCE SUGGESTION
// A table lookup pays off for larger tables. For a small table
// like this one, or for a single character, a hard-coded
// comparison chain should be faster. It generates little extra code,
// and it avoids both the call to lookup and the array slicing.
bool isIdeographic(dchar c)
{
    // Ranges taken from the table above: 0x3007, 0x3021-0x3029, 0x4E00-0x9FA5
    if (c == 0x3007)
        return true;
    if (c >= 0x3021 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}
// This is only a suggestion.
// isChar has to be called for every single character in the document,
// so it should be worth a bit of optimisation,
// especially for the common cases.
/**
 * Returns true if the character is a character according to the XML standard.
 * Character references must refer to one of these.
 *
 * Any Unicode character, excluding the surrogate blocks, FFFE, and FFFF:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * This version also rejects [#x7F-#x84], [#x86-#x9F], and [#xFDD0-#xFDEF].
 *
 * The common ASCII case costs at most 3 comparisons.
 *
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * Params:
 *    c = the character to be tested
 */
bool isChar(dchar c)
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* some conformance tests have the 0x10FFFF
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}
// Most digits are expected to be ASCII, so test that range first
// and fall back to the table lookup only for other characters.
bool isDigit(dchar c)
{
    if (c >= 0x0030 && c <= 0x0039)
        return true;
    else
        return lookup(DigitTable,c);
}
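
An ordering mistake like the one in IdeographicTable could be caught before any parsing happens by checking the table itself. The following is only a sketch under that assumption; isSortedRangeTable, goodTable and badTable are hypothetical names and are not part of std.xml.

```d
// The two orderings discussed in this report.
immutable dchar[] goodTable = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];
immutable dchar[] badTable = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];

// Hypothetical helper: true if the table is a flat list of inclusive
// [low, high] pairs in ascending order with no overlap between pairs.
bool isSortedRangeTable(const(dchar)[] table)
{
    if (table.length % 2 != 0)
        return false;                    // a pair is incomplete
    for (size_t i = 0; i < table.length; i += 2)
    {
        if (table[i] > table[i + 1])
            return false;                // low end above high end
        if (i >= 2 && table[i - 1] >= table[i])
            return false;                // pair starts at or before the previous pair's end
    }
    return true;
}

// Evaluated at compile time, so a misordered table stops the build.
static assert(isSortedRangeTable(goodTable));
```

The same check could be run in a unittest block for every range table that is passed to lookup.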