Parsing with dxml

Tue Nov 19 02:45:29 UTC 2019

On Sunday, November 17, 2019 11:44:43 PM MST Joel via Digitalmars-d-learn 
wrote:
> I can only parse one row successfully. I tried increasing the
> popFronts, till it said I'd gone off the end.
>
> Running ./app
> core.exception.AssertError at ../../../../.dub/packages/dxml-0.4.1/dxml/source/dxml/parser.d(1457): text cannot be called with elementEnd
> ----------------
> ??:? _d_assert_msg [0x104b3981a]
> ../../JMiscLib/source/jmisc/base.d:161 pure @property @safe
> immutable(char)[] dxml.parser.EntityRange!(dxml.parser.Config(1,
> 1, 1, 1), immutable(char)[]).EntityRange.Entity.text()
> [0x104b2297b]
> source/app.d:26 _Dmain [0x104aeb46e]
> Program exited with code 1
>
> ```
> <?xml version="1.0"?>
>
> <resultset statement="SELECT * FROM bible.t_asv
> " xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
>    <row>
>   <field name="id">01001001</field>
>   <field name="b">1</field>
>   <field name="c">1</field>
>   <field name="v">1</field>
>   <field name="t">In the beginning God created the heavens and the
> earth.</field>
>    </row>
>
>    <row>
>   <field name="id">01001002</field>
>   <field name="b">1</field>
>   <field name="c">1</field>
>   <field name="v">2</field>
>   <field name="t">And the earth was waste and void; and darkness
> was upon the face of the deep: and the Spirit of God moved upon
> the face of the waters.</field>
>    </row>
>
> ```
>
> ```d
> void main() {
>      import std.stdio;
>      import std.file : readText;
>      import dxml.parser;
>      import std.conv : to;
>
>      struct Verse {
>          string id;
>          int b, c, v;
>          string t;
>      }
>
>      auto range = parseXML!simpleXML(readText("xmltest.xml"));
>
>      // simpleXML skips comments
>
>      void pops(int c) {
>          foreach(_; 0 .. c)
>              range.popFront();
>      }
>      pops(3);
>
>      Verse[] vers;
>      foreach(_; 0 .. 2) {
>          Verse ver;
>          ver.id = range.front.text;
>          pops(3);
>          ver.b = range.front.text.to!int;
>          pops(3);
>          ver.c = range.front.text.to!int;
>          pops(3);
>          ver.v = range.front.text.to!int;
>          pops(3);
>          ver.t = range.front.text;
>
>          with(ver)
>              vers ~= Verse(id,b,c,v,t);
>
>          pops(2);
>      }
>      foreach(verse; vers) with(verse)
>          writeln(id, " Book: ", b, " ", c, ":", v, " -> ", t);
> }
> ```

You need to be checking the type of the entity before you call either name
or text on it, because not all entities have a name, and not all entities
have text - e.g. <field name="id"> is an EntityType.elementStart, so it has
a name (which is "field"), but it doesn't have text, whereas the 01001001
between the <field name="id"> and </field> tags has no name but does have
text, because it's an EntityType.text. If you call name or text without
verifying the type first, then you're almost certainly going to get an
assertion failure sooner or later (assuming that you don't compile with
-release anyway), since you're bound to end up with an entity that you don't
expect at some point (either because you were wrong about where you were in
the document, or because the document didn't match the layout that was
expected).
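
As a rough illustration of checking the type first, something like the
following only touches name or text when the entity's type allows it.
describeEntities is just a name made up for this example (it isn't part of
dxml), the XML is a cut-down version of your first field, and it assumes
that dxml is set up as a dependency of the project, so treat it as a sketch
rather than something to drop in as-is:

```d
import std.stdio : writeln;
import dxml.parser;

// Hypothetical helper (not part of dxml): describes each entity while only
// touching the properties that are valid for that entity's type.
string[] describeEntities(string xml)
{
    string[] result;
    foreach(entity; parseXML!simpleXML(xml))
    {
        switch(entity.type)
        {
            // Start and end tags have a name but no text.
            case EntityType.elementStart:
                result ~= "start: " ~ entity.name;
                break;
            case EntityType.elementEnd:
                result ~= "end: " ~ entity.name;
                break;
            // Text between tags has text but no name.
            case EntityType.text:
                result ~= "text: " ~ entity.text;
                break;
            // Anything else is left alone rather than guessed at.
            default:
                result ~= "other";
                break;
        }
    }
    return result;
}

void main()
{
    auto xml = `<row><field name="id">01001001</field></row>`;
    foreach(line; describeEntities(xml))
        writeln(line);
}
```

Printing the entities like this is also a cheap way to see exactly what the
range contains before deciding how many times to pop.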

Per the assertion's message, you managed to call text on an
EntityType.elementEnd, and per the stack trace, text was called on this line

         ver.id = range.front.text;

If I add

         if(range.front.type == EntityType.elementEnd)
         {
             writeln(range.front.name);
             writeln(range.front.pos);
         }

right above that, I get

row
TextPos(11, 4)

indicating that the end tag was </row> and that it was on line 11, 4 code
units in (and since this is ASCII, that would be 4 characters). So, you
managed to parse all of the <field>***</field> lines but didn't correctly
deal with the end of that section.

If I add

    writeln(range.front);

right before

    pops(2);

then I get:

Entity(text, TextPos(10, 25), , Text!(ByCodeUnitImpl)(In the beginning God
created the heavens and the earth., TextPos(10, 25)))

So, prior to popping twice, it's on the text between <field name="t"> and
</field>, which looks like it's what you intended. If you look at the XML
after that, it should be clear why you're in the wrong place afterwards.

Since at that point, range.front is on the EntityType.text between
<field name="t"> and </field>, popping once makes it so that range.front is
</field>. And popping a second time makes range.front </row>, which is where
the range is when it then tries to call text at the top of the loop.
Presumably, you want it to be on the EntityType.text in

        <field name="id">01001002</field>

To get there from </row>, you'd have to pop once to get to <row>, a second
time to get to <field>, and a third time to get to 01001002. So, if you had

        pops(5);

instead of

        pops(2);

the range would be at the correct place at the top of the loop - though it
would then be the wrong number of times to pop the second time around. With
the text as provided, it would throw an XMLParsingException when it reached
the end of the loop the second time, because the XML document doesn't have
the matching </resultset> tag, and with that fixed, you end up with an
assertion failure, because popFront was called on an empty range (since
there aren't 5 entities left in the range at that point):

core.exception.AssertError at ../../.dub/packages/dxml-0.4.0/dxml/source/dxml/parser.d(1746): It's illegal to call popFront() on an empty EntityRange.

So, you'd need to adjust the end of the loop so that it only pops what it
needs to pop on the second loop. If you don't care about any data after that
point, you could just make it not pop on the last iteration, or what would
probably be better would be to write the loop so that it expects to start on
<row>, and it will exit the loop if it's instead on an end tag (since that
would indicate the end of that section, and in this case, it would mean that
it was on the last entity in the document).
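
A minimal sketch of that kind of loop might look like the following. It
assumes that the document has the closing </resultset> tag, that every row
has the five fields in the same order as in your example, and that no field
is empty. takeField, parseVerses, and sampleXML are names invented for the
sketch, and it still uses assertions rather than real error handling:

```d
import std.conv : to;
import std.stdio : writeln;
import dxml.parser;

struct Verse
{
    string id;
    int b, c, v;
    string t;
}

// Assumed sample input: the XML from the question with </resultset> added.
enum sampleXML = `<resultset statement="SELECT * FROM bible.t_asv">
   <row>
  <field name="id">01001001</field>
  <field name="b">1</field>
  <field name="c">1</field>
  <field name="v">1</field>
  <field name="t">In the beginning God created the heavens and the earth.</field>
   </row>
   <row>
  <field name="id">01001002</field>
  <field name="b">1</field>
  <field name="c">1</field>
  <field name="v">2</field>
  <field name="t">And the earth was waste and void.</field>
   </row>
</resultset>`;

// Hypothetical helper: expects range.front to be a <field> start tag with
// non-empty text inside, returns that text, and leaves the range on
// whatever follows </field>.
string takeField(R)(ref R range)
{
    assert(range.front.type == EntityType.elementStart);
    range.popFront();              // <field> -> its text
    auto text = range.front.text;
    range.popFront();              // text -> </field>
    range.popFront();              // </field> -> next <field> or </row>
    return text;
}

Verse[] parseVerses(string xml)
{
    auto range = parseXML!simpleXML(xml);
    range.popFront();              // <resultset> -> first <row>

    Verse[] verses;
    // Each iteration starts on <row>; an end tag here is </resultset>.
    while(range.front.type == EntityType.elementStart)
    {
        range.popFront();          // <row> -> first <field>
        Verse verse;
        verse.id = takeField(range);
        verse.b = takeField(range).to!int;
        verse.c = takeField(range).to!int;
        verse.v = takeField(range).to!int;
        verse.t = takeField(range);
        range.popFront();          // </row> -> next <row> or </resultset>
        verses ~= verse;
    }
    return verses;
}

void main()
{
    foreach(verse; parseVerses(sampleXML)) with(verse)
        writeln(id, " Book: ", b, " ", c, ":", v, " -> ", t);
}
```

Because the loop condition looks at the entity type rather than counting
rows, the same code works for any number of rows, and nothing gets popped
after </resultset>.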

Regardless, if you're actually looking to parse a document like this in
production code instead of in something that's just thrown together to get
something done, you'd actually need to be checking the EntityType of each
element to make sure that it was what was expected so that you can provide
an error to the user when the document is malformed. dxml expects that you
will only ever call a property of an EntityRange.Entity which is valid for
that EntityType, and it asserts that it's not called on the wrong type. So,
if you don't check the EntityType, unless you can guarantee that the XML
document is as expected, you're going to get assertion failures when not
compiling with -release, and you'll get weird results when the assertions
are compiled out with -release.
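
One way to do that sort of checking, sketched under the assumption that a
plain Exception carrying the position is good enough for reporting errors,
is a small helper which verifies the type before anything type-specific is
called. expectType and fieldText are hypothetical names for this example,
not dxml functions:

```d
import std.format : format;
import std.stdio : writeln;
import dxml.parser;

// Hypothetical helper: throws unless range.front exists and has the
// expected type, so that the caller never trips dxml's assertions.
void expectType(R)(ref R range, EntityType expected)
{
    if(range.empty)
    {
        throw new Exception(
            format!"expected %s but hit the end of the document"(expected));
    }
    auto entity = range.front;
    if(entity.type != expected)
    {
        throw new Exception(
            format!"line %s, column %s: expected %s but found %s"(
                entity.pos.line, entity.pos.col, expected, entity.type));
    }
}

// Hypothetical helper: reads <field ...>text</field> and leaves the range
// on whatever follows the end tag, validating each step along the way.
auto fieldText(R)(ref R range)
{
    expectType(range, EntityType.elementStart);
    range.popFront();
    expectType(range, EntityType.text);
    auto text = range.front.text;
    range.popFront();
    expectType(range, EntityType.elementEnd);
    range.popFront();
    return text;
}

void main()
{
    auto good = parseXML!simpleXML(`<field name="id">01001001</field>`);
    writeln(fieldText(good));

    // An empty field has no text entity, so the check fails cleanly
    // instead of hitting an assertion inside dxml.
    auto bad = parseXML!simpleXML(`<field name="id"></field>`);
    try
        writeln(fieldText(bad));
    catch(Exception e)
        writeln("malformed document: ", e.msg);
}
```

Note that popFront can itself throw an XMLParsingException if the XML isn't
well-formed, so real code would want to deal with that as well.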

On an unrelated note, std.range.primitives.popFrontN (or
std.range.popFrontN, since std.range publicly imports std.range.primitives)
does what your pops function does - and it does it more efficiently for
ranges which have slicing (which dxml's EntityRange doesn't, but either way,
you can just use the function from Phobos instead of writing your own).
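
For example, a small sketch using a cut-down version of your XML (the
document here is just made up for the illustration):

```d
import std.range : popFrontN;
import std.stdio : writeln;
import dxml.parser;

void main()
{
    auto range = parseXML!simpleXML(
        `<row><field name="id">01001001</field></row>`);

    // Same effect as pops(2): <row> -> <field> -> the text.
    auto popped = range.popFrontN(2);
    assert(popped == 2);
    assert(range.front.type == EntityType.text);
    writeln(range.front.text);

    // Only three entities are left (the text, </field>, and </row>), so
    // asking for 10 stops when the range is empty and reports how many
    // were actually popped.
    popped = range.popFrontN(10);
    assert(popped == 3);
    assert(range.empty);
}
```

One difference from your pops function is that popFrontN stops when the
range runs out and returns the number of elements that it actually popped,
so comparing the return value against what you asked for tells you when the
document was shorter than expected.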

- Jonathan M Davis