std.xml and Adam D Ruppe's dom module

Johannes Pfau nospam at example.com
Thu Feb 9 03:07:32 PST 2012


Am Wed, 08 Feb 2012 20:49:48 -0600
schrieb "Robert Jacques" <sandford at jhu.edu>:

> On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau
> <nospam at example.com> wrote:
> > Am Tue, 07 Feb 2012 20:44:08 -0500
> > schrieb "Jonathan M Davis" <jmdavisProg at gmx.com>:
> >> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
> >> > On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
> [snip]
> >
> > Using ranges of dchar directly can be horribly inefficient in some
> > cases, you'll need at least some kind of buffered dchar range. Some
> > std.json replacement code tried to use only dchar ranges and had to
> > reassemble strings character by character using Appender. That sucks
> > especially if you're only interested in a small part of the data and
> > don't care about the rest.
> > So for pull/sax parsers: Use buffering, return strings (better:
> > char[]/wchar[]/dchar[]) as slices to that buffer. If the user needs
> > to keep a string, he can still copy it. (String decoding should also
> > be done on-demand only.)
> 
> Speaking as the one proposing said Json replacement, I'd like to
> point out that JSON strings != UTF strings: manual conversion is
> required some of the time. And I use appender as a dynamic buffer in
> exactly the manner you suggest. There's even an option to use a
> string cache to minimize total memory usage. (Hmm... that
> functionality should probably be re-factored out and made into its
> own utility) That said, I do end up doing a bunch of useless encodes
> and decodes, so I'm going to special case those away and add slicing
> support for strings. wstrings and dstring will still need to be
> converted as currently Json values only accept strings and therefore
> also Json tokens only support strings. As a potential user of the
> sax/pull interface would you prefer the extra clutter of special side
> channels for zero-copy wstrings and dstrings?

Regarding wstrings and dstrings: Well, JSON seems to be UTF-8 in almost
all cases, so it's not that important. But I think it should be possible
to use templates to implement identical parsers for strings, wstrings,
and dstrings.
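
Roughly like this (a sketch only; all names here are hypothetical):
----------------
// Write the parser once, parameterized on the string type.
struct JsonParser(String)
    if (is(String : const(char)[]) || is(String : const(wchar)[])
        || is(String : const(dchar)[]))
{
    String input; // the same parsing code handles all three widths

    @property bool empty() { return input.length == 0; }
    // front, popFront etc. implemented once in terms of String
}

unittest
{
    auto p8  = JsonParser!string(`{"a":1}`);   // UTF-8
    auto p16 = JsonParser!wstring(`{"a":1}`w); // UTF-16
}
----------------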

Regarding the use of Appender: Long text ahead ;-)

I think pull parsers should really be as fast as possible and low-level.
For easy-to-use high-level stuff there's always the DOM, and a safe,
high-level serialization API should be implemented on top of the
PullParser as well. The serialization API would read only the requested
data, skipping the rest:
----------------
struct Data
{
    string link;
}
auto data = unserialize!Data(json);
----------------
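
A minimal sketch of how such an unserialize could work on top of the
pull parser, reusing the Parser/KEY/tempDecodeJSON/decodeJSON names
from the example further down (everything hypothetical; only string
fields are handled, error handling omitted):
----------------
T unserialize(T)(string json)
{
    T result;
    auto parser = Parser(json);
    while (!parser.empty)
    {
        if (parser.front.type == KEY)
        {
            auto key = tempDecodeJSON(parser.front.value);
            // Compile-time iteration over T's members; any key that
            // isn't requested is skipped without decoding or copying.
            foreach (i, ref field; result.tupleof)
            {
                static if (is(typeof(field) == string))
                {
                    if (key == __traits(identifier, T.tupleof[i]))
                    {
                        parser.popFront();
                        field = decodeJSON(parser.front.value).idup;
                    }
                }
            }
        }
        parser.popFront();
    }
    return result;
}
----------------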

So in the PullParser we should avoid memory allocation whenever
possible; I think we can even avoid it completely:

I think dchar ranges are just the wrong input type for parsers; parsers
should use buffered ranges or streams (which would be basically the
same). For real dchar ranges we could then use a generic BufferedRange
adapter. This BufferedRange could use a static buffer, so there's no
need to allocate anything.
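
Something along these lines, as a sketch only (the name BufferedRange
and its layout are assumptions):
----------------
import std.range;

// Wraps any dchar input range and exposes a window of decoded
// characters in a fixed-size buffer; slices into the window stay
// valid until the next refill, and nothing is heap-allocated.
struct BufferedRange(R) if (is(ElementType!R == dchar))
{
    R source;
    dchar[512] buffer;  // static buffer
    dchar[] window;     // currently valid part of `buffer`

    void refill()
    {
        size_t n = 0;
        while (n < buffer.length && !source.empty)
        {
            buffer[n++] = source.front;
            source.popFront();
        }
        window = buffer[0 .. n];
    }
}
----------------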

The pull parser should return slices into the original string (if the
input is a string) or slices into the Range/Stream's buffer. Of course,
such a slice is only valid until the pull parser is called again. The
slice also wouldn't be decoded yet. And a sliced string could only be
as long as the buffer, but I don't think this is an issue: a 512KB
buffer can already store 524288 chars.
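
To illustrate the intended contract (hypothetical API):
----------------
auto key = parser.front.value; // slice into the parser's buffer
string copy = key.idup;        // copy now if it must outlive the token
parser.popFront();             // `key` may now point at refilled data
----------------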

If the user wants to keep a string, he should really do
decodeJSONString(data).idup. There's a little more opportunity for
optimization: a decoded JSON string is never longer than the encoded
one (every escape sequence, from the two-byte \n up to the six-byte
\uXXXX, decodes to at most as many UTF-8 code units as it occupies), so
we could have a decodeJSONString function which overwrites the original
buffer --> no memory allocation.

If overwriting the input buffer is not acceptable, decodeJSONString has
to allocate, but only iff the decoded string actually differs from the
encoded one. So we need a function which always returns the decoded
string as a safe-to-keep copy, and a function which returns the decoded
string as a slice when decoding doesn't change it.
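
A minimal sketch of the in-place variant (the function name is assumed;
\uXXXX handling omitted). Because the decoded form is never longer, the
write cursor never overtakes the read cursor, so no allocation is
needed:
----------------
// Unescape a JSON string in place and return the shortened slice.
char[] decodeJSONStringInPlace(char[] s)
{
    size_t w = 0; // write position, always <= read position
    for (size_t r = 0; r < s.length; )
    {
        if (s[r] == '\\' && r + 1 < s.length)
        {
            switch (s[r + 1])
            {
                case 'n':  s[w++] = '\n'; break;
                case 't':  s[w++] = '\t'; break;
                case 'r':  s[w++] = '\r'; break;
                case 'b':  s[w++] = '\b'; break;
                case 'f':  s[w++] = '\f'; break;
                case '"':  s[w++] = '"';  break;
                case '/':  s[w++] = '/';  break;
                case '\\': s[w++] = '\\'; break;
                default:   s[w++] = s[r + 1]; break; // \uXXXX omitted
            }
            r += 2;
        }
        else
            s[w++] = s[r++];
    }
    return s[0 .. w];
}
----------------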

An example:

string json = `{
   "link":"http://www.google.com",
   "useless_data":"lorem ipsum",
   "more":{
      "not interested":"yes"
   }
}`;

Now I'm only interested in the link. It should be possible to parse
that with zero memory allocations:

string findLink(string json)
{
    auto parser = Parser(json);
    while(!parser.empty)
    {
        if(parser.front.type == KEY
           && tempDecodeJSON(parser.front.value) == "link")
        {
            parser.popFront();
            assert(!parser.empty && parser.front.type == VALUE);
            return decodeJSON(parser.front.value); // should return a slice
        }
        // Skip everything else.
        parser.popFront();
    }
    return null; // no "link" key found
}

tempDecodeJSON returns a decoded string which (usually) isn't safe to
store (it can/should be a slice into the internal buffer; here it's a
slice into the original string, so it could be stored, but there's no
guarantee). In this case, the call to tempDecodeJSON could even be left
out, as we only search for "link", which doesn't need decoding.

