std.xml and Adam D Ruppe's dom module

Sean Kelly sean at invisibleduck.org
Thu Feb 9 08:01:48 PST 2012


This. And decoded JSON strings are always smaller than encoded strings--JSON uses escaping to encode non-UTF-8 stuff, so in the case where someone sends a surrogate pair (legal in JSON) it's encoded as a \uXXXX\uXXXX escape pair. In short, it's absolutely possible to create a pull parser that never allocates, even for decoding. As proof, I've done it before. :-p
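
To put rough numbers on that (an illustrative check, not from the original exchange): the escaped form of a surrogate pair is 12 bytes of JSON text, while the decoded code point is at most 4 bytes of UTF-8, so decoding in place can never outgrow the source buffer.
----------------
import std.stdio;

void main()
{
    // Escaped JSON form of U+1F600, written as a surrogate pair.
    string escaped = `\uD83D\uDE00`;   // 12 bytes of JSON text
    // The same code point decoded to UTF-8.
    string decoded = "\U0001F600";     // 4 bytes
    writefln("escaped: %s bytes, decoded: %s bytes",
             escaped.length, decoded.length);
}
----------------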

On Feb 9, 2012, at 3:07 AM, Johannes Pfau <nospam at example.com> wrote:

> Am Wed, 08 Feb 2012 20:49:48 -0600
> schrieb "Robert Jacques" <sandford at jhu.edu>:
> 
>> On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau
>> <nospam at example.com> wrote:
>>> Am Tue, 07 Feb 2012 20:44:08 -0500
>>> schrieb "Jonathan M Davis" <jmdavisProg at gmx.com>:
>>>> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
>>>>> On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
>> [snip]
>>> 
>>> Using ranges of dchar directly can be horribly inefficient in some
>>> cases; you'll need at least some kind of buffered dchar range. Some
>>> std.json replacement code tried to use only dchar ranges and had to
>>> reassemble strings character by character using Appender. That sucks
>>> especially if you're only interested in a small part of the data and
>>> don't care about the rest.
>>> So for pull/sax parsers: use buffering, return strings (better:
>>> w/d/char[]) as slices to that buffer. If the user needs to keep a
>>> string, he can still copy it. (String decoding should also be done
>>> on-demand only).
>> 
>> Speaking as the one proposing said Json replacement, I'd like to
>> point out that JSON strings != UTF strings: manual conversion is
>> required some of the time. And I use appender as a dynamic buffer in
>> exactly the manner you suggest. There's even an option to use a
>> string cache to minimize total memory usage. (Hmm... that
>> functionality should probably be re-factored out and made into its
>> own utility.) That said, I do end up doing a bunch of useless encodes
>> and decodes, so I'm going to special case those away and add slicing
>> support for strings. wstrings and dstrings will still need to be
>> converted as currently Json values only accept strings and therefore
>> also Json tokens only support strings. As a potential user of the
>> sax/pull interface would you prefer the extra clutter of special side
>> channels for zero-copy wstrings and dstrings?
> 
> Regarding wstrings and dstrings: Well, JSON seems to be UTF-8 in almost
> all cases, so it's not that important. But I think it should be possible
> to use templates to implement identical parsers for char/wchar/dchar strings
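> For instance (just a sketch; JsonParser is a made-up name, not an
> existing type):
> ----------------
> struct JsonParser(Char)
>     if (is(Char == char) || is(Char == wchar) || is(Char == dchar))
> {
>     const(Char)[] input;   // the same code handles string, wstring, dstring
>     // ... tokenization works on Char code units ...
> }
> alias Parser = JsonParser!char;
> ----------------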
> 
> Regarding the use of Appender: Long text ahead ;-)
> 
> I think pull parsers should really be as fast as possible and low-level.
> For easy-to-use high-level stuff there's always DOM, and a safe,
> high-level serialization API should be implemented based on the
> PullParser as well. The serialization API would read only the requested
> data, skipping the rest:
> ----------------
> struct Data
> {
>    string link;
> }
> auto data = unserialize!Data(json);
> ----------------
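> 
> A rough sketch of how such an unserialize could sit on top of the pull
> parser (Parser, KEY, VALUE, tempDecodeJSON and decodeJSON are the
> hypothetical names used in the example further down; nesting is ignored
> and all fields are assumed to be strings, as in Data):
> ----------------
> import std.traits : FieldNameTuple;
> 
> T unserialize(T)(string json)
> {
>     T result;
>     auto parser = Parser(json);
>     parser.popFront();
>     while(!parser.empty)
>     {
>         if(parser.front.type == KEY)
>         {
>             auto key = tempDecodeJSON(parser.front.value);
>             foreach(member; FieldNameTuple!T)   // unrolled at compile time
>             {
>                 if(key == member)
>                 {
>                     parser.popFront();          // advance to the value
>                     __traits(getMember, result, member) =
>                         decodeJSON(parser.front.value).idup; // safe-to-keep copy
>                 }
>             }
>         }
>         parser.popFront();   // everything else is skipped, nothing allocated
>     }
>     return result;
> }
> ----------------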
> 
> So in the PullParser we should
> avoid memory allocation whenever possible; I think we can even avoid it
> completely:
> 
> I think dchar ranges are just the wrong input type for parsers, parsers
> should use buffered ranges or streams (which would be basically the
> same). For real dchar ranges we could then use a generic BufferedRange.
> This BufferedRange could use a static buffer, so
> there's no need to allocate anything.
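> Something along these lines, say (purely illustrative; there's no
> BufferedRange in Phobos):
> ----------------
> import std.utf : encode;
> 
> struct BufferedRange(R)   // R: an input range of dchar
> {
>     private R source;
>     private char[4096] buffer;   // fixed-size, never allocates
>     private size_t filled;
> 
>     // Refill the buffer from the source. The returned char[] slice points
>     // into the buffer and is only valid until the next call to fill().
>     char[] fill()
>     {
>         filled = 0;
>         char[4] tmp;
>         while(!source.empty)
>         {
>             auto n = encode(tmp, source.front);   // dchar -> UTF-8
>             if(filled + n > buffer.length)
>                 break;
>             buffer[filled .. filled + n] = tmp[0 .. n];
>             filled += n;
>             source.popFront();
>         }
>         return buffer[0 .. filled];
>     }
> }
> ----------------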
> 
> The pull parser should return slices to the original string (if the
> input is a string) or slices to the Range/Stream's buffer.
> Of course, such a slice is only valid till the pull parser is called
> again. The slice also wouldn't be decoded yet. And a sliced string could
> only be as long as the buffer, but I don't think this is an issue: a
> 512KB buffer can already hold 524288 chars.
> 
> If the user wants to keep a string, he should really do
> decodeJSONString(data).idup. There's a little more opportunity for
> optimization: as long as a decoded JSON string is always smaller than
> the encoded one (I don't know if it is), we could have a decodeJSONString
> function which overwrites the original buffer --> no memory allocation.
> 
> If that's not the case, decodeJSONString has to allocate iff the
> decoded string is different. So we need a function which always returns
> the decoded string as a safe-to-keep copy and a function which returns
> the decoded string as a slice if the decoded string is
> the same as the original.
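> 
> An in-place decoder for the non-allocating case could look roughly like
> this (only \\ and \n handled here; a real version would also cover \",
> \t, \uXXXX and so on):
> ----------------
> // Decodes escapes in place. The decoded text is never longer than the
> // encoded text, so it always fits in the original buffer; the returned
> // slice is only valid as long as that buffer is.
> char[] tempDecodeJSON(char[] s)
> {
>     size_t write = 0;
>     for(size_t read = 0; read < s.length; ++read, ++write)
>     {
>         if(s[read] == '\\' && read + 1 < s.length)
>         {
>             ++read;
>             s[write] = (s[read] == 'n') ? '\n' : s[read];
>         }
>         else
>             s[write] = s[read];
>     }
>     return s[0 .. write];
> }
> ----------------
> A caller who wants to keep the result appends .idup; everyone else pays
> no allocation at all.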
> 
> An example:
> ----------------
> string json = `{
>   "link":"http://www.google.com",
>   "useless_data":"lorem ipsum",
>   "more":{
>      "not interested":"yes"
>   }
> }`;
> ----------------
> 
> Now I'm only interested in the link. It should be possible to parse that
> with zero memory allocations:
> 
> ----------------
> auto parser = Parser(json);
> parser.popFront();   // advance to the first token
> while(!parser.empty)
> {
>    if(parser.front.type == KEY
>       && tempDecodeJSON(parser.front.value) == "link")
>    {
>        parser.popFront();
>        assert(!parser.empty && parser.front.type == VALUE);
>        return decodeJSON(parser.front.value); // should return a slice
>    }
>    // Skip everything else.
>    parser.popFront();
> }
> ----------------
> 
> tempDecodeJSON returns a decoded string, which (usually) isn't safe to
> store (it can/should be a slice into the internal buffer; here it's a
> slice into the original string, so it could be stored, but there's no
> guarantee). In this case, the call to tempDecodeJSON could even be left
> out, as we only search for "link", which doesn't need any decoding.

