Range interface for std.serialization

Thu Aug 22 10:39:17 PDT 2013

On Thu, 22 Aug 2013 17:49:04 +0200,
"Dicebot" <public at dicebot.lv> wrote:

> On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
> > The reason is simple: In serialization it is not common to post-process
> > the serialized data as far as I know. Usually it's either written to a
> > file or sent over network which are perfect examples of Streams (or
> > output ranges).
> 
> Hm but in this model it is file / socket which is an OutputRange, 
> isn't it? Serializer itself just provides yet another InputRange 
> which can be fed to target OutputRange. Am I getting this part 
> wrong?

Yes, but the important point is that Serializer is _not_ an InputRange
of serialized data. Instead it _uses_ an OutputRange / Stream
internally.

I'll show a very simplified example:
---------------------
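import std.stdio;
import std.traits : isDynamicArray;

struct Serializer(Output) //if(isOutputRange!(Output, ubyte[]))
{
    private Output _output;
    this(Output output)
    {
        _output = output;
    }

    void serialize(T)(T data)
    {
        static if (isDynamicArray!T)
            //Arrays (strings included): write the elements, not the
            //pointer/length pair of the slice itself
            _output.put(cast(const(ubyte)[])data);
        else
            //Value types: write the in-memory representation
            _output.put((cast(const(ubyte)*)&data)[0..T.sizeof]);
    }
}

void put(File f, const(ubyte)[] data) //File is not an OutputRange...
{
    f.rawWrite(data); //write() would format the array as text
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
}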
---------------------

As you can see there are absolutely no memory allocations necessary. Of
course in reality you'll need a fixed buffer but there's no dynamic
allocation.
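
To illustrate the fixed buffer remark, here is a rough sketch of what
such a buffer could look like. The names FileSink and Buffered are made
up for this example and the capacity of 4096 is an arbitrary assumption;
nothing here is part of an actual std.serialization proposal. The buffer
lives inside the struct, so put() itself never allocates:

```d
import std.stdio : File, stdout;

// Hypothetical adapter: gives File the put() a serializer expects.
struct FileSink
{
    File file;
    void put(const(ubyte)[] data) { file.rawWrite(data); }
}

// Hypothetical fixed-size buffer in front of any sink that has put().
struct Buffered(Sink, size_t capacity = 4096)
{
    private Sink _sink;
    private ubyte[capacity] _buf; // stored in the struct, not on the heap
    private size_t _used;

    this(Sink sink) { _sink = sink; }

    void put(const(ubyte)[] data)
    {
        while (data.length > 0)
        {
            // Buffer full: hand the bytes to the sink and start over.
            if (_used == capacity)
                flush();
            immutable room = capacity - _used;
            immutable n = data.length < room ? data.length : room;
            _buf[_used .. _used + n] = data[0 .. n];
            _used += n;
            data = data[n .. $];
        }
    }

    // Must be called once at the end to write out the remaining bytes.
    void flush()
    {
        if (_used > 0)
            _sink.put(_buf[0 .. _used]);
        _used = 0;
    }
}
```

Something like Buffered!FileSink(FileSink(stdout)) could then be handed
to a serializer as its output, with a final flush() once all data has
been written.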

Now try to implement this in an efficient way as an InputRange. Here's
the skeleton:
---------------------
import std.stdio;

struct Serializer
{
    void serialize(T)(T data) {} //where does the serialized data go?
    bool empty() { return true; } //placeholder
    ubyte[] front;
    void popFront() {}
}

void main()
{
    Serializer serializer;
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
    foreach(ubyte[] data; serializer)
        stdout.rawWrite(data);
}
---------------------

How would you implement this? It can only work efficiently if
Serializer wraps an InputRange of the values to serialize or if there's
only one value to serialize. The serialize method as defined above,
which can be called any number of times before the data is read, cannot
be implemented efficiently with this approach.

Now I do confess that an InputRange filter is useful, but only for
specific use cases. The more common use case is writing directly to
an OutputRange, and this should be as efficient as possible. With a good
design it should be possible to support both cases efficiently with the
same "backends". But implementing an InputRange serializer filter will
still be much more difficult than the OutputRange case (the serializer
must be capable of resuming serialization at any point, as your output
buffer might be full).
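
For comparison, the easy variant of such a filter, one that wraps an
InputRange of values, might look roughly like the sketch below.
SerializingRange is a made-up name, and the sketch assumes plain value
types that each fit into one fixed chunk, which is exactly the case
where resuming in the middle of a value never comes up:

```d
import std.range;

// Hypothetical lazy filter: wraps an input range of plain values and
// exposes the raw bytes of each value as one chunk.
struct SerializingRange(R) if (isInputRange!R)
{
    private alias E = ElementType!R;
    private R _source;
    private ubyte[E.sizeof] _chunk; // fixed buffer, reused per element
    private bool _filled;

    this(R source) { _source = source; }

    @property bool empty() { return _source.empty; }

    @property const(ubyte)[] front()
    {
        if (!_filled)
        {
            // Copy the in-memory representation of the current value.
            E value = _source.front;
            _chunk[] = (cast(const(ubyte)*)&value)[0 .. E.sizeof];
            _filled = true;
        }
        // Only valid until the next popFront().
        return _chunk[];
    }

    void popFront()
    {
        _source.popFront();
        _filled = false;
    }
}
```

A real filter for variable-sized data (strings, nested structs) would
also have to remember how far into the current value it got, which is
where the extra complexity comes from.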

I'd like to make another comment about performance. I think there are
two possible usages / user groups of std.serialization.

1) The classical, heavyweight C#/Java-style serialization, which can
serialize complete object graphs, deals with inheritance, and so on.

2) The simple "Just write the JSON representation of this struct to
this file" kind of usage.

For use case 2 it's important that there's as little overhead as
possible. Consider this struct:

---------
struct Song
{
    string artist;
    string title;
}
---------

If I wrote the JSON serialization by hand, it would look like this:
---------
auto a = appender!string(); //std.array.appender, or any OutputRange
Song s;
a.put("{\n");
a.put(`    "artist": "`);
a.put(s.artist);
a.put("\",\n");
a.put(`    "title": "`);
a.put(s.title);
a.put("\"\n}\n");
---------

As you can see this code does basically nothing: No allocation, no
string processing, it just copies data. But it's annoying to write
this boilerplate.
I'd expect a serialization lib to let me do this:
serialize!JSON(a, s);
And performance should be very close to the hand-written code above.
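
A minimal sketch of how such a call could expand to the same sequence
of put() calls, using compile-time reflection over the struct fields.
The JSON tag type and this serialize signature are assumptions made for
the example, not an existing API; the sketch handles only string fields
and does no JSON escaping, just like the manual version:

```d
import std.array : appender;
import std.range : put;

struct Song
{
    string artist;
    string title;
}

// Hypothetical format tag, only used to select the output format.
struct JSON {}

// Hypothetical entry point: writes a flat struct of strings as JSON.
void serialize(Format, Output, T)(ref Output output, T value)
    if (is(Format == JSON) && is(T == struct))
{
    output.put("{\n");
    // Unrolled at compile time: one iteration per field.
    foreach (i, member; value.tupleof)
    {
        static assert(is(typeof(member) : const(char)[]),
            "this sketch handles string fields only");
        // The key text is assembled at compile time from the field name.
        enum prefix = `    "` ~ __traits(identifier, T.tupleof[i]) ~ `": "`;
        output.put(prefix);
        output.put(member);
        output.put(i + 1 < value.tupleof.length ? "\",\n" : "\"\n");
    }
    output.put("}\n");
}
```

Since the loop is unrolled and the keys are compile-time constants, what
remains at runtime is a series of put() calls much like the manual code.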