[RFC] I/O and Buffer Range

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Dec 31 01:23:49 PST 2013


31-Dec-2013 05:51, Brad Anderson wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> Proposal
>
> Having never written any parser I'm not really qualified to seriously
> give comments or review it but it all looks very nice to me.
>
> Speaking as just an end user of these things: whenever I use ranges over
> files or from, say, std.net.curl, the byLine/byChunk interface always
> feels terribly awkward to use, which often leads to me just giving up and
> loading the entire file/resource into an array. It's the boundaries that
> I stumble over. byLine never fits when I want to extract something
> multiline, but byChunk doesn't fit either, because if what I'm searching
> for lands on a chunk boundary I'll miss it.

Exactly, the situation is simply not good enough. I can assure you that 
from the parser writer's side it's even less appealing.
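
To make the boundary problem concrete, here is a minimal sketch (the file 
name "page.html" and the 4 KiB chunk size are just placeholders): a match 
that straddles two chunks is invisible to per-chunk matching.

import std.regex : matchFirst, regex;
import std.stdio : File, writeln;

void main()
{
    auto re = regex(`<title>(.*?)</title>`, "s");
    foreach (chunk; File("page.html").byChunk(4096))
    {
        // Each iteration only sees one 4 KiB window. If "<title>" starts
        // near the end of this chunk and "</title>" ends in the next one,
        // neither call to matchFirst can see the whole match.
        auto m = matchFirst(cast(char[]) chunk, re);
        if (!m.empty)
            writeln(m[1]); // found only if the title fits in a single chunk
    }
}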

>
> Being able to just do a matchAll() on a file, std.net.curl, etc. without
> sacrificing performance and memory would be such a massive gain for
> usability.

.. and performance ;)
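
For reference, the workaround we are both describing looks roughly like 
this today (URL and pattern are only examples): the whole body has to sit 
in memory before std.regex can touch it.

import std.net.curl : get;
import std.regex : matchAll, regex;
import std.stdio : writeln;

void main()
{
    auto page = get("http://dlang.org");   // entire resource buffered in memory
    auto re   = regex(`<a\s+href="([^"]+)"`);
    foreach (m; matchAll(page, re))
        writeln(m[1]);                     // each link target
}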

>
> Just a simple example of where I couldn't figure out how to utilize
> either byLine or byChunk without adding some clunky homegrown buffering
> solution. This is code that scrapes website titles from the pages of
> URLs in IRC messages.
[snip]
>
> I really, really didn't want to use that std.net.curl.get().  It causes
> all sorts of problems if someone links to a huge resource.

*Nods*

> I just could
> not figure out how to utilize byLine (the title regex capture can be
> multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob
> Ovrum's help in IRC) was really a goal here since this is just a toy, so
> I went with get() for the time being, but it's always sad to have to
> choose between elegance and performance. I certainly didn't want to add
> some elaborate, ever-growing buffer in the middle of this otherwise clean
> UFCS chain (and I'm not even sure how to incrementally regex-search a
> growing buffer, or if that's even possible).

I had thought of providing something like that: an incremental match that 
takes data slice by slice and hands back some kind of not-yet-matched 
object between calls. But it was solving the wrong problem, and it showed 
that backtracking engines simply can't work that way; they may need to go 
back to the prior pieces.

>
> If I'm understanding your proposal correctly, that get(url) could be
> replaced with a hypothetical std.net.curl "buffer range" which could
> be passed directly to matchFirst. It would only take up, at most, the
> size of the buffer in memory (which could grow if the capture grows to
> be larger than the buffer) and wouldn't read the unneeded portion of the
> resource at all. That would be such a huge win for everyone so I'm very
> excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)
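
Purely as an illustration of the intended usage, with made-up names 
(neither bufferedGet nor matching over a buffer range exists today), the 
user-side code might end up looking something like this:

import std.regex : matchFirst, regex;
import std.stdio : writeln;

void printTitle(string url)
{
    auto re  = regex(`<title>(.*?)</title>`, "s");
    auto buf = bufferedGet(url);    // hypothetical std.net.curl "buffer range"
    auto m   = matchFirst(buf, re); // would read only as much of the resource
                                    // as the match needs, one buffer at a time
    if (!m.empty)
        writeln(m[1]);
}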

>
>
> P.S. I love std.regex more and more every day. It made that
> entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with functor rox!
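
For anyone following along, "replace with functor" means passing a callable 
to std.regex.replaceAll as a template argument. A tiny sketch in the same 
spirit as entitiesToUni (the entity table is just an illustrative subset, 
not the code from the dpaste link):

import std.regex : regex, replaceAll;

string entitiesToUniSketch(string text)
{
    auto entities = ["amp" : "&", "lt" : "<", "gt" : ">", "copy" : "\u00A9"];
    auto re = regex(`&(\w+);`);
    // The functor runs once per match; unknown entities are left unchanged.
    return replaceAll!(m => entities.get(m[1], m.hit))(text, re);
}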

-- 
Dmitry Olshansky

