[RFC] I/O and Buffer Range
Brad Anderson
eco at gnuk.net
Mon Dec 30 17:51:21 PST 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky
wrote:
> Proposal
Having never written a parser I'm not really qualified to
review it seriously or give detailed comments, but it all looks
very nice to me.
Speaking as just an end user of these things: whenever I use
ranges over files or from, say, std.net.curl, the byLine/byChunk
interface always feels terribly awkward, which often leads to me
just giving up and loading the entire file/resource into an
array. It's the boundaries that I stumble over. byLine never fits
when I want to extract something multiline, and byChunk doesn't
fit because if what I'm searching for lands on a chunk boundary
I'll miss it.
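To make the boundary miss concrete, here's a toy sketch (the tiny
chunk size is artificial, just to force the title element to
straddle a boundary):

---
import std.regex : matchFirst, regex;
import std.stdio : File, writeln;

void main()
{
    // Suppose page.html contains "<title>Hi</title>" and we scan
    // 8 bytes at a time. No single chunk ever holds the whole
    // element, so the per-chunk search can never match it.
    auto title_re = regex(`<title>(.*?)</title>`);
    foreach (chunk; File("page.html").byChunk(8))
    {
        auto m = matchFirst(cast(char[]) chunk, title_re);
        if (!m.empty)
            writeln(m[1]); // never reached when the match spans a boundary
    }
}
---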
Being able to just do a matchAll() on a file, std.net.curl, etc.
without sacrificing performance and memory would be such a
massive gain for usability.
Just a simple example of where I couldn't figure out how to
utilize either byLine or byChunk without adding some clunky
homegrown buffering solution. This is code that scrapes website
titles from the pages of URLs in IRC messages.
---
import std.algorithm : filter, map;
import std.array : array;
import std.encoding : EncodingSchemeUtf8;
import std.exception : ifThrown;
import std.net.curl : get;
import std.regex : ctRegex, matchAll, matchFirst, replace;
// entitiesToUni is my own helper: http://dpaste.dzfl.pl/688f2e7d

auto scrapeTitles(M)(in M message)
{
    static url_re   = ctRegex!(r"(https?|ftp)://[^\s/$.?#].[^\s]*", "i");
    static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
    static ws_re    = ctRegex!(r"(\s{2,}|\n|\t)");
    auto utf8 = new EncodingSchemeUtf8;
    auto titles =
        matchAll(message, url_re)
        .map!(match => match.captures[0])
        .map!(url => get(url).ifThrown([]))
        .map!(bytes => cast(string) utf8.sanitize(cast(immutable(ubyte)[]) bytes))
        .map!(content => matchFirst(content, title_re))
        .filter!(captures => !captures.empty)
        .map!(capture => capture[1].idup) // dup so the original page can be GCed
        .map!(title => title.entitiesToUni.replace(ws_re, " "))
        .array;
    return titles;
}
---
I really, really didn't want to use that std.net.curl.get(). It
causes all sorts of problems if someone links to a huge resource,
but I just could not figure out how to use byLine (the title
regex capture can be multiline) or byChunk cleanly. Code elegance
(a lot of it due to Jakob Ovrum's help in IRC) was really a goal
here, as this is just a toy, so I went with get() for the time
being; it's always sad to have to sacrifice performance for
elegance, though. I certainly didn't want to add some elaborate,
ever-growing buffer in the middle of this otherwise clean UFCS
chain (and I'm not even sure how to incrementally regex-search a
growing buffer, or whether that's even possible).
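For reference, the clunky workaround I was trying to avoid looks
roughly like this (just a sketch; it re-scans everything read so
far on every chunk, and in the worst case still accumulates the
whole resource in memory):

---
import std.array : appender;
import std.net.curl : byChunk;
import std.regex : matchFirst, regex;

string findTitle(string url)
{
    auto title_re = regex(`<title.*?>(.*?)<`, "si");
    auto buf = appender!(ubyte[])();
    foreach (chunk; byChunk(url, 4096))
    {
        buf.put(chunk);
        // Re-run the search over the accumulated buffer each time:
        // O(n^2) in the worst case, and ugly in the middle of a UFCS chain.
        auto m = matchFirst(cast(char[]) buf.data, title_re);
        if (!m.empty)
            return m[1].idup; // stop downloading once the title is found
    }
    return null;
}
---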
If I'm understanding your proposal correctly, that get(url) could
be replaced with a hypothetical std.net.curl "buffer range" that
could be passed directly to matchFirst. It would take up, at
most, the size of the buffer in memory (which could grow if a
capture turns out to be larger than the buffer), and it wouldn't
read the unneeded portion of the resource at all. That would be
such a huge win for everyone, so I'm very excited about this
proposal. It addresses all of my current problems.
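If that's right, the whole get/sanitize dance could, I imagine,
collapse to something like this (purely hypothetical names;
`bufferRange` here is just my stand-in for whatever the
proposal's std.net.curl adapter ends up being called):

---
// Hypothetical API: `bufferRange` is a stand-in name for the proposed
// std.net.curl buffer range; only a buffer-sized window of the
// resource lives in memory at any one time.
auto captures = matchFirst(bufferRange(url), title_re);
---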
P.S. I love std.regex more and more every day. It made that
entitiesToUni function so easy to implement:
http://dpaste.dzfl.pl/688f2e7d