[RFC] I/O and Buffer Range

Brad Anderson eco at gnuk.net
Mon Dec 30 17:51:21 PST 2013


On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky 
wrote:
> Proposal

Having never written a parser I'm not really qualified to 
seriously comment on or review it, but it all looks very nice 
to me.

Speaking as just an end user of these things: whenever I use 
ranges over files or from, say, std.net.curl, the byLine/byChunk 
interface always feels terribly awkward, which often leads to me 
just giving up and loading the entire file/resource into an 
array. It's the boundaries that I stumble over. byLine never fits 
when I want to extract something multiline, and byChunk doesn't 
fit because if what I'm searching for lands on a chunk boundary 
I'll miss it.
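
To make the boundary problem concrete, here's roughly what the 
naive byChunk attempt looks like (just a sketch; page.html stands 
in for whatever resource I'm scanning). Any <title> element that 
happens to straddle a 4096-byte chunk boundary is silently 
missed:

---
import std.regex, std.stdio;

void main()
{
    auto title_re = regex(r"<title.*?>(.*?)<", "si");

    // Each chunk is searched in isolation, so a match that
    // straddles two chunks is never seen by matchFirst.
    foreach (chunk; File("page.html").byChunk(4096))
    {
        auto m = matchFirst(cast(char[]) chunk, title_re);
        if (!m.empty)
            writeln(m[1]);
    }
}
---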

Being able to just do a matchAll() on a file, std.net.curl, etc. 
without sacrificing performance and memory would be such a 
massive gain for usability.

Here's a simple example where I couldn't figure out how to use 
either byLine or byChunk without adding some clunky homegrown 
buffering solution. This is code that scrapes website titles 
from the pages of URLs in IRC messages.

---
import std.algorithm : filter, map;
import std.array : array;
import std.encoding : EncodingSchemeUtf8;
import std.exception : ifThrown;
import std.net.curl : get;
import std.regex : ctRegex, matchAll, matchFirst, replaceAll;
// entitiesToUni is my own helper; see the dpaste link in the P.S.

auto scrapeTitles(M)(in M message)
{
     static url_re = ctRegex!(r"(https?|ftp)://[^\s/$.?#].[^\s]*", "i");
     static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
     static ws_re = ctRegex!(r"(\s{2,}|\n|\t)");

     auto utf8 = new EncodingSchemeUtf8;
     auto titles =
          matchAll(message, url_re)
         .map!(match => match.captures[0])    // the URL itself
         .map!(url => get(url).ifThrown([]))  // fetch; empty on error
         .map!(bytes => cast(string)
                        utf8.sanitize(cast(immutable(ubyte)[])bytes))
         .map!(content => matchFirst(content, title_re))
         .filter!(captures => !captures.empty)
         .map!(capture => capture[1].idup)    // dup so original is GCed
         .map!(title => title.entitiesToUni.replaceAll(ws_re, " "))
         .array;

     return titles;
}
---

I really, really didn't want to use that std.net.curl.get(). It 
causes all sorts of problems if someone links to a huge resource, 
but I just could not figure out how to use byLine (the title 
regex capture can be multiline) or byChunk cleanly. Code elegance 
(a lot of it due to Jakob Ovrum's help in IRC) was really a goal 
here, as this is just a toy, so I went with get() for the time 
being, but it's always sad to sacrifice performance for elegance. 
I certainly didn't want to add some elaborate ever-growing buffer 
in the middle of this otherwise clean UFCS chain (and I'm not 
even sure how to incrementally regex-search a growing buffer, or 
whether that's even possible).
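
For the record, this is the kind of ever-growing buffer I mean (a 
sketch only, and not code I'd want in the middle of that chain): 
append each chunk and re-scan the whole accumulated buffer from 
the start.

---
import std.regex, std.stdio;

void main()
{
    auto title_re = regex(r"<title.*?>(.*?)<", "si");

    char[] buf;
    foreach (chunk; File("page.html").byChunk(4096))
    {
        buf ~= cast(char[]) chunk;          // buffer only ever grows
        auto m = matchFirst(buf, title_re); // re-scans from the start
        if (!m.empty)
        {
            writeln(m[1]);
            break;  // stop reading once a title is found
        }
    }
}
---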

If I'm understanding your proposal correctly, that get(url) could 
be replaced with a hypothetical std.net.curl "buffer range" that 
could be passed directly to matchFirst. It would take up, at 
most, the size of the buffer in memory (which could grow if a 
capture turns out to be larger than the buffer) and wouldn't read 
the unneeded portion of the resource at all. That would be such a 
huge win for everyone, so I'm very excited about this proposal. 
It addresses all of my current problems.
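
To spell out what I'm imagining (byBuffer is a name I just made 
up; nothing like it exists in std.net.curl today), the 
fetch-and-search steps of the chain would collapse into something 
like:

---
// Hypothetical: byBuffer(url) stands in for the proposed
// std.net.curl buffer range. matchFirst would consume it directly,
// pulling bytes on demand instead of loading the whole resource.
auto captures = matchFirst(byBuffer(url), title_re);
---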



P.S. I love std.regex more and more every day. It made that 
entitiesToUni function so easy to implement: 
http://dpaste.dzfl.pl/688f2e7d

