For those ready to take the challenge
Adam D. Ruppe via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Sat Jan 10 09:39:16 PST 2015
On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
> For the challenge to make sense it would entail parsing all
> legal HTML5 documents, extracting all resource links,
> converting them into absolute form and printing them one per
> line. With no hiccups.
Though, that's still a library thing rather than a language thing.
dom.d and the Uri struct in cgi.d should be able to do all that,
in just a few lines even, but that's only because I've done a
*lot* of web scraping with these libs before, so I made them work
for that.
In fact... let me do it. I'll use my http2.d instead of cgi.d,
actually; it has a similar Uri struct, just more focused on
client requests.
import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // http2 is a newish module of mine that aims to imitate
    // a browser in some ways (without depending on curl btw)
    auto client = new HttpClient();
    auto request = client.navigateTo(base);
    auto document = new Document();

    // and http2 provides an asynchronous api, but you can
    // pretend it is sync by just calling waitForCompletion
    auto response = request.waitForCompletion();

    // parseGarbage uses a few tricks to fix up invalid/broken HTML
    // tag soup and auto-detect character encodings, including when
    // the page lies about being UTF-8 but is actually Windows-1252
    document.parseGarbage(response.contentText);

    // Uri.basedOn returns a new absolute URI based on something else
    foreach(a; document.querySelectorAll("a[href]"))
        writeln(Uri(a.href).basedOn(base));
}
Snippet of the printouts:
[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]
The latter are relative links that were resolved against the base;
the first few were already absolute. Seems to have worked.
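To make the resolution rule concrete, here's a minimal standalone
sketch of just the Uri step, using two links like the ones from
the printout above:

import arsd.http2;
import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // an already-absolute link comes back unchanged
    writeln(Uri("http://www.computerhistory.org").basedOn(base));

    // a relative link is resolved against the base document's
    // URI, so this prints http://www.stroustrup.com/papers.html
    writeln(Uri("papers.html").basedOn(base));
}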
There are other kinds of links besides a[href], but fetching them
is as simple as adding them to the selector, or looping over them
separately:
foreach(a; document.querySelectorAll("script[src]"))
    writeln(Uri(a.src).basedOn(base));
There are none on that page, and no <link>s either, but they are
easy enough to handle with the lib.
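If you did want to grab everything in one sweep, a sketch like
this would slot into the same main() as above. The selector list
is just my guess at the usual resource-bearing elements, and it
assumes querySelectorAll accepts comma-separated selector groups
the way CSS does:

// hypothetical wider sweep; extend the lists as needed
foreach(e; document.querySelectorAll("a[href], link[href], area[href]"))
    writeln(Uri(e.href).basedOn(base));
foreach(e; document.querySelectorAll("script[src], img[src], iframe[src]"))
    writeln(Uri(e.src).basedOn(base));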
Looking at the source of that page, I find some invalid HTML and
a lie about the character set. How did Document.parseGarbage do?
Pretty well: dumping the parsed DOM tree shows it auto-corrected
the problems I spotted by eye.
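To see that fixup in isolation, feed parseGarbage some
deliberately broken markup and dump what it made of it. A minimal
sketch, assuming Document.toString gives back the serialized,
corrected tree:

import arsd.dom;
import std.stdio;

void main() {
    // deliberately broken tag soup: unclosed <p>s and
    // misnested <b>/<i> tags
    auto soup = "<html><body><p>one<p>two<b><i>three</b></i>";

    auto document = new Document();
    document.parseGarbage(soup);

    // print the serialized tree to inspect how the parser
    // closed and renested the tags
    writeln(document.toString());
}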