For those ready to take the challenge

Adam D. Ruppe
Sat Jan 10 09:39:16 PST 2015


On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
> For the challenge to make sense it would entail parsing all
> legal HTML5 documents, extracting all resource links,
> converting them into absolute form and printing them one per
> line. With no hiccups.

Though, that's still a library thing rather than a language thing.

dom.d and the Url struct in cgi.d should be able to do all that, 
in just a few lines even, but that's just because I've done a 
*lot* of web scraping with these libs before, so I made them work 
well for that.

In fact... let me do it. I'll use my http2.d instead of cgi.d, 
actually; it has a similar Url struct, just more focused on 
client requests.


import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
	auto base = Uri("http://www.stroustrup.com/C++.html");

	// http2 is a newish module of mine that aims to imitate
	// a browser in some ways (without depending on curl btw)
	auto client = new HttpClient();
	auto request = client.navigateTo(base);
	auto document = new Document();

	// and http2 provides an asynchronous api but you can
	// pretend it is sync by just calling waitForCompletion
	auto response = request.waitForCompletion();

	// parseGarbage uses a few tricks to fix up invalid/broken
	// HTML tag soup and auto-detect character encodings,
	// including when the page lies about being UTF-8 but is
	// actually Windows-1252
	document.parseGarbage(response.contentText);

	// Uri.basedOn returns a new absolute URI based on something else
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(Uri(a.href).basedOn(base));
}
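
(If you want to run it: I just compile everything together, something 
like "dmd scrape.d arsd/dom.d arsd/http2.d arsd/characterencodings.d". 
If I remember right, dom.d needs characterencodings.d from the same 
repo for the encoding detection; adjust the paths to wherever you put 
the files.)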


A snippet of the output:

[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]

The latter were relative links in the page source that it resolved 
against the base; the first few were already absolute and passed 
through unchanged. Seems to have worked.
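
To make the resolution concrete, here's a quick sketch that could be 
dropped into the same main() as above (the expected strings are 
written from memory of how basedOn behaves, so treat the asserts as 
illustrative):

	// a relative link resolves against the base's directory
	assert(Uri("index.html").basedOn(base).toString ==
		"http://www.stroustrup.com/index.html");

	// an already-absolute link passes through unchanged
	assert(Uri("http://www.computerhistory.org").basedOn(base).toString ==
		"http://www.computerhistory.org");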


There are other kinds of links besides a[href], but fetching them 
is as simple as adding them to the selector, or looping over them 
separately:

	foreach(a; document.querySelectorAll("script[src]"))
		writeln(Uri(a.src).basedOn(base));

There were none on that page, and no <link>s either, but it is easy 
enough to do with the lib.
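
If you want to grab the usual suspects in one pass, something like 
this ought to work in the same program (I believe the selector engine 
takes comma-separated groups; if not, separate loops do the same job):

	// href-carrying elements: anchors and <link> tags
	foreach(e; document.querySelectorAll("a[href], link[href]"))
		writeln(Uri(e.href).basedOn(base));

	// src-carrying elements: scripts, images, iframes
	foreach(e; document.querySelectorAll("script[src], img[src], iframe[src]"))
		writeln(Uri(e.src).basedOn(base));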



Looking at the source of that page, I find some invalid HTML and a 
lie about the character set. How did Document.parseGarbage do? 
Pretty well: printing the parsed DOM tree shows it auto-corrected 
the problems I spotted by eye.
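
You can also check it in isolation: feed it some deliberately broken 
soup and print the tree back out. A small sketch (the exact serialized 
output may differ from what the comments suggest):

	import arsd.dom;
	import std.stdio;

	void main() {
		auto doc = new Document();
		// unclosed tags and a stray closing tag, on purpose
		doc.parseGarbage("<p>unclosed <b>bold <p>next paragraph </i> oops");
		// the parsed tree comes out well-formed; print it to inspect
		writeln(doc.toString());
	}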

