parsing HTML for a web robot (crawler) like application
Adam D. Ruppe via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Mar 23 21:02:04 PDT 2016
On Wednesday, 23 March 2016 at 10:49:03 UTC, Nordlöw wrote:
> HTML-docs here:
>
> http://dpldocs.info/experimental-docs/arsd.dom.html
Indeed, though the docs are still a work in progress (the lib is
now about 6 years old, but until recently ddoc blocked me from
putting examples in the comments, so I didn't bother. That's
fixed now, but I haven't finished writing them all up).

The basic idea for web scraping:
auto document = new Document();
document.parseGarbage(your_html_string);

// supports most of the CSS selector syntax; you might also know it from jQuery
Element[] elements = document.querySelectorAll("css selector");

// or if you just want the first hit (or null if none)...
Element element = document.querySelector("css selector");
And once you have a reference, you can read its contents in some form:

element.innerText
element.innerHTML
You can do a lot more too (a LOT more), but just these functions
should get you started.
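
To tie those pieces together, here's a minimal, self-contained
sketch (the HTML string and the "p.title" selector are just
made-up placeholders; in a real crawler the HTML would come from
an HTTP fetch):

import std.stdio;
import arsd.dom;

void main() {
    // made-up tag soup standing in for a fetched page
    string html = `<html><body><p class="title">Hello<p>World</body></html>`;

    auto document = new Document();
    document.parseGarbage(html);

    // print the text content of every matching element
    foreach (element; document.querySelectorAll("p.title"))
        writeln(element.innerText);
}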
The parseGarbage function will also need you to compile in the
characterencodings.d file from the same github repo. It handles
charset detection and translation as well as tag soup parsing. I
use it for a lot of web scraping myself.
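
For reference, a compile line might look something like this
(assuming you've put dom.d and characterencodings.d from the
arsd repo next to your own source file; adjust the paths to
wherever you cloned it):

dmd scraper.d dom.d characterencodings.d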