html fetcher/parser
Adam D. Ruppe via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Sat Aug 12 13:22:44 PDT 2017
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program
> which fetches a website every X-time and keeps track of all
> changes within specified dom elements.
My dom.d and http2.d combine to make this easy:
https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d
plus a support file for handling assorted character encodings:
https://github.com/adamdruppe/arsd/blob/master/characterencodings.d
Or via dub:
http://code.dlang.org/packages/arsd-official
The dom and http subpackages are the ones you want.
Docs: http://dpldocs.info/arsd.dom
Sample program:
---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
import std.stdio;
import arsd.dom;
void main() {
    auto document = Document.fromUrl("https://dlang.org/");
    writeln(document.optionSelector("p").innerText);
}
---
Output:
D is a general-purpose programming language with
static typing, systems-level access, and C-like syntax.
It combines efficiency, control and modeling power with
safety and programmer productivity.
Note that the https support requires OpenSSL available on your
system. Works best on Linux with it installed as a devel lib (so
like openssl-devel or whatever, just like you would if using it
from C).
How it works:
Document.fromUrl uses the http lib to fetch the page, then
automatically parses the contents as a dom document. It will
correct for common errors in webpage markup, character sets, etc.
Document and Element both have various methods for navigating,
modifying, and accessing the DOM tree. Here, I used
`optionSelector`, which works like `querySelector` in JavaScript
(it takes the same selector syntax as CSS), returning the first
matching element.
querySelector, however, returns null if there is nothing found.
optionSelector returns a dummy object instead, so you don't have
to explicitly test it for null and instead just access its
methods.
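To make the difference concrete, here is a minimal sketch. It assumes the arsd files are on your compile line as in the sample above, and it parses an inline HTML string (made up for the example) instead of hitting the network. `hasMatch` is a hypothetical helper, not something from dom.d.

```d
// compile: $ dmd thisfile.d ~/arsd/{dom,characterencodings}
import std.stdio : writeln;
import arsd.dom;

/// Hypothetical helper: true if at least one element matches the selector.
bool hasMatch(Document document, string selector) {
    // querySelector hands back null when nothing matches,
    // so it needs an explicit test before you touch the result.
    return document.querySelector(selector) !is null;
}

void main() {
    // Inline markup is an assumption for illustration only.
    auto document = new Document("<html><body><p>Hello</p></body></html>");

    writeln(hasMatch(document, "p"));  // there is a <p>
    writeln(hasMatch(document, "h1")); // there is no <h1>

    // optionSelector wraps the result, so chaining is safe even with no match.
    // As far as I recall, a string property on a missing element comes back empty.
    writeln("[", document.optionSelector("h1").innerText, "]");
    writeln("[", document.optionSelector("p").innerText, "]");
}
```

Use querySelector when a missing element is something you need to react to, and optionSelector when an empty result is fine.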
`innerText` returns the text inside, stripped of markup. You
might also want `innerHTML`, or `toString` to get the whole
thing, markup and all.
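A quick sketch of the three side by side, again on an inline HTML string that I made up for the example rather than a fetched page:

```d
// compile: $ dmd thisfile.d ~/arsd/{dom,characterencodings}
import std.stdio : writeln;
import arsd.dom;

void main() {
    // Inline markup is an assumption for illustration only.
    auto document = new Document(
        "<html><body><p>Hello <b>world</b></p></body></html>");

    auto p = document.querySelector("p");
    if (p is null)
        return; // nothing matched, nothing to show

    writeln(p.innerText);  // the text only, markup stripped
    writeln(p.innerHTML);  // the markup of the children
    writeln(p.toString()); // the element itself, markup and all
}
```

For change tracking, comparing `innerText` ignores markup-only edits, while comparing `innerHTML` catches them too.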
There's a lot more you can do too, but I think just these few
functions will be enough for your task.
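Putting those together for the fetch-every-X-time idea, a rough sketch might look like this. The URL, selector, interval, and number of rounds are all placeholders I picked for the example, and `snapshot` is a hypothetical helper, not part of dom.d.

```d
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
import core.thread : Thread;
import core.time : seconds;
import std.stdio : writeln;
import arsd.dom;

/// Hypothetical helper: text of the first element matching `selector`,
/// or an empty string when nothing matches.
string snapshot(Document document, string selector) {
    return document.optionSelector(selector).innerText;
}

void main() {
    // All four values below are assumptions for illustration.
    enum url = "https://dlang.org/";
    enum selector = "p";
    enum rounds = 3;
    auto interval = 60.seconds;

    string previous;
    foreach (round; 0 .. rounds) {
        auto current = snapshot(Document.fromUrl(url), selector);

        // Skip the comparison on the first round: there is nothing to compare to.
        if (round > 0 && current != previous)
            writeln("changed: ", current);

        previous = current;
        Thread.sleep(interval);
    }
}
```

A real program would probably want to catch fetch errors and store the previous text on disk, but that is beyond this sketch.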
Bonus fact: the levenshteinDistanceAndPath function from the
standard library makes doing a diff display of before and after
pretty simple:
http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html
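Here is one way a line-based diff with levenshteinDistanceAndPath could look. This is only a sketch: `diffLines` is a hypothetical helper, the "  " / "- " / "+ " prefixes are my own convention, and the sample lines are made up. It needs nothing beyond Phobos.

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.stdio : writeln;

/// Hypothetical helper: walks the edit path that turns `before` into `after`
/// and emits one prefixed line per kept, removed, or added entry.
string[] diffLines(string[] before, string[] after) {
    // result[0] is the edit distance, result[1] is the list of edit operations.
    auto result = levenshteinDistanceAndPath(before, after);

    string[] output;
    size_t i = 0; // position in before
    size_t j = 0; // position in after
    foreach (op; result[1]) {
        final switch (op) {
            case EditOp.none:       // same line on both sides
                output ~= "  " ~ before[i];
                i++; j++;
                break;
            case EditOp.substitute: // line replaced by a different one
                output ~= "- " ~ before[i];
                output ~= "+ " ~ after[j];
                i++; j++;
                break;
            case EditOp.insert:     // line only present in after
                output ~= "+ " ~ after[j];
                j++;
                break;
            case EditOp.remove:     // line only present in before
                output ~= "- " ~ before[i];
                i++;
                break;
        }
    }
    return output;
}

void main() {
    // Sample data is made up for the example.
    auto before = ["D is fun", "Version 1", "Goodbye"];
    auto after  = ["D is fun", "Version 2", "Goodbye", "New line"];
    foreach (line; diffLines(before, after))
        writeln(line);
}
```

To diff two snapshots of an element, you could split each `innerText` with `std.string.splitLines` and pass the two arrays in.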