html fetcher/parser

Adam D. Ruppe via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sat Aug 12 13:22:44 PDT 2017


On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program 
> which fetches a website every X-time and keeps track of all 
> changes within specified dom elements.

My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

plus a support file for handling other character encodings:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d


Or via dub:

http://code.dlang.org/packages/arsd-official

The dom and http subpackages are the ones you want.


Docs: http://dpldocs.info/arsd.dom


Sample program:

---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import std.stdio;
import arsd.dom;

void main() {
    auto document = Document.fromUrl("https://dlang.org/");
    writeln(document.optionSelector("p").innerText);
}
---

Output:

D is a general-purpose programming language with
         static typing, systems-level access, and C-like syntax.
         It combines efficiency, control and modeling power with safety
         and programmer productivity.




Note that HTTPS support requires OpenSSL to be available on your 
system. It works best on Linux with OpenSSL installed as a devel 
lib (openssl-devel or similar, just as you would need if using it 
from C).



How it works:


Document.fromUrl uses the http lib to fetch the page, then 
automatically parses the contents as a DOM document. It will 
correct for common errors in webpage markup, character sets, etc.

Document and Element both have various methods for navigating, 
modifying, and accessing the DOM tree. Here, I used 
`optionSelector`, which works like `querySelector` in JavaScript 
(with the same selector syntax as CSS), returning the first 
matching element.

querySelector, however, returns null if nothing is found. 
optionSelector returns a dummy object instead, so you don't have 
to explicitly test it for null and can just access its methods 
directly.
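
To make the difference concrete, here is a minimal sketch that parses a small in-memory page instead of fetching one. The `firstText` helper and the sample markup are made up for illustration, and it assumes a miss through optionSelector gives back an empty string for `innerText`:

```d
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import std.stdio : writeln;
import arsd.dom;

/// Hypothetical helper: text of the first element matching `selector`,
/// or an empty string when nothing matches.
string firstText(Document document, string selector) {
    // No null check needed: optionSelector returns a safe wrapper.
    return document.optionSelector(selector).innerText;
}

void main() {
    // Small in-memory page so the example needs no network access.
    auto document = new Document(
        "<html><body><p>Hello <b>world</b></p></body></html>");

    // querySelector hands back null on a miss, so it has to be checked
    // before any member access.
    auto missing = document.querySelector("div.missing");
    if (missing is null)
        writeln("querySelector found nothing");

    // optionSelector can be chained straight away, hit or miss.
    writeln("hit:  [", firstText(document, "p"), "]");
    writeln("miss: [", firstText(document, "div.missing"), "]");
}
```

For a polling program that is convenient: a selector that temporarily matches nothing just yields empty text rather than crashing the loop.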

`innerText` returns the text inside, stripped of markup. You 
might also want `innerHTML`, or `toString` to get the whole 
thing, markup and all.



There's a lot more you can do too, but I think just these few 
functions will be enough for your task.
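
Putting those functions together for the original task (fetch a page every so often and watch certain elements), a rough sketch could look like the following. The `ChangeTracker` struct, the selector list, and the ten-minute interval are all placeholders made up for this example, and there is no error handling around the fetch, which you would want in a real program:

```d
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import core.thread : Thread;
import core.time : minutes;
import std.stdio : writeln;
import arsd.dom;

/// Hypothetical helper: remembers the last text seen for each selector.
struct ChangeTracker {
    string[string] lastSeen;

    /// Stores `text` for `selector`. Returns true only if an earlier
    /// value existed and differs from the new one.
    bool update(string selector, string text) {
        auto previous = selector in lastSeen;
        bool changed = previous !is null && *previous != text;
        lastSeen[selector] = text;
        return changed;
    }
}

void main() {
    // Placeholder values; adjust for the site you care about.
    enum url = "https://dlang.org/";
    auto selectors = ["p", "h1"];
    auto interval = 10.minutes;

    ChangeTracker tracker;
    while (true) {
        auto document = Document.fromUrl(url);
        foreach (selector; selectors) {
            // First matching element only; empty text if nothing matches.
            string text = document.optionSelector(selector).innerText;
            if (tracker.update(selector, text))
                writeln("changed: ", selector, " -> ", text);
        }
        Thread.sleep(interval);
    }
}
```

The first pass only records a baseline, so nothing is reported until a later fetch differs from it.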


Bonus fact: the standard library's levenshteinDistanceAndPath 
function makes doing a diff display of before and after pretty 
simple:

http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html
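
As a sketch of that idea, the function can be run over arrays of lines, with the returned edit path walked to print a simple diff. `diffLines` and the sample data are made up for this example; only `levenshteinDistanceAndPath` and `EditOp` come from Phobos:

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.stdio : writeln;

/// Hypothetical helper: turns two line arrays into diff-style output,
/// prefixing kept lines with "  ", removed with "- ", added with "+ ".
string[] diffLines(string[] before, string[] after) {
    // result[0] is the edit distance, result[1] the list of edit steps
    // that turn `before` into `after`.
    auto result = levenshteinDistanceAndPath(before, after);

    string[] output;
    size_t i = 0; // position in `before`
    size_t j = 0; // position in `after`
    foreach (op; result[1]) {
        final switch (op) {
            case EditOp.none:
                output ~= "  " ~ before[i];
                i++; j++;
                break;
            case EditOp.substitute:
                output ~= "- " ~ before[i];
                output ~= "+ " ~ after[j];
                i++; j++;
                break;
            case EditOp.insert:
                output ~= "+ " ~ after[j];
                j++;
                break;
            case EditOp.remove:
                output ~= "- " ~ before[i];
                i++;
                break;
        }
    }
    return output;
}

void main() {
    auto before = ["Version 1.0", "Released in May", "Download here"];
    auto after  = ["Version 1.1", "Released in May", "Download here"];
    foreach (line; diffLines(before, after))
        writeln(line);
}
```

Splitting the old and new `innerText` into lines first (with `std.string.splitLines`, for instance) keeps the output readable; passing the strings directly would give a character-by-character diff instead.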

