Learning to XML with D

Derix via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Feb 9 03:54:43 PST 2015


> my dom.d works in a familiar way
OK, will check it


> useful for scraping html sites.
Not exactly what I'm doing, but close. I'm in the midst of a
self-training spree, and what I use as test-tube fodder is the
following: a collection of 300+ html files constituting an
electronic version of a technical book. My intent is to generate
a clickable table of contents by parsing the files for the css
styles specific to section headers. The first leg of the journey
was to normalize styles across the bunch. That is done, more or
less. I already have a proto-toc, but it is not entirely
satisfying: it lacks handles for proper styling, and the way I
arrived there is kinda brutish.
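
To make the idea concrete, here is a rough sketch of the "parse for
header styles, emit a list of links" step, using only Phobos. It is
an illustration under assumptions, not the actual code: the class
name "section-header", the file name "ch01.html" and the helpers
collectHeaders and renderToc are all made up for the example, and a
regex is only tolerable here because the styles were normalized
first.

```d
import std.array : appender;
import std.regex : matchAll, regex;
import std.stdio : writeln;

/// One table-of-contents entry: source file plus raw header payload.
struct TocEntry
{
    string file;
    string title;
}

/// Collects the payload of every <p class="cssClass"> element in html.
/// Assumes cssClass contains no regex metacharacters and that the
/// markup is regular enough for a regex (hypothetical helper).
TocEntry[] collectHeaders(string file, string html, string cssClass)
{
    // "s" lets the dot match newlines, so multi-line headers are caught.
    auto re = regex(`<p\s+class="` ~ cssClass ~ `"\s*>(.*?)</p>`, "s");
    auto entries = appender!(TocEntry[])();
    foreach (m; matchAll(html, re))
        entries.put(TocEntry(file, m[1]));
    return entries.data;
}

/// Renders the entries as a flat HTML list of links to their files.
string renderToc(TocEntry[] entries)
{
    auto output = appender!string();
    output.put("<ul class=\"toc\">\n");
    foreach (e; entries)
        output.put("  <li><a href=\"" ~ e.file ~ "\">" ~ e.title
            ~ "</a></li>\n");
    output.put("</ul>\n");
    return output.data;
}

void main()
{
    // Sample input; the real run would loop over the 300+ files.
    auto html = `<p class="section-header">Forms</p><p>Body text.</p>`;
    auto entries = collectHeaders("ch01.html", html, "section-header");
    writeln(renderToc(entries));
}
```

The titles are copied verbatim, so any inline markup inside a header
ends up inside the link text as well.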

One hurdle I haven't overcome yet is that the text content, and
the section headers themselves, contain some html tags (well, the
book /is/ about html, among other things). For example, some
section headers are rendered as two bold lines, with a fat <br/>
in the middle and <b></b> around. So when I parse the payload of
the <p> element, I end up with a <br/> in the middle of
a sentence. Survivable, but unclean.
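
One possible way to clean such a payload, sketched with std.regex
only: turn each <br/> into a space, drop the remaining tags, then
collapse the whitespace. The function name flattenHeader and the
sample header are invented for the example. Escaped markup such as
&lt;br&gt; is left alone, which is probably what a book about html
wants, but that is an assumption.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.string : strip;

/// Turns the raw payload of a header element into one line of plain
/// text (hypothetical helper, not part of dom.d).
string flattenHeader(string payload)
{
    // <br>, <br/> or <br /> separates two visual lines: make it a space.
    auto text = replaceAll(payload, regex(`<br\s*/?>`, "i"), " ");
    // Drop every remaining tag, such as <b> and </b>.
    text = replaceAll(text, regex(`<[^>]+>`), "");
    // Collapse the runs of whitespace left behind by the substitutions.
    text = replaceAll(text, regex(`\s+`), " ");
    return text.strip;
}

void main()
{
    writeln(flattenHeader("<b>Block-level<br/>elements</b>"));
    // prints "Block-level elements"
}
```

With a real DOM parser the same result should come from asking the
element for its text content instead of its markup, which is more
robust than regexes on anything irregular.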

So yeah, I'll give it another try with your dom.d

