Extracting Structure from HTML using Adam's dom.d

via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu Jan 22 01:27:16 PST 2015


On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
> On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
>> This means that I need some kind of interface to extract all 
>> the contents of each <p> paragraph that is preceded by an <h2> 
>> heading with a specific id (say "H2_A") or content (say "More 
>> important"). How do I accomplish that?
>
> You can do that with a CSS selector like:
>
> document.querySelector("#H2_A + p");
>
> or even document.querySelectorAll("h2 + p") to get every P 
> immediately following an h2.
>
>
> My implementation works mostly the same as in JavaScript, so you 
> can read more about CSS selectors anywhere on the net, for example 
> https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector
>
>> Further, is there a way to extract only the "contents" of an 
>> Element instance, that is "Stuff" from "<p>Stuff</p>", for 
>> each Element in the return of for example 
>> getElementsByTagName(`p`)?
>
> Element.innerText returns all the plain text inside with all 
> tags stripped out (same as the function in IE)
>
> Element.innerHTML returns all the content inside, including 
> tags (same as the function in all browsers)
>
> Element.firstInnerText returns all the text up to the first 
> tag, but then stops there. (this is a custom extension)
>
>
> You can call those in a regular foreach loop or with something 
> like std.algorithm.map to get the info from an array of 
> elements.

Brilliant! Thanks!

BTW: Would you be interested in receiving a PR for dom.d where I 
replace array allocations with calls to lazy ranges?
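
For the archive, here is a minimal sketch of how the selector and text calls quoted above might be combined. It assumes arsd's dom.d (plus whatever modules it depends on) is on the import path. The sample markup and the helper name paragraphTexts are made up for illustration and are not part of dom.d.

```d
import arsd.dom;               // Adam's dom.d; assumed to be on the import path
import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical sample input, not taken from the thread.
enum sampleHtml =
    `<html><body>` ~
    `<h2 id="H2_A">More important</h2><p>First <b>bold</b> paragraph</p>` ~
    `<h2 id="H2_B">Less important</h2><p>Second paragraph</p>` ~
    `</body></html>`;

// Hypothetical helper: parse the markup, select elements with a CSS
// selector, and return the plain text of each match.
string[] paragraphTexts(string html, string selector = "h2 + p")
{
    auto document = new Document(html);
    return document.querySelectorAll(selector)
        .map!(e => e.innerText)  // tags stripped, text only
        .array;
}

void main()
{
    // Every <p> that immediately follows an <h2>.
    writeln(paragraphTexts(sampleHtml));

    // Only the <p> right after the heading with id "H2_A".
    writeln(paragraphTexts(sampleHtml, "#H2_A + p"));

    // The three accessors from the reply, applied to a single element.
    auto document = new Document(sampleHtml);
    auto p = document.querySelector("#H2_A + p");
    if (p !is null)
    {
        writeln(p.innerText);      // plain text with tags stripped
        writeln(p.innerHTML);      // inner markup with tags kept
        writeln(p.firstInnerText); // text up to the first child tag
    }
}
```

Selecting by heading content rather than by id is not covered here, since a plain CSS selector cannot match on text. That case would probably need a loop over the h2 elements with a check on innerText.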

