html fetcher/parser

Adam D. Ruppe via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sun Aug 13 10:51:16 PDT 2017


On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
> Just curious, but is there a spec of sorts which defines which 
> errors should be fixed and such?

The HTML5 spec describes how you are supposed to parse various 
things, including the recovery paths for broken markup.

My module, however, isn't so formal. I used it for a web 
scraping thing at work that hit a few hundred sites, and I fixed 
bugs as they came up until the results were good enough for me. 
One thing I found is that a lot of sites claiming to be UTF-8 are 
actually Latin-1, so it validates the input and falls back to 
handle that. My http thing, while buggier, is similar: I once hit 
a server that ignored the Accept-Encoding header and always sent 
gzip anyway, so I had to handle that... and I noticed curl 
actually didn't!
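
A rough sketch of the two workarounds described above, written in Python rather than D and not taken from the module itself. The function names `maybe_gunzip` and `decode_html` are hypothetical, and the gzip check assumes that looking at the two-byte gzip magic number is an acceptable way to detect a compressed body the server sent without being asked:

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body if it looks like gzip, whatever the headers said."""
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # It only looked like gzip; hand back the bytes untouched.
        return body


def decode_html(body: bytes) -> str:
    """Trust the UTF-8 claim only if the bytes actually validate."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return body.decode("latin-1")


if __name__ == "__main__":
    page = gzip.compress("caf\u00e9".encode("latin-1"))
    print(decode_html(maybe_gunzip(page)))
```

The Latin-1 fallback never raises, which makes it a safe last resort, though text in some other legacy encoding would come out garbled rather than rejected.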

So on the one hand, there are surely still bugs and weird cases, 
but on the other hand, it did get a fair chunk of real-world use, 
so I am fairly confident it will be ok for most things.


