html fetcher/parser
Adam D. Ruppe via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Sun Aug 13 10:51:16 PDT 2017
On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
> Just curious, but is there a spec of sorts which defines which
> errors should be fixed and such?
The HTML5 spec describes how you are supposed to parse various
things, including the recovery paths for broken markup.
My module, however, isn't so formal. I just used it for a web
scraping thing at work that hit a few hundred sites, and fixed
bugs as they came up until the results were good enough for me.
One thing I found is that a lot of sites claiming to be UTF-8 are
actually latin-1, so it validates the input and falls back to
latin-1 when that fails. My http thing, while buggier, is
similar: I once hit a server that ignored the Accept-Encoding
header and always sent gzip anyway, so I had to handle that...
and I noticed curl actually didn't!
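To make those two workarounds concrete, here is a rough sketch in
Python. It is not the D code from the module; the function names
(maybe_gunzip, decode_text) are made up for illustration, and it
assumes the whole response body is already in memory as bytes.

```python
import gzip
import zlib

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body if it looks like gzip, whatever the headers said.

    Hypothetical helper: it sniffs the magic bytes instead of trusting
    Content-Encoding, and returns the body unchanged if decompression fails.
    """
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Started with the magic bytes but was not a valid gzip stream.
        return body


def decode_text(body: bytes) -> str:
    """Validate as UTF-8 first; fall back to latin-1 if that fails.

    latin-1 maps every byte to a character, so the fallback cannot raise.
    """
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("latin-1")
```

You would call it as decode_text(maybe_gunzip(raw_bytes)). The order
matters: decompress first, then guess the charset, since the
compressed bytes are not text in either encoding.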
So on the one hand, there are surely still bugs and weird cases,
but on the other hand, it did get a fair chunk of real-world use
so I am fairly confident it will be ok for most things.