html fetcher/parser
Faux Amis via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Aug 14 16:15:13 PDT 2017
On 2017-08-13 19:51, Adam D. Ruppe wrote:
> On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
>> Just curious, but is there a spec of sorts which defines which errors
>> should be fixed and such?
>
> The HTML5 spec describes how you are supposed to parse various things,
> including the recovery paths for broken markup.
>
> My module, however, isn't so formal. I just used it for a web scraping
> thing at work that hit a few hundred sites and fixed bugs as they came
> up to give good enough results for me.... (one thing I found is a lot of
> sites claiming to be UTF-8 are actually latin-1, so it validates and
> falls back to handle that. My http thing, while buggier, is similar - I
> hit a server once that ignored the accept gzip header and always sent it
> anyway, so I had to handle that... and I noticed curl actually didn't!)
>
> So on the one hand, there's surely still bugs and weird cases, but on
> the other hand, it did get a fair chunk of real-world use so I am fairly
> confident it will be ok for most things.
>
Sounds good!
(Although following the spec would be the first step to a D html layout
engine :D )
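
For readers following along, the two recovery tricks mentioned in the quoted message (validate as UTF-8 and fall back to latin-1, and cope with a server that sends gzip whether or not it was asked for) can be sketched roughly as below. This is an illustration in Python, not Adam's D module; the function names decode_lenient and maybe_gunzip are hypothetical, and sniffing the gzip magic bytes is an assumption about how one might detect an unannounced gzip body.

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body if it looks like gzip, whatever the headers said."""
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # The magic bytes matched by coincidence or the stream is damaged:
        # hand back the original bytes untouched.
        return body


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, otherwise as latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("latin-1")


def body_to_text(body: bytes) -> str:
    """Hypothetical helper chaining both recovery steps."""
    return decode_lenient(maybe_gunzip(body))
```

The latin-1 fallback never raises, which is why it works as a last resort; the trade-off is that a page in some other legacy encoding will decode to the wrong characters rather than report an error.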