html fetcher/parser

Faux Amis via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Aug 14 16:15:13 PDT 2017


On 2017-08-13 19:51, Adam D. Ruppe wrote:
> On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
>> Just curious, but is there a spec of sorts which defines which errors 
>> should be fixed and such?
> 
> The HTML5 spec describes how you are supposed to parse various things, 
> including the recovery paths for broken markup.
> 
> My module, however, isn't so formal. I just used it for a web scraping 
> thing at work that hit a few hundred sites and fixed bugs as they came 
> up to give good enough results for me.... (one thing I found is a lot of 
> sites claiming to be UTF-8 are actually latin-1, so it validates and 
> falls back to handle that. My http thing, while buggier, is similar - I 
> hit a server once that ignored the accept gzip header and always sent it 
> anyway, so I had to handle that... and I noticed curl actually didn't!)
> 
> So on the one hand, there's surely still bugs and weird cases, but on 
> the other hand, it did get a fair chunk of real-world use so I am fairly 
> confident it will be ok for most things.
> 

Sounds good!
(Although following the spec would be the first step to a D html layout 
engine :D )
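
As a rough illustration of the two workarounds described in the quoted 
message (validate as UTF-8 and fall back to latin-1; cope with a server 
that sends gzip regardless of what was requested), here is a minimal 
sketch. It is written in Python rather than D, and it is not taken from 
Adam's module: the name `decode_body` and the choice to sniff the gzip 
magic bytes are assumptions made for the example.

```python
import gzip

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw: bytes) -> str:
    """Hypothetical helper: turn a raw HTTP body into text."""
    # Assumption: rather than trusting headers, sniff the payload, since
    # a server may send gzip even when the client did not ask for it.
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)

    try:
        # Strict decoding raises on any invalid UTF-8 sequence, so this
        # doubles as the validation step.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this
        # fallback cannot fail, although it may still be the wrong guess.
        return raw.decode("latin-1")
```

A page that claims UTF-8 but is really latin-1 usually fails the strict 
decode at its first accented character, which is what triggers the 
fallback here. Pure-ASCII pages decode identically either way.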

