htmlget.d example and unicode parsing
Tyro[a.c.edwards]
nospam at home.com
Sat Apr 30 21:25:36 PDT 2011
Hello all,
I am trying to learn how to parse, modify, and redisplay a Japanese
webpage passed to me in a form and am wondering if anyone has an example
of how to do this.
I looked at htmlget and found that it has a couple of problems: namely, it
does not conform to current D2 practices. I am not sure that my hack can
be considered a fix, but I have attached it nonetheless. It now works
correctly on ASCII-based URLs but not UTF-8.
My lack of knowledge of how to properly parse Unicode documents has
left me stumped. I am therefore requesting some assistance in updating
the code so that it works with any URL. I have taken a look at std.utf,
and there are a few things there that could possibly assist me; however,
without examples I'm somewhat at a loss.
I'm assuming that the problem exists here:
    for (iw = 0; iw != line.length; iw++)
    {
        if (!icmp("</html>", line[iw .. line.length]))
            break print_lines;
    }
From what I understand, one cannot index a UTF sequence the same way as
one indexes ASCII. What is the proper way to rewrite this so that it
parses the UTF characters correctly? An example would do wonders.
Thanks
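
For reference, here is a minimal sketch of how the loop above might be made
UTF-8 aware. It assumes a current Phobos in which icmp lives in std.uni and
stride in std.utf. The names remainderIsClosingTag, hasClosingTag and
countCodePoints are hypothetical and are not part of htmlget.d. The idea is
that a D string is an array of UTF-8 code units, so iw++ can land in the
middle of a multi-byte character. Advancing by stride(line, iw) instead keeps
every slice on a code point boundary.

```d
import std.string : indexOf;
import std.typecons : No;
import std.uni : icmp;
import std.utf : stride;

/// Mirrors the original loop: true if some tail of `line` equals
/// "</html>" when compared case-insensitively.
bool remainderIsClosingTag(string line)
{
    // Advance by the width of the code point at iw (1 to 4 code units),
    // so line[iw .. $] never begins inside a multi-byte sequence.
    for (size_t iw = 0; iw < line.length; iw += stride(line, iw))
    {
        if (icmp("</html>", line[iw .. $]) == 0)
            return true;
    }
    return false;
}

/// Simpler alternative: case-insensitive substring search anywhere in `line`.
bool hasClosingTag(string line)
{
    return line.indexOf("</html>", No.caseSensitive) != -1;
}

/// Counts code points rather than code units by stepping with stride.
size_t countCodePoints(string s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}
```

With these definitions, hasClosingTag("日本語</HTML>") should be true, and
countCodePoints("日本語") should be 3 even though the string's length is 9
code units. Because the search text is pure ASCII, the indexOf version is
probably the simplest choice. The stride version is closer to the original
loop and shows the general pattern for walking a UTF-8 string by index.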
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: htmlget.d
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20110501/ad43df88/attachment.ksh>