htmlget.d example and unicode parsing

Tyro[a.c.edwards] nospam at home.com
Sat Apr 30 21:25:36 PDT 2011


Hello all,

I am trying to learn how to parse, modify, and redisplay a Japanese 
webpage passed to me in a form and am wondering if anyone has an example 
of how to do this.

I looked at htmlget and found that it has a couple of problems: namely, it 
does not conform to current D2 practices. I am not sure that my hack can 
be considered a fix, but I have attached it nonetheless. It now works 
correctly on ASCII-based URLs but not on UTF-8 ones.

My lack of knowledge of how to properly parse Unicode documents has 
left me stumped. I am therefore requesting some assistance in updating 
the code so that it works with any URL. I have taken a look at std.utf, 
and there are a few things there that could possibly assist me; however, 
without examples I'm somewhat at a loss.
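
To make the std.utf part concrete, here is a minimal sketch of how I 
understand the difference between bytes (code units) and characters 
(code points) in a D string. The helper name countCodePoints and the 
sample text are made up for illustration, and the sketch assumes the 
input is valid UTF-8:

```d
import std.utf : decode, stride;

/// Hypothetical helper: counts code points (not bytes) in a UTF-8 string.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s) // foreach with dchar decodes one code point per step
        ++n;
    return n;
}

void main()
{
    string s = "日本語 html";

    // .length counts UTF-8 code units (bytes): 3 kanji * 3 bytes + 5 ASCII.
    assert(s.length == 14);
    assert(countCodePoints(s) == 8);

    // decode() reads the code point starting at index i and advances i past it.
    size_t i = 0;
    dchar first = decode(s, i);
    assert(first == '日' && i == 3);

    // stride() reports how many code units the sequence at index i occupies.
    assert(stride(s, i) == 3);
}
```

If this is right, then s[1] is not a character at all but the second 
byte of the first kanji, which may be why indexing one position at a 
time goes wrong.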

I'm assuming that the problem exists here:

        for (iw = 0; iw != line.length; iw++)
        {
            if (!icmp("</html>", line[iw .. line.length]))
                break print_lines;
        }

From what I understand, one cannot index a UTF sequence the same way 
you index ASCII. What is the proper way to rewrite this so that it 
parses the UTF characters correctly? An example would do wonders.
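
For reference, here is a rough sketch of one direction that might work, 
offered as a guess rather than a verified fix. Slicing line[iw .. 
line.length] at every index can start in the middle of a multi-byte 
sequence, and I suspect that is what upsets the comparison. Because the 
tag itself is pure ASCII, a substring search over the whole line should 
never need to cut a sequence apart. The function name 
containsClosingHtml is hypothetical, and the sketch assumes the line is 
valid UTF-8:

```d
import std.string : indexOf;
import std.typecons : No;

/// Hypothetical helper: true if the line contains "</html>" in any letter case.
/// The whole line is searched as-is, so no slice begins inside a multi-byte
/// UTF-8 sequence.
bool containsClosingHtml(const(char)[] line)
{
    return indexOf(line, "</html>", No.caseSensitive) != -1;
}

void main()
{
    assert(containsClosingHtml("日本語のページ</HTML>"));
    assert(containsClosingHtml("</html>"));
    assert(!containsClosingHtml("<p>こんにちは</p>"));
}
```

With something like this, the inner for loop could presumably be 
replaced by a single check that does break print_lines when 
containsClosingHtml(line) is true. Note that this matches the tag 
anywhere in the line, which is slightly more permissive than the 
original comparison against the remainder of the line.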

Thanks
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: htmlget.d
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20110501/ad43df88/attachment.ksh>

