htmlget.d example and unicode parsing

Nick Sabalausky a at a.a
Sun May 1 11:54:47 PDT 2011


"Tyro[a.c.edwards]" <nospam at home.com> wrote in message 
news:ipinj3$1c77$1 at digitalmars.com...
> Hello all,
>
> I am trying to learn how to parse, modify, and redisplay a Japanese
> webpage passed to me in a form and am wondering if anyone has an example
> of how to do this.
>
> I looked at htmlget and found that it has a couple of problems: namely, it
> does not conform to current D2 practices. I am not sure that my hack can
> be considered a fix, but I have attached it nonetheless. It now works
> correctly on ASCII-based URLs but not UTF-8.
>
> My lack of knowledge of how to properly parse Unicode documents has
> left me stumped. I am therefore requesting some assistance in updating
> the code so that it works with any URL. I have taken a look at std.utf
> and there are a few things there that could possibly assist me; however,
> without examples I'm somewhat at a loss.
>
> I'm assuming that the problem exists here:
>
> for (iw = 0; iw != line.length; iw++)
>         {
>             if (!icmp("</html>", line[iw .. line.length]))
>                 break print_lines;
>         }
>
> From what I understand, one cannot index a UTF sequence the same way
> you index ASCII.

Depends on what exactly you're doing. There are many cases where indexing 
utf like ASCII works fine, and your code above looks like one of the cases 
where it should work (Unless icmp throws or asserts on invalid code-unit 
sequences. Anyone know offhand if it does?).
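
To illustrate why code-unit indexing is safe for this kind of search: in UTF-8, every byte of a multi-byte sequence is 0x80 or higher, so an ASCII-only tag such as "</html>" can never match in the middle of a Japanese character. The sketch below is a byte-wise, ASCII-only search that never decodes anything, so it does not depend on how icmp treats invalid sequences. The name findAsciiTag is made up for this example and is not part of htmlget.d.

```d
import std.ascii : toLower;
import std.stdio : writeln;

/// Hypothetical helper: returns the code-unit index of the first
/// ASCII case-insensitive match of `tag` in `line`, or -1 if absent.
/// Assumes `tag` contains only ASCII characters.
ptrdiff_t findAsciiTag(const(char)[] line, const(char)[] tag)
{
    // Stop as soon as fewer than tag.length code units remain.
    for (size_t i = 0; i + tag.length <= line.length; i++)
    {
        size_t j = 0;
        // std.ascii.toLower leaves non-ASCII code units unchanged.
        while (j < tag.length && toLower(line[i + j]) == toLower(tag[j]))
            j++;
        if (j == tag.length)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    writeln(findAsciiTag("<p>日本語</p></HTML> trailing", "</html>") != -1); // true
    writeln(findAsciiTag("<p>日本語</p>", "</html>") != -1);                 // false
}
```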

But you do have a non-utf-related bug in that loop. If there's anything in 
'line' after the "</html>" tag, then it won't detect the tag because you're 
slicing with the length of 'line' instead of the length of "</html>".

So it should be:

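for (iw = 0; iw != line.length; iw++)
{
    immutable endTag = "</html>";
    // Compare only the endTag.length code units that start at iw,
    // and only when that many code units remain in the line.
    if (iw + endTag.length <= line.length &&
        !icmp(endTag, line[iw .. iw + endTag.length]))
        break print_lines;
}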
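
As an alternative to the hand-written loop, here is a minimal sketch that lets Phobos do the search. It assumes a reasonably recent Phobos in which std.string.indexOf accepts a CaseSensitive flag (written here as No.caseSensitive); the helper name hasEndTag is invented for the example.

```d
import std.stdio : writeln;
import std.string : indexOf;
import std.typecons : No;

/// Hypothetical helper: true if `line` contains "</html>" in any letter case,
/// no matter what comes before or after the tag.
bool hasEndTag(string line)
{
    // indexOf returns -1 when the substring is absent.
    return indexOf(line, "</html>", No.caseSensitive) != -1;
}

void main()
{
    writeln(hasEndTag("</HTML><!-- trailing junk -->")); // true
    writeln(hasEndTag("<p>日本語</p>"));                  // false
}
```

Inside the original loop this would replace the whole inner for statement with a single test on the line.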

On the topic of unicode, this is a really good introduction to the details 
of it:
http://www.joelonsoftware.com/articles/Unicode.html

But once you read that, keep in mind there are a few important details he
failed to mention: A code point is made up of code units, yes, but a single
code point is *not* always an entire character (aka "grapheme"). Because of
combining characters, a character could be made up of multiple code points
(just like how a code point can be made up of multiple code units). Also,
there are certain characters that can be represented with more than one
specific sequence of code points (and that gets into Unicode normalization).
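
The three levels can be seen side by side in a small sketch. It assumes a reasonably recent Phobos where std.uni provides byGrapheme and normalize; the helper names codeUnits, codePoints and graphemes are made up for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helpers, one per level of the text above.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded code points
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string decomposed = "e\u0301";
    // The same character as the single precomposed code point U+00E9.
    string precomposed = "\u00E9";

    writeln(codeUnits(decomposed), " ", codePoints(decomposed), " ",
            graphemes(decomposed));                     // 3 2 1

    // Different code point sequences, so a plain comparison fails...
    writeln(decomposed == precomposed);                 // false
    // ...until the decomposed form is normalized to NFC.
    writeln(normalize!NFC(decomposed) == precomposed);  // true
}
```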



