Decoding HTML escape sequences

Hugo Florentino via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon May 12 15:09:02 PDT 2014


Hi, I have some documents in which some strings appear as HTML escape 
sequences in one of these forms:

\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e

%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri 
module:


import std.stdio, std.file, std.encoding, std.string, std.regex, 
std.uri;

static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
   if (args.length < 2)
   {
     writeln("Usage: unescape file1.htm > file2.htm");
     return -1;
   }
   auto input = cast(Latin1String) read(args[1]);
   string buffer;
   transcode(input, buffer);

   string output;
   foreach(m; matchAll(buffer, re)) output ~= decode(m.hit);

   writeln(output);

   return 0;
}


Unfortunately it doesn't seem to work 100%.
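
Two things in the code above may explain this: the pattern only covers the 
%XX form, not \xXX, and the loop appends only the decoded matches to 
output, so any text between escapes is dropped. A minimal sketch of a 
possible alternative follows, not a tested solution. It assumes each escape 
stands for a single Latin-1 character, and the helper name unescape is 
hypothetical:

```d
import std.conv : to;
import std.regex : regex, replaceAll;
import std.stdio : writeln;

// Hypothetical helper: decodes \xHH and %HH escapes and leaves all other text untouched.
// Assumption: each escape stands for one Latin-1 character (code 0x00-0xFF).
string unescape(string text)
{
    // Group 1 captures the two hex digits that follow either prefix.
    auto re = regex(`(?:\\x|%)([0-9a-fA-F]{2})`);
    // Replace each escape in place: hex digits -> code point -> UTF-8 string.
    return replaceAll!(m => to!string(cast(dchar) to!uint(m[1], 16)))(text, re);
}

void main()
{
    // prints: <SCRIPT> and <SCRIPT>
    writeln(unescape(`\x3C\x53\x43\x52\x49\x50\x54\x3e and %3C%53%43%52%49%50%54%3e`));
}
```

Because replaceAll substitutes each match where it occurs, the rest of the 
document is preserved. Percent-encoded multi-byte UTF-8 sequences would 
need different handling than this sketch provides.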

I would appreciate any suggestion.

Regards, Hugo

