Decoding HTML escape sequences
Hugo Florentino via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon May 12 15:09:02 PDT 2014
Hi, I have some documents in which some strings appear as HTML escape
sequences in one of these forms:
\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e
%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e
And I would like to decode them into readable form:
<SCRIPT LANGUAGE="JavaScript">
I tried something like this, using regular expressions and the uri
module:
import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;

static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
    if (args.length < 2)
    {
        writeln("Usage: unescape file1.htm > file2.htm");
        return -1;
    }
    auto input = cast(Latin1String) read(args[1]);
    string buffer;
    transcode(input, buffer);
    string output;
    foreach (m; matchAll(buffer, re))
        output ~= decode(m.hit);
    writeln(output);
    return 0;
}
Unfortunately it doesn't seem to work 100%.
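Looking at the code above, the loop appends only the decoded matches, so any text between escapes is dropped, and the pattern never matches the \x form. In addition, std.uri.decode leaves escapes that resolve to reserved URI characters (such as %3D for "=") undecoded. A minimal sketch of one possible alternative follows, using replaceAll with a callback so that unmatched text is copied through. The helper name unescapeHex is hypothetical, and the sketch assumes each escape stands for a single Latin-1 character; multi-byte UTF-8 percent sequences would need different handling.

```d
import std.conv : to;
import std.regex : regex, replaceAll;
import std.stdio : writeln;

// Hypothetical helper (not part of the original program): decodes both
// the \xNN and the %NN form and copies all other text through unchanged.
string unescapeHex(string text)
{
    // Group 1 captures the two hex digits after either prefix.
    auto re = regex(`(?:\\x|%)([0-9a-fA-F]{2})`);

    // replaceAll with a callback rewrites each match in place.
    // Assumption: each escape is one Latin-1 character, so its value is
    // used directly as a Unicode code point.
    return replaceAll!(m => to!string(cast(dchar) to!uint(m[1], 16)))(text, re);
}

void main()
{
    writeln(unescapeHex(`\x3C\x53\x43\x52\x49\x50\x54\x3e`)); // <SCRIPT>
    writeln(unescapeHex(`%3Ca href%3D%22x%22%3e`));            // <a href="x">
}
```

Because the callback sees only the two hex digits, the same code path serves both prefixes, and the surrounding document text stays where it was.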
I would appreciate any suggestion.
Regards, Hugo