Need a Faster Compressor
Timon Gehr via Digitalmars-d
digitalmars-d at puremagic.com
Mon May 23 12:32:32 PDT 2016
On 22.05.2016 00:07, Walter Bright wrote:
> On 5/21/2016 2:37 PM, Timon Gehr wrote:
>> Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
>
> I don't understand the terms you use, but as to the "why" it is based on
> what I knew about LZ77 compression. I don't pretend to be an expert on
> compression, and this is what I churned out. As mentioned elsewhere, the
> C++ mangling scheme has a primitive and ineffective LZ77 scheme built
> in, so I wouldn't waste time on that.
>
> A quick google search on finding the longest match in a string did not
> turn up anything obvious.
> ...
E.g. the Knuth-Morris-Pratt (KMP) string search algorithm can be
modified to compute longest match instead of full match (as it just
efficiently builds and runs a finite state machine whose state is the
length of the longest match ending at the current position).
https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Implementation:
auto longestMatch(string x,string y){
    if(y.length>x.length) y=y[0..x.length];
    // KMP failure function for y; f[0]=-1 wraps to size_t.max,
    // so ~j is a cheap test for j!=-1.
    auto f=new size_t[y.length+1];
    for(size_t i=0,j=f[0]=-1;i<y.length;f[++i]=++j)
        while(~j&&y[i]!=y[j]) j=f[j];
    // Scan x, maintaining the length j of the longest suffix of
    // x[0..i+1] that is a prefix of y; keep the leftmost longest match.
    auto r=x[0..0];
    for(size_t i=0,j=0;i<x.length&&j<y.length;++i,++j){
        while(~j&&x[i]!=y[j]) j=f[j];
        if(j+1>r.length) r=x[i-j..i+1]; // j+1 wraps to 0 when no match
    }
    return r;
}
This returns a slice of x representing the leftmost longest match with
y. Running time is O(x.length). (In practice, if this needs to be really
fast, the allocation for 'f' should probably be shared among multiple
calls.)
(This probably only improves running time in case there are sufficiently
many sufficiently long matches.)
But this just improves longest_match. It should be possible to bring
down the running time of the entire compression algorithm significantly
using a suffix tree (the resulting running time bound is linear in the
input size and independent of the window size).
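To make the suffix-structure idea concrete, here is a sketch in Python (D
shown elsewhere in this post is the document's language, but Python keeps the
sketch short). It uses a suffix automaton, a close relative of the suffix
tree, built over the window; it then answers "longest substring of the window
occurring in the input" in time linear in the input and independent of the
window size. All names are illustrative, not taken from any existing
implementation.

```python
# Sketch: a suffix automaton over the window answers longest-match
# queries in O(len(s)) after O(len(window)) construction.
class SuffixAutomaton:
    def __init__(self, window):
        self.next = [{}]   # outgoing transitions of each state
        self.link = [-1]   # suffix links
        self.len = [0]     # length of the longest string in each state
        self.last = 0
        for c in window:
            self._extend(c)

    def _extend(self, c):
        cur = len(self.len)
        self.next.append({})
        self.link.append(-1)
        self.len.append(self.len[self.last] + 1)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.len[p] + 1 == self.len[q]:
                self.link[cur] = q
            else:  # split state q by cloning it
                clone = len(self.len)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.len.append(self.len[p] + 1)
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

def longest_match_len(window, s):
    """Length of the longest substring of `window` occurring in `s`."""
    sam = SuffixAutomaton(window)
    state, length, best = 0, 0, 0
    for c in s:
        # Shorten the current match until it can be extended by c.
        while state and c not in sam.next[state]:
            state = sam.link[state]
            length = sam.len[state]
        if c in sam.next[state]:
            state = sam.next[state][c]
            length += 1
        else:
            state, length = 0, 0
        best = max(best, length)
    return best
```

For example, longest_match_len("banana", "bandana") returns 3 ("ban" or
"ana"). A production compressor would additionally track match positions and
maintain the structure incrementally as the window slides, which the sketch
omits.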
> If you want to throw your hat (i.e. expertise) into the ring and post a
> faster compressor, please do so!
As far as I understand, the use case for this is compressing mangled
names? I don't think that generating huge mangled names explicitly just
to compress them afterwards is a viable strategy. This process throws
away a lot of structural information during mangling that could be
useful for fast and effective compression.
Instead, compression should be performed while generating the mangled
string. As far as I understand, mangled names only grow unmanageably
large because the same symbols are mangled into them multiple times?
Isn't it enough to create references to previously embedded mangled
symbols (i-th symbol already mangled) while generating the mangled string?
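A minimal sketch of that idea, in Python for brevity (a hypothetical scheme
for illustration only, not the actual D mangling; the "Q<i>" back-reference
notation is invented here):

```python
def mangle(symbols):
    """Emit length-prefixed symbols; repeats become back references.

    Hypothetical scheme: the i-th distinct symbol is emitted as
    "<len><name>"; any later repeat of it is emitted as "Q<i>".
    """
    seen = {}   # symbol -> index of its first occurrence
    out = []
    for sym in symbols:
        if sym in seen:
            out.append("Q%d" % seen[sym])       # back reference
        else:
            seen[sym] = len(seen)
            out.append("%d%s" % (len(sym), sym))  # length-prefixed name
    return "".join(out)
```

For example, mangle(["std", "range", "std", "algorithm"]) yields
"3std5rangeQ09algorithm": the repeated symbol costs two characters no matter
how long it is, so the output never blows up from repetition in the first
place.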