Need a Faster Compressor

Timon Gehr via Digitalmars-d digitalmars-d at puremagic.com
Mon May 23 12:32:32 PDT 2016


On 22.05.2016 00:07, Walter Bright wrote:
> On 5/21/2016 2:37 PM, Timon Gehr wrote:
>> Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?
>
> I don't understand the terms you use, but as to the "why" it is based on
> what I knew about LZ77 compression.  I don't pretend to be an expert on
> compression, and this is what I churned out. As mentioned elsewhere, the
> C++ mangling scheme has a primitive and ineffective LZ77 scheme built
> in, so I wouldn't waste time on that.
>
> A quick google search on finding the longest match in a string did not
> turn up anything obvious.
> ...

E.g. the Knuth-Morris-Pratt (KMP) string search algorithm can be 
modified to compute the longest match instead of a full match: it just 
efficiently builds and runs a finite state machine whose state is the 
length of the longest match ending at the current position.

https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Implementation:

auto longestMatch(string x,string y){
    // a match cannot be longer than x, so cap the window
    if(y.length>x.length) y=y[0..x.length];
    // KMP failure table: f[i] is the length of the longest proper border
    // of y[0..i]; f[0]=-1 is a sentinel (it wraps as size_t, so ~j==0
    // tests for j==-1)
    auto f=new size_t[y.length+1];
    for(size_t i=0,j=f[0]=-1;i<y.length;f[++i]=++j)
        while(~j&&y[i]!=y[j]) j=f[j];
    auto r=x[0..0]; // leftmost longest match found so far
    // run the matching automaton over x; after the inner loop, either
    // j==-1 or y[0..j+1] matches x ending at position i
    for(size_t i=0,j=0;i<x.length&&j<y.length;++i,++j){
        while(~j&&x[i]!=y[j]) j=f[j];
        // match of length j+1 ends at x[i]; for j==-1, j+1 wraps to 0
        if(j+1>r.length) r=x[i-j..i+1];
    }
    return r;
}

This returns a slice of x representing the leftmost longest match with 
y. The running time is O(x.length): y is capped at x.length up front, 
so building the failure table is covered by the same bound. (In 
practice, if this needs to be really fast, the allocation for 'f' 
should probably be shared among multiple calls.)
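
For example (ties are broken in favor of the leftmost match):

unittest{
    assert(longestMatch("banana","anas")=="ana");  // x[1..4]
    assert(longestMatch("abcabd","cabd")=="cabd"); // all of y occurs in x
    assert(longestMatch("xyz","abc")=="");         // no common substring
}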

(This probably only improves the actual running time when there are 
sufficiently many sufficiently long matches.)

But this just improves longest_match. It should be possible to bring 
down the running time of the entire compression algorithm significantly 
by using a suffix tree: the resulting running time bound is linear in 
the input size and independent of the window size.
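
For illustration, here is a rough sketch (all names below are mine, 
just to show the idea) using a suffix automaton, a close relative of 
the suffix tree that is easier to build online: index the window once, 
then stream the rest of the input through the automaton to get, at 
every position, the length of the longest match against the window, in 
overall linear time.

import std.stdio;

// Suffix automaton of the window; recognizes exactly its substrings.
struct SuffixAutomaton{
    static struct State{
        size_t len;        // length of the longest substring in this state
        ptrdiff_t link=-1; // suffix link
        size_t[char] next; // transitions
    }
    State[] st;
    size_t last; // state for the whole window read so far

    this(string w){
        st~=State.init; // root = empty string
        foreach(c;w) extend(c);
    }
    void extend(char c){
        auto cur=st.length;
        st~=State(st[last].len+1);
        ptrdiff_t p=last;
        while(p!=-1&&c !in st[p].next){ st[p].next[c]=cur; p=st[p].link; }
        if(p==-1) st[cur].link=0;
        else{
            auto q=st[p].next[c];
            if(st[p].len+1==st[q].len) st[cur].link=q;
            else{ // split: clone q with the shorter length
                auto clone=st.length;
                st~=State(st[p].len+1,st[q].link,st[q].next.dup);
                while(p!=-1&&st[p].next[c]==q){ st[p].next[c]=clone; p=st[p].link; }
                st[q].link=clone;
                st[cur].link=clone;
            }
        }
        last=cur;
    }
}

// For each position i of t, the length of the longest substring of the
// window ending at t[i]. O(t.length) after the automaton is built.
size_t[] matchLengths(ref SuffixAutomaton sa,string t){
    auto res=new size_t[t.length];
    size_t v=0,l=0;
    foreach(i,c;t){
        while(v&&c !in sa.st[v].next){ v=sa.st[v].link; l=sa.st[v].len; }
        if(auto nxt=c in sa.st[v].next){ v=*nxt; ++l; }
        else l=0; // at the root with no transition: no match ends here
        res[i]=l;
    }
    return res;
}

void main(){
    auto sa=SuffixAutomaton("banana");
    writeln(matchLengths(sa,"bandana")); // [1, 2, 3, 0, 1, 2, 3]
}

Keeping such an index up to date for a true sliding window takes more 
care, but for name mangling the "window" can simply be everything 
emitted so far.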


> If you want to throw your hat (i.e. expertise) into the ring and post a
> faster compressor, please do so!

As far as I understand, the use case for this is compressing mangled 
names? I don't think that generating huge mangled names explicitly just 
to compress them afterwards is a viable strategy. Mangling first and 
compressing second throws away a lot of structural information that 
could be useful for fast and effective compression.

Instead, compression should be performed while generating the mangled 
string. As far as I understand, mangled names only grow unmanageably 
large because the same symbols are mangled into them multiple times? 
Isn't it enough to create references to previously embedded mangled 
symbols (i-th symbol already mangled) while generating the mangled string?
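
Something along these lines might already be enough (a toy sketch; the 
"Q<i>" back-reference token and the length-prefixed encoding are made 
up for illustration and are not D's actual mangling):

import std.array,std.conv,std.stdio;

// Emit each distinct symbol once, length-prefixed; later occurrences
// become a back-reference "Q<i>" to the i-th symbol emitted so far.
string mangleAll(string[] symbols){
    size_t[string] seen; // symbol -> index of its first occurrence
    auto sink=appender!string;
    foreach(s;symbols){
        if(auto p=s in seen){
            sink~="Q";
            sink~=(*p).to!string;
        }else{
            auto idx=seen.length;
            seen[s]=idx;
            sink~=s.length.to!string;
            sink~=s;
        }
    }
    return sink.data;
}

void main(){
    writeln(mangleAll(["foo","bar","foo","foo"])); // 3foo3barQ0Q0
}

The demangler would keep the same table and substitute the i-th symbol 
back in, so neither side ever has to search for matches.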

