dmd 1.060 and 2.045 release

Mon May 10 04:50:03 PDT 2010

On Sun, 09 May 2010 02:11:21 -0400, Lionello Lunesu  
<lio at lunesu.remove.com> wrote:

> I'm in the middle of moving from one city to another so don't wait for
> me. I have attached the D version of the code in the wikipedia article
> (including the patch for transpositions.)
>
> It's not straightforward to drop this in (apart from it being in D),
> since speller.c creates all variations on a string (=inefficient) and
> uses a callback function to check if a variation is a valid symbol.
>
> I'll have more time to look at this next week.

Several others have privately brought up this problem to Walter.  He does  
not want to change how the symbol lookup tables work, and there is no way  
to iterate them.  Therefore, without a way to iterate current symbols,  
this is the only way the algo can be written.

However, according to reports on the latest beta, he's sped up the lookup  
times for symbols significantly enough to perceptually reduce the problem.

Because of the nature of the algorithm, the longer the invalid symbol, the  
slower the algorithm.  It would be interesting to see a comparison between  
the current beta code and code that does a full iteration with very long  
symbols.  I don't know if anyone wants to look at modifying the symbol  
lookup data structures to allow iteration, it may be perceptually  
insignificant, and unimportant for most developers.

A quick test on a long symbol name:

void  
s023456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567891()  
{}
void main()
{
      s123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890();
}

And on the various compilers:

2.043: 0.052s, does not suggest
2.045: 5.4s, suggests the alternative
beta: 0.8s, does not suggest

2.043 only does spell checking with a Levenshtein distance of 1, 2.045  
does 2, but is extremely slow.  The beta does a distance of 2, but only if  
the errors are close together (Walter added this as an optimization to  
remove one factor from the runtime complexity).

So clearly, there is still room for improvement, I think with a proper  
symbol iteration, we could get the time down to be as quick as 2.043 or  
faster *and* provide the ability to check for a complete LD of 2 where the  
errors are not close together.  We might be able to even push the LD to 3  
or 4.

I've thought about the optimization of errors close together being  
checked, and I think the counter case is the case which takes the longest  
-- a long symbol.  Typically people use camelCase to denote symbols, what  
if two of the words in the symbol were misspelled by one character (or a  
capitalization was forgotten)?

It may not be an issue, the spell checker is simply a nice hint, but isn't  
essential to determine errors.

-Steve