dmd 1.060 and 2.045 release
Steven Schveighoffer
schveiguy at yahoo.com
Mon May 10 04:50:03 PDT 2010
On Sun, 09 May 2010 02:11:21 -0400, Lionello Lunesu
<lio at lunesu.remove.com> wrote:
> I'm in the middle of moving from one city to another so don't wait for
> me. I have attached the D version of the code in the wikipedia article
> (including the patch for transpositions.)
>
> It's not straightforward to drop this in (apart from it being in D),
> since speller.c creates all variations on a string (=inefficient) and
> uses a callback function to check if a variation is a valid symbol.
>
> I'll have more time to look at this next week.
Several others have privately brought up this problem to Walter. He does
not want to change how the symbol lookup tables work, and there is no way
to iterate them. Therefore, without a way to iterate current symbols,
this is the only way the algo can be written.
However, according to reports on the latest beta, he's sped up the lookup
times for symbols significantly enough to perceptually reduce the problem.
Because of the nature of the algorithm, the longer the invalid symbol, the
slower the algorithm. It would be interesting to see a comparison between
the current beta code and code that does a full iteration with very long
symbols. I don't know if anyone wants to look at modifying the symbol
lookup data structures to allow iteration, it may be perceptually
insignificant, and unimportant for most developers.
A quick test on a long symbol name:
void
s023456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567891()
{}
void main()
{
s123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890();
}
And on the various compilers:
2.043: 0.052s, does not suggest
2.045: 5.4s, suggests the alternative
beta: 0.8s, does not suggest
2.043 only does spell checking with a Levenshtein distance of 1, 2.045
does 2, but is extremely slow. The beta does a distance of 2, but only if
the errors are close together (Walter added this as an optimization to
remove one factor from the runtime complexity).
So clearly, there is still room for improvement, I think with a proper
symbol iteration, we could get the time down to be as quick as 2.043 or
faster *and* provide the ability to check for a complete LD of 2 where the
errors are not close together. We might be able to even push the LD to 3
or 4.
I've thought about the optimization of errors close together being
checked, and I think the counter case is the case which takes the longest
-- a long symbol. Typically people use camelCase to denote symbols, what
if two of the words in the symbol were misspelled by one character (or a
capitalization was forgotten)?
It may not be an issue, the spell checker is simply a nice hint, but isn't
essential to determine errors.
-Steve
More information about the Digitalmars-d-announce
mailing list