How many HOFs in Phobos?

Mon Jan 31 15:08:42 PST 2011

On 1/31/11 3:37 PM, bearophile wrote:
> This is a high-level implementation of Levenshtein distance in Haskell:
>
>
> levenshtein s1 s2 = last $ foldl transform [0 .. length s1] s2
>    where transform ns@(n:ns') c = scanl calc (n+1) $ zip3 s1 ns ns'
>            where calc z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)]
>
> main = print (levenshtein "kitten" "sitting")
>
>
> (I don't explain that Haskell code because below I give a kind of D translation that is probably enough to understand.)
> It prints 3, the code comes from the rosettacode.org site:
> http://rosettacode.org/wiki/Levenshtein_distance#Haskell
>
> Currently in std.algorithm there are higher order functions (HOFs) like "map", "filter", "reduce" etc, reduce is the right fold.
>
> That Haskell code uses a foldl (fold left), and scanl (like foldl, but keeps and returns all intermediate values too). This is a brutal translation to D2:
>
>
> import std.stdio, std.range, std.array, std.algorithm, std.typecons;
>
> /// foldl f z [x1, x2, ..., xn] == (...((z `f` x1) `f` x2) `f`...) `f` xn
> T1 foldl(F, T1, Range)(F f, T1 z, Range xs) {
>      foreach (x; xs)
>          z = f(z, x);
>      return z;
> }
>
> /// scanl f z [x1, x2, ...] == [z, z `f` x1, (z `f` x1) `f` x2, ...]
> T1[] scanl(F, T1, Range)(F f, T1 z, Range xs) {
>      auto result = [z];
>      foreach (x; xs) {
>          z = f(z, x);
>          result ~= z;
>      }
>      return result;
> }
>
> auto levenshtein(string s1, string s2) {
>      int[] transform(int[] ns, char c) {
>          auto n = ns[0];
>          auto ns2 = ns[1..$];
>          int compute(int z, Tuple!(dchar, int, int) t) {
>              return min(t[2]+1, z+1, t[1] + (t[0] != c));
>          }
>          return scanl(&compute, n+1, zip(s1, ns, ns2));
>      }
>      return foldl(&transform, array(iota(cast(int)s1.length+1)), s2)[$-1];
> }
>
> void main() {
>      string add(string x, string y) { return x ~ y; }
>      writeln(foldl(&add, "", ["ab", "cd", "ef"])); // "abcdef"
>      writeln(scanl(&add, "", ["ab", "cd", "ef"])); // ["", "ab", "abcd", "abcdef"]
>
>      writeln(levenshtein("kitten", "sitting")); // 3
> }
>
>
> Haskell has tons of built-in HOFs (monadic ones too, like foldM), but D is less functional than Haskell, and reading highly functional-style code is not easy for many programmers (and D syntax limits make such code noisy). What's the right number of HOFs to put in std.algorithm+std.range? Are (better implementations of) scanl and foldl fit for Phobos? (I am not suggesting to rename "reduce" to "foldr". You may add an "alias reduce foldr;" if you want, but that's all).
>
> Hours ago I have added this suggestion to bugzilla:
> http://d.puremagic.com/issues/show_bug.cgi?id=5510
>
> Bye,
> bearophile

Thanks for this work.

The Haskell implementation doesn't scale. My testbed:

levenshtein s1 s2 = last $ foldl transform [0 .. length s1] s2
   where transform ns@(n:ns') c = scanl calc (n+1) $ zip3 s1 ns ns'
           where calc z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' 
/= c)]

n = 4
s = concat $ replicate n "kitten"
t = concat $ replicate n "sitting"

main = print (levenshtein s t)

I assigned n doubling values: 4 8 16 32 64 ... and measured the running 
times gave me:

4 -> 3ms
8 -> 4ms
16 -> 8ms
32 -> 32ms
64 -> 137ms
128 -> 607ms
256 -> 2.6s
512 -> 15.8s
1024 -> 78s
2048 -> doesn't finish with -O, crashes without -O

I used ghc -O on a top-of-the-line server (8 cores, 64 GB RAM).

Then I ran your implementation on a MacBook Pro with dmd -O -release 
-inline:

4 -> 4ms
8 -> 4ms
16 -> 6ms
32 -> 11ms
64 -> 36ms
128 -> 108ms
256 -> 402ms
512 -> 1.59s
1024 -> 6.36s
2048 -> 25.43s
4096 -> 100.2s

Finally, on the same laptop and with the same options I ran the 
std-provided levenshteinDistance, obtaining:

4 -> 5ms
8 -> 5ms
16 -> 5ms
32 -> 6ms
64 -> 11ms
128 -> 33ms
256 -> 112ms
512 -> 431ms
1024 -> 1.8s
2048 -> 7.9s
4096 -> 37.9s

Note that levenshteinDistance is more general (works with any forward 
ranges, automatically performs correct UTF decoding, accepts custom 
comparison).

The trend of the two D implementations is quadratic: time required is 4x 
upon doubling the size of the input. The trend of the Haskell 
implementation is super-quadratic.

Note that reduce is akin to foldl, not foldr. To answer your question, 
adding functional primitives to Phobos is good but should be done in 
moderation; a staunch functional approach is not always the most 
recommendable.

Andrei