Generalized Linear Models and Stochastic Gradient Descent in D
data pulverizer via Digitalmars-d-announce
digitalmars-d-announce at puremagic.com
Sun Jun 11 03:21:03 PDT 2017
It is obvious that you took time and care to review the article.
Thank you very much!
On Sunday, 11 June 2017 at 00:40:23 UTC, Nicholas Wilson wrote:
>
> Maybe its the default rendering but the open math font is hard
> to read as the sub scripts get vertically compressed.
>
> My suggestions:
>
> Distinguish between the likelihood functions for gamma and
> normal rather than calling them both L(x). Maybe L subscript
> uppercase gamma and L subscript N?
>
Good idea!
> Links to wikipedia for the technical terms (e.g. dispersion,
> chi squared, curvature), again the vertical compression of the
> math font does not help here (subscripts of fractions) . It
> will expand your audience if they don't get lost in the
> introduction!
Yes, I should definitely add clarifying references. I should
probably also note that the curvature is the Hessian, though I
recently developed a dislike for naming mathematical constructs
after people or giving them odd names. I was seriously thinking
about calling Newton-Raphson something else, but that might be
taking it too far.
I'll end up writing the final in html so I can add a decent html
latex package to modify the size of the equations.
> Speaking of not losing your audience: give a link to the NRA
> and/or a brief explanation of how it generalises to higher
> dimensions (graph or animation for the 2D case would be good,
> perhaps take something from wikipedia)
NRA? Don't understand that acronym with reference to the article.
I shall mention the generalisation of the equations over multiple
observations.
I agree that there is a danger of losing the audience, and
perhaps some graphics would be nice.
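For reference, the generalised (multivariate) Newton-Raphson update I'll describe takes the following form, where beta is the parameter vector, nabla-ell the gradient of the log-likelihood, and H the Hessian (the curvature discussed above); the symbols here are my own shorthand rather than the article's:

```latex
\beta_{t+1} = \beta_t - H(\beta_t)^{-1}\, \nabla \ell(\beta_t)
```

In one dimension this reduces to the familiar x - f'(x)/f''(x) step for maximising ell.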
> I dont think it is necessary to show the signature of the BLAS
> and Lapacke function, just a short description and link should
> suffice. also any reason you don't use GLAS?
True
> I would just have gParamCalcs as its own function (unless you
> are trying to show off that particular feature of D).
I think it's easier to use a mixin here since the same code is
required in several places: mu and k are used in gLogLik,
gGradient and gCurvature, and xB is also used in gLogLik. Showing
off the use of mixins is also a plus.
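A minimal sketch of the pattern, assuming names like gParamCalcs, xB, mu and k from the article; the expressions inside the mixin are placeholders to show the mechanism rather than the article's actual formulas:

```d
import std.math : exp;
import std.numeric : dotProduct;
import std.stdio : writeln;

// Shared intermediate calculations, injected into each likelihood
// function body via string mixin. The bodies here are placeholders.
enum gParamCalcs = q{
    auto xB = dotProduct(pars, x); // linear predictor
    auto mu = exp(xB);             // mean via the log link
    auto k  = 1.0;                 // dispersion placeholder
};

double gLogLik(double[] pars, double[] x)
{
    mixin(gParamCalcs); // xB, mu and k are now in scope
    return xB - mu;     // placeholder log-likelihood term
}

void main()
{
    writeln(gLogLik([0.1, 0.2], [1.0, 2.0]));
}
```

The same `mixin(gParamCalcs);` line would then appear in gGradient and gCurvature, so the shared calculations are written once.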
> omit the parentheses of .array() and .reduce()
Yes
> You use .array a lot: how much of that is necessary? I dont
> think it is in zip(k.repeat().take(n).array(), x, y, mu)
Yes, I should remove the points where .array() is not necessary
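For instance, zip consumes lazy ranges directly, so the repeat().take() chain needs no materialisation; a small self-contained sketch:

```d
import std.range : zip, repeat, take;
import std.algorithm : map;
import std.stdio : writeln;

void main()
{
    double k = 2.0;
    auto x = [1.0, 2.0, 3.0];
    auto n = x.length;

    // No intermediate .array needed: repeat().take(n) is already a
    // lazy range, and zip accepts lazy ranges as inputs.
    auto scaled = zip(k.repeat.take(n), x)
                  .map!(a => a[0] * a[1]);
    writeln(scaled); // materialise only at the point of use, if at all
}
```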
> `return(curv);` should be `return curve;`
Thanks, that's my R bleeding into my D! It should be:
return curv;
> Any reason you don't square the tolerance rather than sqrt the
> parsDiff?
The calculation is the L2 norm, which ends in a sqrt; it is later
used for the stopping criterion, as in the equation.
> for(int i = 0; i < nepochs; ++i) => foreach(i; iota(epochs))?
hmm potato
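Either spelling works; a range foreach also avoids iota entirely. A quick side-by-side:

```d
void main()
{
    int nepochs = 5;

    // C-style loop, as in the article:
    for (int i = 0; i < nepochs; ++i) { /* one epoch */ }

    // Idiomatic D alternative over a number range:
    foreach (i; 0 .. nepochs) { /* one epoch */ }
}
```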
> zip(pars, x).map!(a => a[0]*a[1]).reduce!((a, b) => a + b);
> =>dot(pars,x)?
Fair point. When I started writing the article I considered
attempting to write the whole thing in D functional style, with
no external libraries. In the end I didn't want to write a matrix
inverse in functional style, so I rolled it back somewhat and
started adding C calls, which is more sensible.
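std.numeric.dotProduct expresses the same reduction directly; a sketch with made-up values for pars and x:

```d
import std.numeric : dotProduct;
import std.range : zip;
import std.algorithm : map, reduce;
import std.stdio : writeln;

void main()
{
    auto pars = [0.5, 1.5, 2.0];
    auto x    = [1.0, 2.0, 3.0];

    // Functional-style dot product, as in the article ...
    auto xB1 = zip(pars, x).map!(a => a[0] * a[1])
                           .reduce!((a, b) => a + b);

    // ... versus the library routine.
    auto xB2 = dotProduct(pars, x);

    writeln(xB1, " ", xB2); // both print 9.5
}
```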
> Theres a lot of code and text, some images and graphs would be
> nice, particularly in combination with a more real world
> example use case.
I would agree that the article needs to be less austere. However,
the article is about the GLM algorithm rather than its uses. I
think the analyst should know whether they need a GLM or not -
there are many sources that explain applications of GLMs, and I
could perhaps reference some.
> Factor out code like a[2].repeat().take(a[1].length) to a
> function, perhaps use some more BLAS routines for things like
>
> .map!( a =>
> zip(a[0].repeat().take(a[1].length),
> a[1],
> a[2].repeat().take(a[1].length),
> a[3].repeat().take(a[1].length))
> .map!(a => -a[2]*(a[0]/a[3])*a[1])
> .array())
> .array();
>
> to make it more obvious what the calculation is doing.
Yes
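Something like a small helper does make the intent clearer. Here fillLike is a hypothetical name, and the scalars stand in for the article's coefficients:

```d
import std.range : repeat, take, zip;
import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper: broadcast a scalar to the length of a
// reference array, replacing a.repeat().take(b.length) noise.
auto fillLike(T, R)(T value, R reference)
{
    return value.repeat.take(reference.length);
}

void main()
{
    double w = -2.0, phi = 4.0;
    auto x = [1.0, 2.0, 3.0];

    // The nested zip/map from the review, rewritten with the helper:
    auto result = zip(w.fillLike(x), x, phi.fillLike(x))
                  .map!(a => -a[0] * a[1] / a[2])
                  .array;
    writeln(result); // [0.5, 1, 1.5]
}
```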
> It might not be the point of the article but it would be good
> to show some performance figures, I'm sure optimisation tips
> will be forthcoming.
Since I am using whatever CBLAS implementation is installed, I'm
not sure that benchmarks would really mean much. Ilya raised a
good point about the amount of copying I am doing; as I was
writing it I thought so too. I address this below.
Thanks again for taking time to review the article!
My main takeaway from writing this article is that it would be
quite straightforward to write a small GLM package in D. I'd use
quite a different approach, with structs/classes for GLM objects,
to remove the copying issues and to give a consistent interface
to the user.
An additional takeaway for me was that the use of array
operations like
a[] = b[]*c[]
or
d[] -= e[] - f
created odd effects in my calculations: the outputs were wrong,
and for ages I didn't know why. I eventually removed those
expressions from the code altogether, which remedied the problem.
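For what it's worth, D's array operations do behave predictably when the destination slice is preallocated to the operand length and does not overlap the operands; a sketch of the correct usage, with made-up values:

```d
import std.stdio : writeln;

void main()
{
    double[] b = [1.0, 2.0, 3.0];
    double[] c = [4.0, 5.0, 6.0];

    // The destination must already exist with the right length:
    // `a[] = b[] * c[]` writes elementwise and does not allocate.
    auto a = new double[](b.length);
    a[] = b[] * c[];
    writeln(a); // [4, 10, 18]

    // In-place update mixing a slice and a scalar:
    double f = 1.0;
    double[] d = [10.0, 20.0, 30.0];
    double[] e = [1.0, 2.0, 3.0];
    d[] -= e[] - f; // elementwise: d[i] -= (e[i] - f)
    writeln(d); // [10, 19, 28]
}
```

Overlapping source and destination slices in these expressions is an error in D, which can produce exactly the kind of silently wrong outputs described above.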