
Modified cholesky #153

Closed
timholy opened this issue Dec 1, 2015 · 20 comments
@timholy
Contributor

timholy commented Dec 1, 2015

I've written a pure-Julia implementation of a modified Cholesky factorization (GMW81), which can be used in Newton-like methods to guarantee descent. (It's worth noting that a plain Newton step -H\g might not be a descent direction if H is not positive-definite.) For those who don't know, this computes an LDLT factorization of a matrix H+E, where E is a diagonal matrix that is "as small as possible" yet makes H+E positive-definite. When H is already positive-definite, E is typically zero.
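For anyone unfamiliar with the idea, here is a rough sketch in Python (illustrative only, not the GMW81 code from this issue; `modified_ldlt` is a hypothetical name, and the pivot-boosting rule is simplified: real GMW81 also controls off-diagonal growth in the factor):

```python
# Rough sketch of a modified LDL^T: compute unit lower-triangular L,
# diagonal d, and diagonal E such that L * diag(d) * L' = H + diag(E).
# E stays zero whenever every pivot is already comfortably positive.
def modified_ldlt(H, delta=1e-8):
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    E = [0.0] * n
    for j in range(n):
        dj = H[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        if dj < delta:                      # pivot negative or too small: boost it
            E[j] = max(abs(dj), delta) - dj
            dj += E[j]
        d[j] = dj
        L[j][j] = 1.0
        for i in range(j + 1, n):
            s = H[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = s / dj
    return L, d, E
```

On a positive-definite input E comes out identically zero; on an indefinite one, only the offending pivots are boosted.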

In testing (matrix size 400x400) it's about 4-5x slower than cholfact, which is pretty good considering I'm going up against multithreaded BLAS. It's also 4-5x faster than eigfact, which presents an alternative way to carry out this operation. The only optimizations I've done so far are adding @inbounds and @simd in a couple of places, so it's quite possible one could do even better.

I'd be happy to share it. Do we have a good package in which to stash such things? CC @jiahao and @ViralBShah, who do not watch this repo but whose work on IterativeSolvers is at least conceptually related.

@mlubin
Contributor

mlubin commented Dec 1, 2015

CC @poulson

@poulson

poulson commented Dec 1, 2015

Dynamic regularization can be dangerous: http://web.stanford.edu/~poulson/slides/LBNL15.pdf

@andreasnoack
Contributor

I've a blocked Cholesky in the package that is faster than LAPACK for the 400x400 problem. We could try to combine my version with yours to get a regularized blocked LDLt.

@poulson

poulson commented Dec 1, 2015

The challenge will be replacing the zherk calls, as there is no analogous BLAS routine for LDL.

@pkofod
Member

pkofod commented Dec 1, 2015

It would be wonderful if this could serve as a theoretically well-argued alternative to my "dead" #123 :)

@timholy
Contributor Author

timholy commented Dec 1, 2015

I'll submit to your package, @andreasnoack.

@poulson, thanks for the link. I have noticed that one does occasionally get bad condition numbers. On brief perusal I couldn't quite tell what the recommended solution is. In

Fang, Haw-ren, and Dianne P. O'Leary. "Modified Cholesky algorithms: a catalog with new approaches." Mathematical Programming 115.2 (2008): 319-349.

there are some approaches that give much better condition numbers (e.g., the GMW-I algorithm, see Fig. 1). I haven't implemented those techniques (yet), though.

@poulson

poulson commented Dec 2, 2015

The trick is to use a constant diagonal shift and use the factorization as a preconditioner if necessary. To use a constant diagonal shift, factor, compute the dominant entry of E, then shift the original matrix by it and refactor. Jennifer Scott said that HSL does something similar.

@timholy
Contributor Author

timholy commented Dec 2, 2015

Not sure I follow entirely. By "factor", which factorization? Are you basically saying I should go ahead and use the modified cholesky factorization, but that after computing E I (1) throw the LDLT factorization away, (2) add abs(minimum(E))*I to the matrix, and (3) re-factor by standard Cholesky? If you shift by the most negative eigenvalue, won't you have a zero eigenvalue and hence a large (infinite) condition number?

More philosophically, I do have conceptual reservations about any strategy that uses constant diagonal shifts. These can be illustrated by attaching physical units to the entries in the parameter vector. For example, if the first parameter is measured in meters and the second in seconds, the units of your step are [m,s], the units of your gradient are [1/m,1/s] (assuming your objective function is dimensionless), and the units of H are [1/m^2 1/(m s); 1/(m s) 1/s^2]. Note that s = -H\g is perfectly sensible even when you add these units. But in this situation, H+λI is nonsensical.

Of course, it's also true that for the purposes of computation, all parameters are rescaled into (dimensionless) floating-point numbers, and roundoff does introduce some absolute scale. At the end of the day, it seems inevitable that one has to introduce a quantity which truncates at this scale. But it seems attractive not to have it affect any more parameters than necessary (if that's even possible).

@poulson

poulson commented Dec 2, 2015

@timholy Are you not first equilibrating your inputs before running your optimization procedure? This is often absolutely crucial for solving the netlib LP problems. If you are, all of the SI issues are no longer relevant, as the equilibration would undo unit changes.

And I would suggest using a shift of the form || E ||_max + || A ||_2 eps^alpha, with alpha somewhere in [0.5,1], as this would bound the condition number of the shifted matrix by eps^(-alpha).
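In code (a hedged sketch: `suggested_shift` is a hypothetical name, `ANorm` and `EMax` are assumed to be already computed, and the example eigenvalues are made up):

```python
# The suggested shift: || E ||_max + || A ||_2 * eps^alpha. When the most
# negative eigenvalue is exactly -EMax, the shifted minimum eigenvalue is
# ANorm * eps^alpha > 0, which is what bounds the condition number.
eps = 2.0 ** -52                  # double-precision machine epsilon

def suggested_shift(EMax, ANorm, alpha=0.5):
    return EMax + ANorm * eps ** alpha

# Diagonal example: eigenvalues (-1, 2), so EMax = 1 cancels the negative one
evals = [-1.0, 2.0]
shift = suggested_shift(1.0, 2.0)
shifted = [ev + shift for ev in evals]   # all strictly positive now
```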

@timholy
Contributor Author

timholy commented Dec 2, 2015

Apologies in advance for a long post, but hopefully there are some interesting ideas here (someone might get a publication out of this 😄).

Rescaling is useful, but beside my real point. By mentally attaching units to things, you start thinking about problems more clearly---it tells you that certain operations don't make mathematical sense except in the artificial world in which everything gets translated into floating-point numbers. You might then start to worry that the answers you'll get are strongly dependent on your choice for how that translation occurs, and this might encourage you to come up with better alternatives. In the context of optimization, I ran into a nice example of this in implementing the Hager and Zhang linesearch (hz_linesearch.jl). In reading HZ2005, I remember noting that their formula for updating β didn't make sense from a units perspective, and this bothered me until I read HZ2012 in which they introduced a new update procedure (Eq. 2.4) that did make units-sense. Unsurprisingly, they found empirically that the new procedure also performed better.

The more I think about it, the more I think the direction taken in most of the literature here is simply wrong. Let's take a 1d example: suppose you're trying to minimize, but locally your function is concave. At your current search point, let g be the gradient (since it's 1d, let's make this a number, not a vector), and let h be the second derivative. What should your step size be? If h were positive, your step would be -g/h. Applying your formula above, ||E||_max = ||A||_2 = |h|, so we'd get -g/(|h| * eps^alpha). I bet you'd agree this leads to a terrible choice for the next search point.
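To make the 1d example concrete (illustrative Python; g = 1 and h = -1 are stand-in values, so ||E||_max = ||A||_2 = |h| = 1):

```python
# 1d comparison: the shifted-Newton step -g/(|h| * eps^alpha) versus the
# sign-flipped step -g/|h| for a locally concave function.
eps = 2.0 ** -52
alpha = 0.5
g, h = 1.0, -1.0
shifted_h = h + (abs(h) + abs(h) * eps ** alpha)   # = |h| * eps^alpha
step_shifted = -g / shifted_h                      # enormous: roughly -1/eps^alpha
step_flipped = -g / abs(h)                         # modest: -g/|h| = -1.0
```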

How would we do better? I'd argue that the right step is -g/|h|. Note that |g/h| estimates the distance to the maximum---it is a quantity that has the right units (length) which tells you something about the scale over which you hope you understand the shape of your objective function. From this you have no idea where the "turning point" occurs, because this isn't captured by a locally-quadratic fit, but it does give you a sense of scale in the units in which your parameters are measured---indeed, it's the only quantity at your disposal that has these units. To test this idea, let's imagine you use this choice on a function that's actually concave quadratic: of course this function has no minimum, so it's "broken," but it's informative as a way of learning the implications for your search strategy. It's trivial to show that this choice leads you to double your distance from the maximum on each iteration, sort of an "inverse binary search" strategy. In the absence of additional information, I think it's hard to come up with a better strategy than that.
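The "inverse binary search" behavior is easy to verify on a concrete concave quadratic (illustrative Python; f(x) = -x^2/2, with its maximum at 0):

```python
# On f(x) = -x^2/2 we have g = -x and h = -1, so the step -g/|h| = x
# doubles the distance from the maximum on each iteration.
x = 0.5
trajectory = [x]
for _ in range(3):
    g, h = -x, -1.0
    x += -g / abs(h)
    trajectory.append(x)
# trajectory == [0.5, 1.0, 2.0, 4.0]
```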

Generalizing this to multiple dimensions, it suggests the "right" answer is the following:

  • Perform an eigendecomposition of H, H = V*D*V'.
  • For the purposes of optimization, define the "Newton-inverse" ninv(H) = V*Dinv*V', where Dinv is some diagonal matrix. Our only freedom is to choose the values of this diagonal. Of course, for the purposes of finding a descent direction, they should be nonnegative.
  • For eigenvalues that are "reliable," take their absolute value. Let's trust any eigenvalue with absolute value > δ |D|_max, where δ is something expressing numerical precision (e.g., sqrt(eps)). (Such eigenvalues should be reasonably uncontaminated by roundoff error.) For these eigenvalues, Dinv is the obvious 1./abs(D).
  • For the rest, define their Dinv value as δ/|D|_max. Note these are small values, not "near-infinite" ones.

The last choice is very similar to the typical usage of the SVD for inversion, in which one drops the tiny singular values. Except rather than equating 1/0 with 0, here we're equating 1/0 with a small positive number. What I like about this procedure is that:

  • it has no impact on "reliable" dimensions, except to make them positive-definite as needed. Note that taking the absolute value does not destroy "units" information, so this makes me happy 😄
  • while you destroy "units" information in the unreliable dimensions, you know from floating-point roundoff that you can't trust these. So you might as well search in the direction of the gradient, a bit, and see whether your next evaluation point provides you with more information.

The only downside I see---and it's a big one---is that the eigendecomposition is extraordinarily expensive. It would be great to find a Cholesky-like analog. Want to see if we can come up with something?

@poulson

poulson commented Dec 2, 2015

If you are unhappy with the shift function f(eps,ANorm,EMax,alpha) = EMax + ANorm*eps^alpha, why not let this be the default but let the user override it?

I am a bit surprised that you are suggesting to take the absolute value of potentially very negative eigenvalues, as this would be a very large perturbation.

If you instead preserved their sign (though this would not yield an SPD result), your formula would be essentially a small perturbation of a (numerical) pseudoinverse, whose application is roughly equivalent to Tikhonov regularization with a small damping parameter. There are algorithms (especially from Saunders) that solve regularized least squares using sparse Cholesky-like factorizations (see http://web.stanford.edu/group/SOL/papers/seattleproc.pdf). Perhaps you could take the absolute value of the D in the LDL factorization of such a formulation to achieve an analogue of your proposed technique.

Also, please excuse the major edit.

@timholy
Contributor Author

timholy commented Dec 2, 2015

When the matrix is already (nearly) positive-definite, what I proposed is indeed the pseudoinverse. But when it's far from positive-definite, it's nowhere close to the pseudoinverse/Tikhonov regularization. That's kinda the point---what I'm saying is that you don't actually want to solve the KKT system, or a minimal perturbation, when you have large negative eigenvalues. Instead, you want to solve a large perturbation 😄.

For the purposes of making discussion/visualization easy, let's suppose your Hessian is actually diagonal, so the eigenvectors are just the coordinate axes. Let's also suppose that some of the entries are large in magnitude (way, way beyond roundoff error) but negative. Then the minimally-stable inverse, by the "usual" approach (Tikhonov-like regularization), is inv(H + λ*I), where λ is eps bigger than abs(minimum(diag(H))).

The deep problem with this approach is that it's conflating something you know with great confidence (eigenvalues of very large magnitude, which just happen to be negative) with roundoff error. Since you actually know the value of those large, negative eigenvalues, why wouldn't you want to use them? Certainly, you don't want to go zooming off towards infinity along the corresponding axes, because you know, with great confidence, that your objective function has curvature along those directions (it just happens to be "upside down"). The only directions you might consider taking really big steps along are those for which the corresponding eigenvalue is truly small in magnitude, because you know that things won't change very much unless you take big steps along those axes.

However, if you do Tikhonov-like regularization, what originally were small-magnitude eigenvalues are no longer near 0, so you take a step that is much too small along those directions. In other words, Tikhonov regularization only gets the right answer for the directions with largest positive eigenvalues; all others are essentially "inverted" from what they should be (small steps along directions that need a big step, and big steps along directions that need small steps). Bingo, you've got very slow convergence.

To say it differently, to seek a factorization of H+E, choosing an E "just barely big enough to make their sum positive-definite", sounds good at first, and indeed the literature on the topic has developed good methods to achieve this goal. But the point of the analysis above is that this goal turns out to be misguided: whenever norm(H) is dominated by a large negative eigenvalue, you actually want norm(E) ≈ 2*norm(H), which is definitely not "as small as possible."
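With a diagonal H the claim is immediate (illustrative numbers, chosen so the negative eigenvalue dominates):

```python
# Diagonal H dominated by a large negative eigenvalue: flipping its sign
# requires an E with norm(E) = 2 * norm(H), far from "as small as possible".
H_diag = [-5.0, 0.1, 2.0]
E_diag = [abs(h) - h for h in H_diag]      # H + E = |H| elementwise
norm_H = max(abs(h) for h in H_diag)       # spectral norm of diag(H): 5.0
norm_E = max(abs(e) for e in E_diag)       # 10.0, i.e. 2 * norm_H
```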

@poulson

poulson commented Dec 2, 2015

I think you responded to my original comment rather than the edit.

@timholy
Contributor Author

timholy commented Dec 2, 2015

Yes, sorry about that---I must have written it while disconnected from the internet.

I am a bit surprised that you are suggesting to take the absolute value of potentially very negative eigenvalues, as this would be a very large perturbation.

This is the crux of the issue. I am indeed arguing that this is the right thing to do. And yes, I'm aware this is unconventional.

To be very explicit:

# Objective that, near [0,0], has one negative eigenvalue (dimension 1)
# and one positive eigenvalue (dimension 2)
objective(x) = (-x[1]^2 + x[1]^6)  +  (x[2]-3)^2
grad(x) = [-2x[1] + 6*x[1]^5, 2*(x[2]-3)]
hess(x) = [-2 + 30*x[1]^4 0;
            0             2]

function tikhonov_step(g, H, δ)
    D, V = eig(H)
    Dmin = minimum(D)
    if Dmin < δ
        H = H + (δ - Dmin)*I
    end
    -H\g
end

function holy_step(g, H, δ)
    D, V = eig(H)
    Dabs = abs(D)
    Dmax = maximum(Dabs)
    if Dmax == 0
        return -g
    end
    reliable = Dabs .> δ*Dmax
    Dinv = similar(D)
    Dinv[reliable] = 1./Dabs[reliable]
    Dinv[!reliable] = 1/Dmax
    -(V*(Dinv .* (V'*g)))
end

xlist = Any[[-0.001,-0.001],[-0.01,-0.01],[-0.1,-0.1],[-1.0,-1.0]]
δ = sqrt(eps())
println("Tikhonov")
for x in xlist
    g = grad(x)
    H = hess(x)
    xstep = tikhonov_step(g, H, δ)
    @show x xstep objective(x) objective(x+xstep)
end

println("\nHoly")
for x in xlist
    g = grad(x)
    H = hess(x)
    xstep = holy_step(g, H, δ)
    @show x xstep objective(x) objective(x+xstep)
end

Results:

Tikhonov
x = [-0.001,-0.001]
xstep = [-134217.72799959735,1.5004999944214557]
objective(x) = 9.006
objective(x + xstep) = 5.846006810555245e30
x = [-0.01,-0.01]
xstep = [-1.3421772397346816e6,1.5050001072684456]
objective(x) = 9.060000000000999
objective(x + xstep) = 5.846005758379335e36
x = [-0.1,-0.1]
xstep = [-1.341774626816e7,1.551163366746526]
objective(x) = 9.600001
objective(x + xstep) = 5.83549188743263e42
x = [-1.0,-1.0]
xstep = [0.14285714285714285,4.0]
objective(x) = 16.0
objective(x + xstep) = -0.33812442094705436

Holy
x = [-0.001,-0.001]
xstep = [-0.001000000000012,3.001]
objective(x) = 9.006
objective(x + xstep) = -3.999999999984e-6
x = [-0.01,-0.01]
xstep = [-0.01000000120000018,3.01]
objective(x) = 9.060000000000999
objective(x + xstep) = -0.00039999998399998555
x = [-0.1,-0.1]
xstep = [-0.10012018027040562,3.1]
objective(x) = 9.600001
objective(x + xstep) = -0.03998385545842348
x = [-1.0,-1.0]
xstep = [0.14285714285714285,4.0]
objective(x) = 16.0
objective(x + xstep) = -0.33812442094705436

With Tikhonov, any time you start somewhere in the "stripe" between -0.6 and 0.6 along dimension 1, you'll spend a lot of time backtracking. With mine, every step reduces the objective function. When used iteratively,

  • along dimension 2, it jumps straight to the minimum
  • along dimension 1, each iteration (after the first) should nearly double its step size until the Hessian starts to become positive again, progressing smoothly towards the minimum at [1/3^(1/4), 3].

For mine, the trickiest spot occurs near where the 2nd derivative vanishes along dimension 1:

julia> x = [(1/15)^(1/4), 0]
2-element Array{Float64,1}:
 0.508133
 0.0     

julia> H = hess(x)
2x2 Array{Float64,2}:
 0.0  0.0
 0.0  2.0

julia> g = grad(x)
2-element Array{Float64,1}:
 -0.813012
 -6.0     

julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
 0.406506
 3.0     

julia> x = [(1/15)^(1/4)+0.0001, 0]
2-element Array{Float64,1}:
 0.508233
 0.0     

julia> H = hess(x)
2x2 Array{Float64,2}:
 0.00157486  0.0
 0.0         2.0

julia> g = grad(x)
2-element Array{Float64,1}:
 -0.813012
 -6.0     

julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
 516.245
   3.0  

julia> x = [(1/15)^(1/4)+0.01, 0]
2-element Array{Float64,1}:
 0.518133
 0.0     

julia> H = hess(x)
2x2 Array{Float64,2}:
 0.162148  0.0
 0.0       2.0

julia> g = grad(x)
2-element Array{Float64,1}:
 -0.81221
 -6.0    

julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
 5.00906
 3.0    

You have to do some backtracking, but it's not to the tune of 1/δ.

Compare Tikhonov at the same point:

julia> x = [(1/15)^(1/4), 0]

# omitted

julia> tikhonov_step(g, H, δ)
2-element Array{Float64,1}:
 5.45603e7
 3.0      

which seems much worse.

Also note that my algorithm is much more stable: suppose the function has many local minima, and the user (by dint of extraordinary insight) has positioned the starting point in the basin of the global minimum, however on a portion of the slope that is concave along some dimensions. Mine will remain in the same basin, because it avoids making crazy-large jumps along directions that have large negative eigenvalues. With Tikhonov, all bets are off.

One important point is that backtracking is, of course, much cheaper than inverting the Hessian---so maybe you don't care about a lot of backtracking. However, with Tikhonov regularization, note that until you "turn the corner" on your most negative eigenvalue, you're not really making any substantial progress on any of the other negative eigenvalues. So if you have many negative eigenvalues, you're essentially doing coordinatewise descent until you make them all positive. With mine, you make progress on all of them simultaneously, albeit with a fresh Newton step on each iteration.

@timholy
Contributor Author

timholy commented Dec 2, 2015

This is fun, by the way. Thanks for the conversation so far.

@poulson

poulson commented Dec 2, 2015

Have you tried taking the absolute value of D in LDL factorization of an augmented form of Tikhonov least squares as a cheaper analogue of taking the absolute value of the sufficiently large eigenvalues? It might be a reasonable compromise, as a full eigenvalue decomposition is unfortunately unreasonable for a large sparse problem.

@timholy
Contributor Author

timholy commented Dec 2, 2015

Totally agreed on the fact that one needs a cheaper implementation of this idea. Haven't played around with this yet. Any code you'd suggest trying?

@poulson

poulson commented Dec 2, 2015

The augmented system formulation is really easy to form (using the regularized least squares formulation from http://web.stanford.edu/group/SOL/papers/seattleproc.pdf) and is meant to be driven by an unpivoted LDL factorization. As soon as you can form a matrix like [delta*I, A; A^T, -delta*I] and run LDL (and then modify D) you are good to go.
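To make that concrete, a Python sketch (the `delta` value and the 3x2 `A` below are made-up illustrative inputs, and `ldlt`/`augmented` are hypothetical names; the LDL here is the naive unpivoted dense version):

```python
# Form K = [delta*I, A; A^T, -delta*I] for a least-squares-type problem,
# run unpivoted LDL^T, then replace D by |D|. Because K is symmetric
# quasi-definite, the first m pivots come out positive and the last n negative.
def ldlt(K):
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = K[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        L[j][j] = 1.0
        for i in range(j + 1, n):
            L[i][j] = (K[i][j] - sum(L[i][k] * L[j][k] * d[k]
                                     for k in range(j))) / d[j]
    return L, d

def augmented(A, delta):
    m, n = len(A), len(A[0])
    K = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        K[i][i] = delta
        for j in range(n):
            K[i][m + j] = A[i][j]
            K[m + j][i] = A[i][j]
    for j in range(n):
        K[m + j][m + j] = -delta
    return K

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # made-up 3x2 example
L, d = ldlt(augmented(A, 1e-4))
d_abs = [abs(x) for x in d]                 # the suggested modification of D
```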

@timholy
Contributor Author

timholy commented Jan 28, 2016
