Modified cholesky #153
CC @poulson
Dynamic regularization can be dangerous: http://web.stanford.edu/~poulson/slides/LBNL15.pdf
I've a blocked Cholesky in the package that is faster than LAPACK for the 400x400 problem. We could try to combine my version with yours to get a regularized blocked LDLt.
The challenge will be replacing the
Wonderful if it could be a theoretically well-argued alternative to my "dead" #123 :)
I'll submit to your package, @andreasnoack. @poulson, thanks for the link. I have noticed that one does occasionally get bad condition numbers. On brief perusal I couldn't quite tell what the recommended solution is. In
there are some approaches that give much better condition numbers (e.g., the GMW-I algorithm, see Fig. 1). I haven't implemented those techniques (yet), though.
The trick is to use a constant diagonal shift and use the factorization as a preconditioner if necessary. To use a constant diagonal shift, factor, compute the dominant entry of E, then shift the original matrix by it and refactor. Jennifer Scott said that HSL does something similar.
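A minimal sketch of how I read this two-pass strategy; `constant_shift_factor` and its `modchol` argument are hypothetical names (a stand-in for a modified-Cholesky routine that returns the diagonal perturbation E), not functions in any existing package:

```julia
# Sketch only. `modchol` is assumed to return the diagonal perturbation E
# such that H + E is positive-definite.
function constant_shift_factor(H, modchol)
    E = modchol(H)            # first pass: modified/regularized factorization
    τ = maximum(diag(E))      # dominant entry of the perturbation
    cholfact(H + τ*I)         # second pass: constant shift of the original matrix, refactor
end
```

The resulting factorization of the shifted matrix could then be applied as a preconditioner for the original system if needed, as suggested above.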
Not sure I follow entirely. By "factor", which factorization? Are you basically saying I should go ahead and use the modified Cholesky factorization, but that after computing

More philosophically, I do have conceptual reservations about any strategy that uses constant diagonal shifts. These can be illustrated by attaching physical units to the entries in the parameter vector. For example, if the first parameter is measured in meters and the second in seconds, the units of your step are

Of course, it's also true that for the purposes of computation, all parameters are rescaled into (dimensionless) floating-point numbers, and roundoff does introduce some absolute scale. At the end of the day, it seems inevitable that one has to introduce a quantity which truncates at this scale. But it seems attractive not to have it affect any more parameters than necessary (if that's even possible).
@timholy Are you not first equilibrating your inputs before running your optimization procedure? This is often absolutely crucial for solving the netlib LP problems. If so, all of the SI issues are no longer relevant, as the equilibration would undo unit changes.

And I would suggest using a shift of the form
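For readers unfamiliar with equilibration, here is a rough sketch of symmetric diagonal scaling applied to the Newton system H*s = -g; the name `equilibrated_newton_step` and the 1/sqrt(|H_ii|) scaling are my own illustration of the idea, not code from either package:

```julia
# Scale so that the diagonal of the Hessian has unit magnitude, which removes
# the dependence on the physical units of the individual parameters.
function equilibrated_newton_step(g, H)
    d  = [h == 0 ? 1.0 : 1/sqrt(abs(h)) for h in diag(H)]
    D  = Diagonal(d)
    Hs = D*H*D             # equilibrated (dimensionless) Hessian
    ss = -(Hs \ (D*g))     # Newton step in the scaled variables
    D*ss                   # map back to the original variables
end
```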
Apologies in advance for a long post, but hopefully there are some interesting ideas here (someone might get a publication out of this 😄).

Rescaling is useful, but beside my real point. By mentally attaching units to things, you start thinking about problems more clearly---it tells you that certain operations don't make mathematical sense except in the artificial world in which everything gets translated into floating-point numbers. You might then start to worry that the answers you'll get are strongly dependent on your choice for how that translation occurs, and this might encourage you to come up with better alternatives. In the context of optimization, I ran into a nice example of this in implementing the Hager and Zhang linesearch (

The more I think about it, the more I think the direction taken in most of the literature here is simply wrong. Let's take a 1d example: suppose you're trying to minimize, but locally your function is concave. At your current search point, let

How would we do better? I'd argue that the right step is

Generalizing this to multiple dimensions, it suggests the "right" answer is the following:
The last choice is very similar to the typical usage of the SVD for inversion, in which one drops the tiny singular values. Except rather than equating 1/0 with 0, here we're equating 1/0 with a small positive number. What I like about this procedure is that:
The only downside I see---and it's a big one---is that the eigendecomposition is extraordinarily expensive. It would be great to find a Cholesky-like analog. Want to see if we can come up with something?
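For concreteness, here is the proposed step written out the way I read it; the notation is mine (a reconstruction, not the original wording), but it matches the `holy_step` code later in this thread:

```latex
Given the eigendecomposition $H = V \Lambda V^{T}$ with eigenvalues $\lambda_i$
and $\lambda_{\max} = \max_i |\lambda_i|$, take the step
\[
  s = -V \, \tilde{\Lambda}^{-1} V^{T} g, \qquad
  \big(\tilde{\Lambda}^{-1}\big)_{ii} =
  \begin{cases}
    1/|\lambda_i|,    & |\lambda_i| > \delta\,\lambda_{\max},\\
    1/\lambda_{\max}, & \text{otherwise},
  \end{cases}
\]
for a small tolerance $\delta$ (the code below uses $\delta = \sqrt{\epsilon}$).
```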
If you are unhappy with the shift function

I am a bit surprised that you are suggesting to take the absolute value of potentially very negative eigenvalues, as this would be a very large perturbation. If you instead preserved their sign (though this would not yield an SPD result), your formula would be essentially a small perturbation of a (numerical) pseudoinverse, whose application is roughly equivalent to Tikhonov regularization with a small damping parameter. There are algorithms (especially from Saunders) that solve regularized least squares using sparse Cholesky-like factorizations (see http://web.stanford.edu/group/SOL/papers/seattleproc.pdf). Perhaps you could take the absolute value of the D in the LDL factorization of such a formulation to achieve an analogue of your proposed technique. Also, please excuse the major edit.
When the matrix is already (nearly) positive-definite, what I proposed is indeed the pseudoinverse. But when it's far from positive-definite, it's nowhere close to the pseudoinverse/Tikhonov regularization. That's kinda the point---what I'm saying is that you don't actually want to solve the KKT system, or a minimal perturbation, when you have large negative eigenvalues. Instead, you want to solve a large perturbation 😄.

For the purposes of making discussion/visualization easy, let's suppose your Hessian is actually diagonal, so the eigenvectors are just the coordinate axes. Let's also suppose that some of the entries are large in magnitude (way, way beyond roundoff error) but negative. Then the minimally-stable inverse, by the "usual" approach (Tikhonov-like regularization), is

The deep problem with this approach is that it's conflating something you know with great confidence (eigenvalues of very large magnitude, which just happen to be negative) with roundoff error. Since you actually know the value of those large, negative eigenvalues, why wouldn't you want to use them? Certainly, you don't want to go zooming off towards infinity along the corresponding axes, because you know, with great confidence, that your objective function has curvature along those directions (it just happens to be "upside down"). The only directions you might consider taking really big steps along are those for which the corresponding eigenvalue is truly small in magnitude, because you know that things won't change very much unless you take big steps along those axes. However, if you do Tikhonov-like regularization, what originally were small-magnitude eigenvalues are no longer near 0, so you take a step that is much too small along those directions. In other words, Tikhonov regularization only gets the right answer for the directions with largest positive eigenvalues; all others are essentially "inverted" from what they should be (small steps along directions that need a big step, and big steps along directions that need small steps). Bingo, you've got very slow convergence.

To say it differently, to seek a factorization of
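To make the "inverted" step lengths concrete, here is a tiny numeric illustration with a diagonal Hessian; the numbers are made up (my own example), following the two step rules discussed above:

```julia
λ = [-5.0, 1e-3, 2.0]      # eigenvalues of a diagonal Hessian: large negative,
g = [1.0, 1.0, 1.0]        # tiny positive, large positive; gradient of ones

δ = 1e-8
shift = δ - minimum(λ)     # ≈ 5, makes the most negative eigenvalue (barely) positive
step_tikhonov = [-g[i] / (λ[i] + shift) for i = 1:3]   # ≈ [-1e8,   -0.2,    -0.14]
step_abs      = [-g[i] / abs(λ[i])      for i = 1:3]   # =  [-0.2, -1000.0,  -0.5 ]
```

The shifted step is enormous along the confidently-negative direction and tiny along the nearly-flat one, i.e. exactly backwards; the absolute-value rule does the opposite.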
I think you responded to my original comment rather than the edit.
Yes, sorry about that---I must have written it while disconnected from the internet.
This is the crux of the issue. I am indeed arguing that this is the right thing to do. And yes, I'm aware this is unconventional. To be very explicit:

```julia
# Objective that, near [0,0], has one negative eigenvalue (dimension 1)
# and one positive eigenvalue (dimension 2)
objective(x) = (-x[1]^2 + x[1]^6) + (x[2]-3)^2
grad(x) = [-2x[1] + 6*x[1]^5, 2*(x[2]-3)]
hess(x) = [-2 + 30*x[1]^4 0;
0 2]
function tikhonov_step(g, H, δ)
D, V = eig(H)
Dmin = minimum(D)
if Dmin < δ
H = H + (δ-Dmin)*I
end
-H\g
end
function holy_step(g, H, δ)
D, V = eig(H)
Dabs = abs(D)
Dmax = maximum(Dabs)
if Dmax == 0
return -g
end
reliable = Dabs .> δ*Dmax
Dinv = similar(D)
Dinv[reliable] = 1./Dabs[reliable]
Dinv[!reliable] = 1/Dmax
-(V*(Dinv .* (V'*g)))
end
xlist = Any[[-0.001,-0.001],[-0.01,-0.01],[-0.1,-0.1],[-1.0,-1.0]]
δ = sqrt(eps())
println("Tikhonov")
for x in xlist
g = grad(x)
H = hess(x)
xstep = tikhonov_step(g, H, δ)
@show x xstep objective(x) objective(x+xstep)
end
println("\nHoly")
for x in xlist
g = grad(x)
H = hess(x)
xstep = holy_step(g, H, δ)
@show x xstep objective(x) objective(x+xstep)
end
```

Results:
With Tikhonov, any time you start somewhere in the "stripe" between -0.6 and 0.6 along dimension 1, you'll spend a lot of time backtracking. With mine, every step reduces the objective function. When used iteratively,
For mine, the trickiest spot occurs near where the 2nd derivative vanishes along dimension 1:

```julia
julia> x = [(1/15)^(1/4), 0]
2-element Array{Float64,1}:
0.508133
0.0
julia> H = hess(x)
2x2 Array{Float64,2}:
0.0 0.0
0.0 2.0
julia> g = grad(x)
2-element Array{Float64,1}:
-0.813012
-6.0
julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
0.406506
3.0
julia> x = [(1/15)^(1/4)+0.0001, 0]
2-element Array{Float64,1}:
0.508233
0.0
julia> H = hess(x)
2x2 Array{Float64,2}:
0.00157486 0.0
0.0 2.0
julia> g = grad(x)
2-element Array{Float64,1}:
-0.813012
-6.0
julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
516.245
3.0
julia> x = [(1/15)^(1/4)+0.01, 0]
2-element Array{Float64,1}:
0.518133
0.0
julia> H = hess(x)
2x2 Array{Float64,2}:
0.162148 0.0
0.0 2.0
julia> g = grad(x)
2-element Array{Float64,1}:
-0.81221
-6.0
julia> holy_step(g, H, δ)
2-element Array{Float64,1}:
5.00906
3.0
```

You have to do some backtracking, but it's not to the tune of

Compare Tikhonov at the same point:

```julia
julia> x = [(1/15)^(1/4), 0]
# omitted
julia> tikhonov_step(g, H, δ)
2-element Array{Float64,1}:
5.45603e7
3.0
```

which seems much worse.

Also note that my algorithm is much more stable: suppose the function has many local minima, and the user (by dint of extraordinary insight) has positioned the starting point in the basin of the global minimum, however on a portion of the slope that is concave along some dimensions. Mine will remain in the same basin, because it avoids making crazy-large jumps along directions that have large negative eigenvalues. With Tikhonov, all bets are off.

One important point is that backtracking is, of course, much cheaper than inverting the Hessian---so maybe you don't care about a lot of backtracking. However, with Tikhonov regularization, note that until you "turn the corner" on your most negative eigenvalue, you're not really making any substantial progress on any of the other negative eigenvalues. So if you have many negative eigenvalues, you're essentially doing coordinatewise descent until you make them all positive. With mine, you make progress on all of them simultaneously, albeit with a fresh Newton step on each iteration.
This is fun, by the way. Thanks for the conversation so far.
Have you tried taking the absolute value of D in LDL factorization of an augmented form of Tikhonov least squares as a cheaper analogue of taking the absolute value of the sufficiently large eigenvalues? It might be a reasonable compromise, as a full eigenvalue decomposition is unfortunately unreasonable for a large sparse problem.
Totally agreed on the fact that one needs a cheaper implementation of this idea. Haven't played around with this yet. Any code you'd suggest trying?
The augmented system formulation is really easy to form (using the regularized least squares formulation from http://web.stanford.edu/group/SOL/papers/seattleproc.pdf) and is meant to be driven by an unpivoted LDL factorization. As soon as you can form a matrix like |
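To make this concrete, here is my rough sketch of the augmented system for regularized least squares min ||A*x - b||^2 + δ^2*||x||^2 (following the formulation in the linked Saunders paper), together with a toy dense unpivoted LDLᵀ loop. The helpers `ldl_unpivoted` and `ident` are purely illustrative; in practice one would use a sparse LDLᵀ solver:

```julia
# Toy dense, unpivoted LDL' factorization: K ≈ L*Diagonal(d)*L'.
# Illustration only: no pivoting, no sparsity.
function ldl_unpivoted(K)
    n = size(K, 1)
    L = zeros(n, n)
    d = zeros(n)
    for j = 1:n
        L[j, j] = 1.0
        s = K[j, j]
        for k = 1:j-1
            s -= L[j, k]^2 * d[k]
        end
        d[j] = s
        for i = j+1:n
            t = K[i, j]
            for k = 1:j-1
                t -= L[i, k] * L[j, k] * d[k]
            end
            L[i, j] = t / d[j]
        end
    end
    L, d
end

# Augmented (symmetric quasi-definite) system for min ||A*x - b||^2 + δ^2*||x||^2:
#   [ I    A  ] [r]   [b]
#   [ A' -δ²*I] [x] = [0]
ident(k) = [i == j ? 1.0 : 0.0 for i = 1:k, j = 1:k]
m, n = 10, 4
A = randn(m, n); δ = 1e-4
B = -δ^2 * ident(n)
K = [ident(m) A; A' B]
L, d = ldl_unpivoted(K)
dabs = [abs(x) for x in d]   # the proposed modification: take |D| before solving
```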
I've written a pure-Julia implementation of a modified Cholesky factorization (GMW81), which can be used in Newton-like methods to guarantee descent. (It's worth noting that this line might not choose a descent direction if `H` is not positive-definite.) For those who don't know, this computes an LDLT factorization of a matrix `H+E`, where `E` is a diagonal matrix that is "as small as possible" yet makes `H` positive-definite. When `H` is already positive-definite, `E` is typically zero.

In testing (matrix size 400x400) it's about 4-5x slower than `cholfact`, which is pretty good considering I'm going up against multithreaded BLAS. It's also 4-5x faster than `eigfact`, which presents an alternative way to carry out this operation. The optimizations I've done so far consisted of adding `@inbounds` and `@simd` in a couple places, so it's quite possible one could do even better.

I'd be happy to share it. Do we have a good package in which to stash such things? CC @jiahao and @ViralBShah, who do not watch this repo but whose work on IterativeSolvers is at least conceptually related.
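As a usage sketch of how such a factorization would typically plug into a Newton-type method; `modchol_gmw81` is a placeholder name (not the actual function in the implementation described above), assumed to return `L` and `d` with `H + E = L*Diagonal(d)*L'` positive-definite:

```julia
# Since H + E is positive-definite, the resulting direction is guaranteed to
# be a descent direction whenever g ≠ 0.
function newton_descent_step(g, H, modchol_gmw81)
    L, d = modchol_gmw81(H)   # H + E = L * Diagonal(d) * L'
    y = L \ (-g)              # solve L*y = -g
    s = L' \ (y ./ d)         # solve L'*s = D⁻¹*y
    return s                  # dot(g, s) < 0 for g ≠ 0
end
```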