
Added a modified Newton's algorithm to handle both convex and concave… #123

Closed · wants to merge 1 commit

Conversation

@pkofod (Member) commented Jun 10, 2015

… regions.

This is an attempt to resolve some of the problems with Newton's method going in the wrong direction. Related to #104 and my discussion with @johnmyleswhite there. To see that it works with the original starting values, run

using Optim
prob = Optim.UnconstrainedProblems.examples["Himmelblau"]
ddf = TwiceDifferentiableFunction(prob.f, prob.g!, prob.h!)
res = Optim.modified_newton(ddf, [0.,0.])
@assert norm(res.minimum - prob.solutions) < 1e-2

@johnmyleswhite (Contributor)

This seems very promising. I've given you commit access to give you a little more freedom to work on projects like this, although do please exercise caution when merging changes. Ideally, we'll still do everything via explicit merge requests and review. I just want to lower friction slightly for you and show you that we appreciate the work you've been doing.

Do you have some academic citations to help me understand why this approach works?

@timholy (Contributor) commented Jun 10, 2015

It looks like it follows the time-honored tradition of forcing the Hessian to be positive-definite. Is there literature suggesting this is the best approach? I would have naively thought that adding something proportional to the diagonal would be a better strategy; if you think of each variable as having units, there's no guarantee they are scaled the same way.

@pkofod (Member, Author) commented Jun 10, 2015

You both raise interesting questions - I rather expected that. I just wanted to get a "globalized Newton's method" started, since I've seen the linesearch issue (caused by Newton's method producing an ascent direction) come up many times.

As @timholy mentions, this is simply the good old "force pos-def" trick of adding a number to the diagonal of the Hessian. I chose it because it is simple, and at the very least it serves as a proof of concept. The simplicity is a strength, but of course there are also drawbacks, as Nocedal and Wright mention:

"This strategy is quite simple and may be preferable to the modified factorization techniques described next, but it suffers from two drawbacks. The value of τ generated by this procedure may be unnecessarily large, which would bias the modified Newton direction too much toward the steepest descent direction. In addition, every value of τ_k requires a new factorization of A + τ_k*I, which can be quite expensive if several trial values are generated. (The symbolic factorization of A is performed only once, but the numerical factorization must be performed from scratch every time."

@timholy (Contributor) commented Jun 10, 2015

Of relevance to the considerations of symbolic/numerical factorization (which refers to sparse matrices): cholfact allows a shift keyword, which only supports A+τ*I for scalar τ. (In julia 0.3, this is cholfact(A, τ) rather than cholfact(A, shift=τ).) This argues for your strategy and, over the longer term, for some manual effort to exploit the separation between the analysis phase and the numerical computation phase.

@pkofod (Member, Author) commented Jun 10, 2015

I didn't know that, @timholy, although I can see it in the v0.4 docs. On v"0.4.0-dev+4986" I get

ERROR: unrecognized keyword argument "shift"

@timholy (Contributor) commented Jun 10, 2015

I was just going by what's in the docs, too, and only noticed this a few days ago myself. But I get:

julia> A = sprand(10,10,0.05)
10x10 sparse matrix with 5 Float64 entries:
        [1 ,  1]  =  0.497277
        [7 ,  1]  =  0.650883
        [3 ,  2]  =  0.626572
        [4 ,  5]  =  0.984572
        [2 ,  9]  =  0.123467

julia> B = A'*A
10x10 sparse matrix with 4 Float64 entries:
        [1 ,  1]  =  0.670933
        [2 ,  2]  =  0.392592
        [5 ,  5]  =  0.969383
        [9 ,  9]  =  0.0152441

julia> F = cholfact(B, shift=0.1)
Base.SparseMatrix.CHOLMOD.Factor{Float64,Int64}
type:          LLt
method: simplicial
maxnnz:         10
nnz:            10

julia> b = rand(10);

julia> F\b
10-element Array{Float64,1}:
 0.395839
 1.06682 
 2.31732 
 0.69799 
 0.245143
 2.6186  
 6.78997 
 6.38511 
 2.73964 
 0.411796

julia> full(B+0.1*I)\b
10-element Array{Float64,1}:
 0.395839
 1.06682 
 2.31732 
 0.69799 
 0.245143
 2.6186  
 6.78997 
 6.38511 
 2.73964 
 0.411796

which seems like it must have worked? (Note: only relevant for sparse matrices.)

@pkofod (Member, Author) commented Jun 10, 2015

I get it. I was trying to do it for a dense matrix. Why isn't that keyword available for dense matrices? @andreasnoack

I will see if I can find some literature (hopefully there are survey articles out there...), but I am open to trying different Hessian modifications instead of this simple version. On the other hand, I have a gut feeling that this might work reasonably well for small problems, though that is hard to say without some benchmarking. As I said earlier, I really did this PR to get things moving, so I'm open to suggestions (and willing to throw some time at it) :)

@timholy (Contributor) commented Jun 10, 2015

For sparse matrices, there are some opportunities that involve making decisions based on the pattern of sparsity, as these decisions amortize over different numerical values for those nonzero entries.

For dense matrices, there are no decisions to make: you just plug and chug. So there's no advantage in having such syntax for dense matrices (it's "all numerical").
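
In later versions of Julia (post-0.7 syntax, so not the 0.4 cholfact API discussed above), this split is exposed directly for sparse matrices: cholesky performs the symbolic analysis together with the first numerical factorization, and cholesky!(F, A; shift = tau) reuses the stored analysis while redoing only the numerical part for each trial shift. A small sketch:

using LinearAlgebra, SparseArrays

A = sprand(100, 100, 0.05)
B = A'A + 10I                     # symmetric positive definite test matrix
b = rand(100)
F = cholesky(B)                   # symbolic analysis + first numerical factorization
for tau in (0.1, 1.0, 10.0)
    cholesky!(F, B; shift = tau)  # reuses the analysis; numerical work only
    x = F \ b                     # solves (B + tau*I) x = b
end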

@pkofod (Member, Author) commented Jun 10, 2015

I guess it's just a matter of how often it would be used. As a shorthand, I guess it could still be there, although it doesn't add any numerical advantage.

@pkofod mentioned this pull request Dec 1, 2015
@pkofod (Member, Author) commented Feb 29, 2016

I can't exactly remember my plans with this. The fastest way to get a Newton's method that doesn't fail in concave regions would be to use https://github.com/timholy/PositiveFactorizations.jl. The clear con is that it adds a dependency. Have you played around with this since you made the package, @timholy? Are you still certain that it performs better? If it is indeed better, I would be happy to use it instead of the current "solution".

What do you say, @johnmyleswhite - is adding a dependency a big no-no?
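
For reference, a minimal sketch of what using the package looks like (current PositiveFactorizations syntax; H and g are made-up placeholders, not code from this PR):

using PositiveFactorizations, LinearAlgebra

H = [1.0 2.0; 2.0 1.0]     # an indefinite "Hessian": eigenvalues 3 and -1
g = [1.0, 1.0]             # a stand-in gradient
F = cholesky(Positive, H)  # never throws; factors a nearby positive definite matrix
d = -(F \ g)               # a guaranteed descent direction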

@timholy (Contributor) commented Feb 29, 2016

For me it's both fast and reliable. See the teh/constrained branch for an implementation that uses it.

@pkofod (Member, Author) commented Feb 29, 2016

Sounds good. If it's used in teh/constrained, I guess we can use it here as well. I just think we kind of need a Newton's method that doesn't throw a linesearch error in concave regions :) If the correction is close to zero for positive definite matrices, I actually think it should be our standard "Newton", but that is of course up for discussion.

@johnmyleswhite (Contributor)

I trust @timholy's judgment here.

@timholy (Contributor) commented Mar 1, 2016

> If the correction is close to zero for positive definite matrices, I actually think it should be our standard "Newton", but that is of course up for discussion.

It's not just "close to zero," it is zero, if the Hessian is positive definite.

Performance-wise, I seem to remember it was within a factor of 2 of LAPACK's Cholesky factorization, though of course this would be worth verifying independently. If you're looking for reasons not to make it the default, that factor of 2 (if it's there) is the only good reason I can think of. But it's worth pointing out that you can't efficiently find out whether the Hessian is positive definite until you start trying to Cholesky factor it. If it's not positive definite, then you throw an error, and have to start from scratch. In contrast, this algorithm just deals with problematic diagonals (those <= 0) as it goes; if it never encounters any, then it returns the standard Cholesky decomposition.
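
A sketch of that contrast (post-0.7 syntax; newton_direction is a made-up helper): the naive route may waste a full factorization and an exception before falling back, while the Positive factorization handles problematic pivots inline in a single pass:

using PositiveFactorizations, LinearAlgebra

function newton_direction(H, g)
    F = try
        cholesky(Symmetric(H))  # may throw PosDefException after O(n^3) work
    catch err
        err isa PosDefException || rethrow()
        cholesky(Positive, H)   # start over with the modified factorization
    end
    return -(F \ g)
end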

@pkofod (Member, Author) commented Mar 1, 2016

Okay, I must've read the README over there a bit too fast. Looks good. And a factor of 2 is probably no worse than going uphill :) The "fix" currently in this PR is most certainly very slow, so I'm not too worried about that. Orders of magnitude would of course be problematic for larger cases. I'll try it out and report back.

@pkofod (Member, Author) commented Mar 1, 2016

Closed in favour of #177

@pkofod closed this Mar 1, 2016