
Manifold optimization #433

Closed

antoine-levitt opened this issue Jun 17, 2017 · 28 comments

Comments

@antoine-levitt
Contributor

A number of problems involve minimizing a function on a Riemannian manifold defined by geometric constraints. Examples include the eigenvalue problem (via minimization of the Rayleigh quotient under orthogonality constraints), nonlinear models in quantum chemistry (e.g. DFT), and several problems in matrix approximation. There are several toolboxes available: ROPTLIB has Julia bindings, and there's https://github.com/NickMcNutt/ManifoldOptim.jl, but it's abandoned (I emailed the author). One issue with these efforts is that they essentially duplicate code from a general unconstrained optimizer.

As far as I can tell, the adaptation of an unconstrained optimizer to Riemannian optimization is pretty easy (at least for first-order algorithms): basically, it amounts to projecting the gradient onto the tangent space immediately after computing it, and retracting iterates back to the manifold after each step. Therefore, an implementation in Optim.jl would not be very intrusive. There would be a Manifold type with subtypes (e.g. Flat, Sphere, Stiefel...), functions like retract(M::Manifold, x) implemented for every subtype, and a manifold option in Options, defaulting to Flat.
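For concreteness, a rough sketch of what such an interface could look like (the names mirror the proposal; none of this is existing Optim.jl code):

    using LinearAlgebra

    # Hypothetical interface sketch, not Optim.jl API.
    abstract type Manifold end

    struct Flat <: Manifold end
    retract(::Flat, x) = x                                # unconstrained: identity
    project_tangent(::Flat, g, x) = g                     # gradient unchanged

    struct Sphere <: Manifold end
    retract(::Sphere, x) = x / norm(x)                    # map back onto the unit sphere
    project_tangent(::Sphere, g, x) = g - dot(x, g) * x   # drop the radial component

A solver would call project_tangent right after each gradient evaluation and retract after each step; the Flat methods are no-ops, so the unconstrained path is unchanged.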

This issue is here to gauge interest: if I submit a PR implementing the bare bones of this (e.g. gradient descent and CG for the sphere and Stiefel manifolds, and tests for the linear eigenvalue problem), will it get merged? The only potential issue I can see is that it does involve more code to develop/test for every kind of optimizer, but not all of them have to support manifolds, and support can be added incrementally as interest arises.

(cc @cortner, who might be interested in the functionality)

@cortner
Contributor

cortner commented Jun 17, 2017

Definitely interested, and I have thought about it; it just never became urgent enough. I'd also like to suggest adding curved linesearches.

@pkofod
Member

pkofod commented Jun 17, 2017

This issue is here to gauge interest: if I submit a PR implementing the bare bones of this (e.g. gradient descent and CG for the sphere and Stiefel manifolds, and tests for the linear eigenvalue problem), will it get merged?

Sure, why not. By your description, I think this is very easy to implement in a way that does not affect the compiled code of our current algorithms for our current types of problems at all (empty fallbacks etc.).

Is there some sort of intro reference for this? It is not my field at all!

@timholy
Contributor

timholy commented Jun 17, 2017

How does this differ from general equality-constrained optimization? g(x) = 0 defines a manifold.

@ChrisRackauckas
Contributor

It can differ because you can force the linesearch to stay within the manifold, instead of searching the full space but relaxing to the manifold via a constraint. But arguably the best way to do this is on the user's side: choosing coordinates which are themselves constrained to be in the manifold (which would then force this on the linesearch). Or you can project to the manifold with each change, adding a nonlinear solve between each step. This can be required, though, if the equation off the manifold is just not well-behaved.

@timholy
Contributor

timholy commented Jun 17, 2017

It can differ because you can force the linesearch to stay within the manifold, instead of searching the full space but relaxing to the manifold via a constraint.

That's well-known to be an inefficient approach for most problems, because it forces you to take tiny steps or perform second-order corrections to high precision. Of course there are cases where you can't evaluate your objective off the manifold, so there is a place for this, but it should be understood that we'll get a much bigger benefit from solving the general problem. That said, projection methods are pretty simple in some ways, and so getting something simple merged might be a fast way to get some of the benefits of #303.

@ChrisRackauckas
Contributor

ChrisRackauckas commented Jun 17, 2017

In DiffEq, the callbacks are powerful enough that these kinds of projection methods can just be implemented as callbacks, and there's a library of standard callback methods which includes a projection method. Maybe that's a good approach for Optim as well: instead of implementing a projection directly, allow the user to inject a function after each update, and provide an example where this performs a manifold projection. The same interface could then serve a lot of other purposes as well.

@cortner
Contributor

cortner commented Jun 17, 2017

Eigenvalue problems and harmonic maps are examples where staying on the manifold tends to be advantageous, I think.

@cortner
Contributor

cortner commented Jun 17, 2017

But in general I agree with @timholy. It's just that some of these special cases are incredibly important.

@antoine-levitt
Contributor Author

There's a recent report on ROPTLIB at http://www.math.fsu.edu/~whuang2/pdf/ROPTLIB_11WH_techrep.pdf with references to the literature. The basic idea is that if you want to minimize a function defined on a sphere, instead of going in the direction of the gradient, you project the gradient onto the tangent space of the sphere at the point you're at, take a step, and then map ("retract") back to the sphere. There are subtleties in extending more complex algorithms like CG and BFGS (how do you combine tangent vectors at different points on the manifold? what Hessian should BFGS approximate?), and a whole lot of manifolds and variants. The sphere example is trivial, but the main interest in applications (at least from my perspective) is minimizing functionals on the space of matrices with orthonormal columns (the Stiefel manifold).

Compared to general constrained optimization, the benefit is that, when the manifold is simple enough that you can do the projection analytically, you can basically treat it as an unconstrained optimization problem and re-use the same algorithms. The typical use case would be conjugate gradient or BFGS with orthogonality constraints, and I would imagine it's pretty hard to beat this kind of method with a general constrained optimization algorithm (but I don't have a good benchmark).
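As a self-contained toy illustration of the sphere case (not Optim code; the fixed step size and iteration count are arbitrary):

    using LinearAlgebra

    # Minimize the Rayleigh quotient x'Ax over the unit sphere by projected
    # gradient descent: project the gradient, take a step, renormalize.
    function rayleigh_minimize(A; alpha = 0.02, iters = 2000)
        x = normalize(randn(size(A, 1)))   # start on the sphere
        for _ in 1:iters
            g = 2 * (A * x)                # Euclidean gradient of x'Ax
            g -= dot(x, g) * x             # project onto the tangent space at x
            x = normalize(x - alpha * g)   # step, then retract back to the sphere
        end
        return dot(x, A * x), x            # approximate smallest eigenpair
    end

    rayleigh_minimize(Diagonal(1.0:10.0))  # first output should approach 1.0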

@timholy
Contributor

timholy commented Jun 17, 2017

Yes, if you can do analytic projection then this is a good strategy. I'd say go for it; @ChrisRackauckas is right about thinking about this primarily in terms of user-supplied functions for performing the projection.

@pkofod
Member

pkofod commented Jun 17, 2017

Ah, okay! Well, I still think this will be a change that doesn't hurt anything, and if someone stands to gain by explicitly handling the "manifold constraint" using an analytical projection, then let's go for it.

@antoine-levitt
Contributor Author

I had a look at implementing this and it is slightly trickier than I thought it would be. In particular, line searches need to be aware of this, because the line search is not on f(x + alpha*d) but on f(R(x + alpha*d)), where R is a retraction (a projection onto the manifold). I've thought of a few ways of doing this:

  • have "point on a manifold" and "tangent vector" types defined, with operations like x + d allowed (implicitly doing a retraction) but not x + y, where x and y are points and d is a tangent vector. Pro: looks very nice; in principle Optim does not need to know anything about manifolds. Con: not compatible with things like
    @simd for i in 1:n
        @inbounds x_scratch[i] = x[i] + alpha * s[i]
    end

or x .+= alpha.*s

  • make Optim and LineSearches aware of manifolds, and insert project_tangent() and retract() calls everywhere. Pro: possibly the simplest to implement. Con: very verbose; need to modify each solver and linesearch, need to put manifolds in NLSolversBase, might introduce tricky bugs if I forget one call, and puts an extra burden on all future code
  • rewrite the linesearch API to be more like a 1D search: instead of accepting a general multidimensional objective function, accept a 1D objective function. Pro: much cleaner; simplifies the job of linesearches. Con: needs major refactoring of LineSearches
  • fake an objective function that does retraction, and pass that to the linesearches. Pro: no need to modify NLSolversBase or LineSearches. Con: very ugly, possibly some overhead
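For what it's worth, a rough sketch of this last option (retract and project_tangent stand for hypothetical per-manifold helpers, not anything in Optim):

    # Wrap the user's objective and gradient so the linesearch never sees the
    # manifold: every evaluation retracts the trial point first, and the
    # gradient is the projected gradient at the retracted point (which only
    # matches the true gradient of f(R(x)) on the manifold itself).
    function retracted_objective(f, g!, M)
        fw(x) = f(retract(M, x))
        function gw!(storage, x)
            xr = retract(M, x)
            g!(storage, xr)
            storage .= project_tangent(M, storage, xr)
            return storage
        end
        return fw, gw!
    end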

Thoughts?

@timholy
Contributor

timholy commented Jun 20, 2017

rewrite the linesearch API to be more like a 1D search

This should happen anyway, so even if painful seems unquestionably the way to go.

@cortner
Contributor

cortner commented Jun 20, 2017

The @simd loops need to go as well and be replaced by broadcasts; see also #399.

Without thinking about it further, I quite like the first approach. It would only be feasible when the projections are relatively cheap anyhow; otherwise one would switch to standard constrained optimisation, I think.

@cortner
Contributor

cortner commented Jun 20, 2017

fake an objective function that does retraction

I don't like this; my current experiences with the string method suggest this will be awful

make Optim and LineSearches aware of manifolds

Agree this will be bug-prone.

@anriseth
Contributor

rewrite the linesearch API to be more like a 1D search

This should happen anyway, so even if painful seems unquestionably the way to go.

Are there any line search algorithms out there where this would not work? If not, then I agree that this makes sense. I would make sure evaluations of the derivative based on the gradient of the objective are propagated back to Optim somehow (I don't think that's very difficult).

@cortner
Contributor

cortner commented Jun 20, 2017

The only case I can think of immediately that needs thinking about - and I am not even sure this is an exception - is non-monotone linesearch, because it depends on the history of the optimisation. Does anybody have experience with that?

@antoine-levitt
Contributor Author

I don't think non-monotone linesearch fits into the current API either; it's pretty much orthogonal. As far as I can tell, a line search is exactly a 1D (approximate) minimization problem, and the linesearches already redefine the 1D function (e.g. linefunc! in hagerzhang), so it would make more sense to just pass them a function (and the function can include some machinery to store the gradient evaluations, akin to what's done now).
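A rough sketch of what "just pass it a function" could look like (hypothetical helper, not the current LineSearches API):

    using LinearAlgebra

    # Build a 1D line objective alpha -> (value, slope) along direction d from x,
    # caching the last gradient so the caller can reuse it after the search.
    # In the manifold setting, xnew would instead be retract(M, x + alpha * d)
    # and the slope would use the projected gradient.
    function make_linefunc(f, g!, x, d)
        gcache = similar(x)
        function phi(alpha)
            xnew = x .+ alpha .* d
            g!(gcache, xnew)
            return f(xnew), dot(gcache, d)   # value and directional derivative
        end
        return phi, gcache
    end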

About a manifold type, I see how it would work with updates like x = x + alpha * d, and even for x .+= d (one would just need to overload broadcast for +, pass it to the data inside x and d, and retract back), but how to do it for something like x .+= alpha.*d1 .+ d2? broadcast gets passed an arbitrary function, which would be hard to analyze.

If that can be done, it may not be a bad idea... there would just be a type for a point on a manifold and a type for a tangent vector (which stores its base point so it can do vector transport if needed), and the only thing Optim needs to know is that it should project the gradient immediately after computing it, and that it should only do "legitimate" manifold operations (i.e. add tangent vectors to points, vecdot() tangent vectors with each other, etc.). The user would just call optimize(obj_function, gradient_function!, Stiefel(input_data)) and everything would just work. Basically #399 on steroids: optimization doesn't even need a Hilbert space, it only needs a Riemannian manifold with a retraction (x + d, where x is a point on the manifold and d a tangent vector to the manifold at x), a vector transport (dx + dy, where dx is a tangent vector at x and dy is a tangent vector at y), and a projection onto the tangent space.
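To make that concrete, a minimal sketch of such a type pair (illustrative only; Sphere and retract are placeholders, and the tangent-vector addition assumes a trivial vector transport):

    using LinearAlgebra

    struct Sphere end
    retract(::Sphere, x) = x / norm(x)

    # A point on a manifold, and a tangent vector carrying its base point.
    struct MPoint{M,T}
        manifold::M
        x::T
    end

    struct MTangent{M,T}
        manifold::M
        base::T   # base point of the tangent space
        v::T
    end

    # point + tangent vector: take the step, then retract back onto the manifold
    Base.:+(p::MPoint, d::MTangent) = MPoint(p.manifold, retract(p.manifold, p.x + d.v))

    # scalar * tangent and tangent + tangent stay in the tangent space; a real
    # implementation would vector-transport d2 to d1's base point first.
    Base.:*(a::Number, d::MTangent) = MTangent(d.manifold, d.base, a * d.v)
    Base.:+(d1::MTangent, d2::MTangent) = MTangent(d1.manifold, d1.base, d1.v + d2.v)

With these definitions, an update like x + (alpha * d1 + d2) combines the tangent vectors first and then performs a single retraction.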

Possibly the same interface can even be reused directly for DiffEq to solve differential equations on manifolds automatically, without callbacks. If that works, it would be pretty awesome ;-)

@cortner
Contributor

cortner commented Jun 20, 2017

About a manifold type, I see how it would work with updates like x = x + alpha * d, and even for x .+= d (one would just need to overload broadcast for +, pass it to the data inside x and d, and retract back), but how to do it for something like x .+= alpha.*d1 .+ d2? broadcast gets passed an arbitrary function, which would be hard to analyze

d1 and d2 are vectors in the tangent space, and alpha is a scalar, so alpha * d1 + d2 is just standard addition. x, on the other hand, is a point on the manifold, so x + d is the special addition. One just needs to distinguish points from vectors.

@antoine-levitt
Contributor Author

antoine-levitt commented Jun 20, 2017

That's right, which is why x += alpha*d1 + d2 would be trivial to implement (overload scalar*vector, vector+vector, point+vector), my concern is with broadcast: how do you do that when it gets lowered to broadcast(some_fun, x, alpha, d1, d2)?

@cortner
Contributor

cortner commented Jun 20, 2017

Basically #399 on steroids: optimization doesn't even need a Hilbert space, it only needs a Riemann manifold

I like that. I think we should implement a prototype in a separate branch.

@cortner
Contributor

cortner commented Jun 20, 2017

That's right, which is why x += alpha*d1 + d2 would be trivial to implement (overload scalar*vector, vector+vector, point+vector), my concern is with broadcast: how do you do that when it gets lowered to broadcast(some_fun, x, alpha, d1, d2)?

I see - I misunderstood. For what it's worth I always thought doing everything in-place in those operations is overkill. But maybe I am wrong, maybe worth thinking about more carefully.

@johnmyleswhite
Contributor

I see - I misunderstood. For what it's worth I always thought doing everything in-place in those operations is overkill. But maybe I am wrong, maybe worth thinking about more carefully.

The original code didn't do things in-place and was noticeably slower. That was long ago in Julia history, but I'd want to see benchmarks before removing that optimization.

@cortner
Contributor

cortner commented Jun 20, 2017

Then I suspect that the objectives I am used to are just significantly more costly than the ones you used for the benchmarks.

@cortner
Contributor

cortner commented Jun 20, 2017

Anyhow why can't one dispatch the broadcast on the types of the arguments?

@johnmyleswhite
Contributor

Anyhow why can't one dispatch the broadcast on the types of the arguments?

Indeed, using multiple dispatch for optimizations is a large benefit of lowering dot-syntax to broadcast.

@pkofod
Member

pkofod commented Jun 20, 2017

The problems that we have in UnconstrainedProblems are so cheap to evaluate that in-place updating makes a difference. Some of the larger problems in CUTEst will take much longer to evaluate the objective and gradients than to update the current iterate. However, I do think that we need to cater to both needs: I suspect that Optim is used to optimize "cheap" functions repeatedly quite a lot, and for those it would be a regression to remove in-place operations, even if only for updating the iterate.

@pkofod
Member

pkofod commented Sep 27, 2017

Thanks for this. If someone feels like some of the discussion should be continued, please open new issues for each topic (the discussion seemed to go a bit off-topic).

@pkofod pkofod closed this as completed Sep 27, 2017