L-BFGS: Use scaled initial guess for inverse Hessian #484

Merged: 10 commits into JuliaNLSolvers:master from anriseth:lbfgsfixes, Nov 28, 2017
Conversation

@anriseth (Contributor) commented Nov 1, 2017

If a preconditioner is not used, I propose that we default to the scaled initial guess for the inverse Hessian given in Nocedal and Wright (see eq. 6.21 or 7.20 in the second edition).
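For reference, a minimal standalone sketch of that scaling (eq. 7.20: the initial inverse Hessian is `gamma*I`, with `gamma` computed from the latest curvature pair; all names below are illustrative, current Julia syntax):

```julia
using LinearAlgebra

s_prev = [0.1, -0.2]   # latest step, x_k - x_{k-1}
y_prev = [0.3, 0.1]    # latest gradient change, g_k - g_{k-1}
gamma  = dot(s_prev, y_prev) / dot(y_prev, y_prev)

q = [1.0, 2.0]   # vector entering the two-loop forward pass
s = gamma .* q   # apply H0 = gamma*I instead of H0 = I
```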

I have also included a `resetalpha` setting for L-BFGS. It should probably be the default for all line searches that `alphaguess!` sets `state.alpha = 1`; none of the algorithms I know of uses the previous step length as an initial guess anyway. (Maybe in the case of Momentum and AcceleratedGradientDescent, but they need a revamp of `alphaguess!` to work properly anyway.)
And as the state is now, BackTracking will perform terribly, as it can only decrease step lengths (see the sketch below).
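(To illustrate the BackTracking point, a schematic sketch with a simplified decrease test; `backtrack` and `phi` are illustrative names, not the LineSearches API:)

```julia
# alpha only ever decreases, so without resetting state.alpha = 1 each
# iteration, the carried-over step length can never grow back.
function backtrack(phi, alpha0; rho = 0.5)
    phi0 = phi(0.0)
    alpha = alpha0
    while phi(alpha) > phi0   # plain decrease test; Armijo slope term omitted
        alpha *= rho
    end
    return alpha
end

backtrack(a -> (a - 1.0)^2, 1.0)    # fresh start: accepts the full step 1.0
backtrack(a -> (a - 1.0)^2, 0.01)   # carried-over tiny alpha stays tiny
```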

@anriseth (Contributor, Author) commented Nov 1, 2017

We should probably benchmark the scaling vs. no scaling.
It probably has the biggest impact on BackTracking, whilst HagerZhang and MoreThuente seem to be living their own lives quite often.

@codecov bot commented Nov 1, 2017

Codecov Report

Merging #484 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   90.03%   90.07%   +0.04%     
==========================================
  Files          35       35              
  Lines        1796     1804       +8     
==========================================
+ Hits         1617     1625       +8     
  Misses        179      179
Impacted Files Coverage Δ
src/multivariate/solvers/first_order/l_bfgs.jl 98.43% <100%> (+0.22%) ⬆️


sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
s .= scaling*q
Member:

`s .= scaling.*q` to avoid allocating a temporary for `scaling*q`

Contributor Author:

Good catch :)
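(Indicative illustration of the allocation difference, with throwaway data:)

```julia
q = rand(1000); s = similar(q); scaling = 0.5
s .= scaling * q    # scaling*q allocates a temporary vector first
s .= scaling .* q   # fused broadcast writes into s with no temporary
```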

extrapolation step used to determine the initial step-size for the line search
step. The `snap2one` tuple provides boundaries to an interval in which the
initial step-size should be mapped to the unit step-size.
Note that `extrapolate=true` has higher priority than `resetalpha=true` (where resetalpha
Member:

is "precedence" a more precise word than "priority"?

Contributor Author:

Yep

@pkofod (Member) commented Nov 2, 2017

Cool stuff. Curious to see if it makes a difference!

@anriseth (Contributor, Author) commented Nov 3, 2017

Good to go. I'm also curious to see whether it makes a big difference or not. We need to get the benchmarking functionality up and running :p

The documentation on L-BFGS will have to be changed when we implement JuliaNLSolvers/LineSearches.jl#70, as the options related to alphaguess will be removed.

end

# Copy q into s for forward pass
# apply preconditioner if precon != nothing
# (Note: preconditioner update was done outside of this function)
A_ldiv_B!(devec_fun(s), precon, devec_fun(q))

if scaleinvH == true && typeof(precon) <: Void && upper > 0
Contributor:

I never studied this re-scaling. Do I understand correctly that this possibly chooses a different scaling factor at each iteration?

What is the reason to not apply the scaling when you have a preconditioner?

Contributor Author:

> Do I understand correctly that this possibly chooses a different scaling factor at each iteration?

Yes, it may change at each iteration. There's some discussion about it in N+W around eqns 6.21 and 7.20 (2nd ed.). A thorough discussion of this, and other(?), scalings can be found in Gilbert and Lemaréchal, "Some numerical experiments with variable-storage quasi-Newton algorithms".

> What is the reason to not apply the scaling when you have a preconditioner?

If someone implements a preconditioner which is the same as this scaling, then you'd be applying it twice. I originally looked at implementing the scaling in the form of a preconditioner that uses I, but it was quite troublesome with the current API for updating preconditioners.

Contributor:

But suppose I have a preconditioner that gets the metric right, but not the scaling? Specifically, \kappa(P^{-1/2} H P^{-1/2}) = O(1) but \sigma(P^{-1} H) \approx s with s far from 1. Then I can play some tricks to rescale it, but if LBFGS does it for me, then I'll be very happy! (As a matter of fact I have just such a situation.)

So my point is, would it be possible to have scaling turned off by default if a preconditioner is used, but allow the user to turn it on? E.g., in the LBFGS type construction it could just be a kwarg that checks whether a preconditioner is set or not.

@anriseth (Contributor, Author) commented Nov 8, 2017 via email

@anriseth (Contributor, Author) left a review comment:

This is ready, modulo

  • my two questions on view and devec_fun, and
  • benchmarks to determine whether this should be default true or false.

@simd for j in 1:n
@inbounds q[j] -= alpha[i] * dg_history[j, i]
end
dgi = view(dg_history, :, i)
Contributor Author:

Is a view useful at all, or can I just do `q .-= alpha[i] .* dg_history[:,i]`?

Member:

Since we're accessing memory in the right order here, I don't think this should be a problem.

@anriseth (Contributor, Author), Nov 17, 2017:

So should I keep it as is, or go back to `q .-= alpha[i] .* dg_history[:,i]`?

Member:

It's a bit difficult to see on my phone, but I'll have a look later. I doubt this is going to be a bottleneck.

Contributor:

Yes, the `vecdot` is correct. `update_state!` and `twoloop!` only ever see incoming real vectors (the `real_to_complex` calls are there for communication with the rest of the world, e.g. the preconditioner that expects a complex array). You could merge this line with the next one? The `view` is correct in the sense that it will not allocate a new array (as opposed to indexing without `view`). Speed-wise I'm not sure, but it's supposed to improve in the future, so I'd say keep it.

Contributor Author:

Great, thank you for the input.

> You could merge this line with the next one?

True, however, I feel that it is more readable this way (I've changed them all to be consistent locally).
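(For reference, the behaviour under discussion, with throwaway data; a slice copies the column, a view does not:)

```julia
dg_history = rand(5, 3); q = rand(5); alpha = rand(3); i = 2
q .-= alpha[i] .* dg_history[:, i]        # slice allocates a column copy
q .-= alpha[i] .* view(dg_history, :, i)  # view reads the column in place
```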

sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
@. s = scaling*q
Contributor Author:

@antoine-levitt Do I need to do `devec_fun` on `s` and `q` here?

@simd for j in 1:n
@inbounds s[j] += dx_history[j, i] * (alpha[i] - beta)
end
dxi = view(dx_history, :, i)
Contributor Author:

Same question on view as above.

@anriseth (Contributor, Author) commented Nov 17, 2017

I've compared LBFGS with and without `scaleinvH0` for the different line searches, and overall it seems like `scaleinvH0 = true` is the best default choice. It also seems like we can get some benefit from defaulting to BT3, but let's make decisions on the default linesearch and alphaguess later.

The figure below shows scatter plots of different performance measures for LBFGS with each (`scaleinvH0`, `linesearch`) combination, where the objective is to reach the relative tolerance f(x) <= f_min + 1e-10 (f_0 - f_min). There are 16 test problems, taken from my new branch on OptimTestProblems. The dots at the very top of each plot correspond to failures.

[Figure: scatter plots of performance measures for LBFGS with each (scaleinvH0, linesearch) combination; failures shown at the top of each panel]
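(The success criterion above as a one-liner; `converged` is an illustrative name:)

```julia
converged(f_x, f_0, f_min; rtol = 1e-10) = f_x <= f_min + rtol * (f_0 - f_min)
```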

@cortner (Contributor) commented Nov 18, 2017

This is very useful - thank you!

@anriseth (Contributor, Author):
I made some small updates. This is ready from my side.

@antoine-levitt (Contributor):
Just tested this on my problem, because the benchmarks you have converge pretty quickly (also, they have non-integer f/g calls?). I require pretty tight tolerances in my application, 1e-6 on the gradient:

|--------+------------+--------+--------|
| method | scaleinvH0 | fcalls | gcalls |
|--------+------------+--------+--------|
| HZ     | true       |   1027 |   1027 |
| HZ     | false      |    866 |    866 |
| BT     | false      |   2103 |    628 |
| BT     | true       |    372 |    364 |
|--------+------------+--------+--------|

So slight degradation for HagerZhang, large improvement for backtracking. I'm happy to do more tests if you like.

@@ -79,31 +96,38 @@ alphaguess = LineSearches.InitialStatic(),
linesearch = LineSearches.HagerZhang(),
P=nothing,
precondprep = (P, x) -> nothing,
manifold = Flat())
manifold = Flat(),
scaleinvH0::Bool = true && (T <: Void)
Contributor:

What is this `T <: Void`?

Contributor Author:

Ah, good catch. It should be `P::T` above, i.e., we default to `true` if `P = nothing`.

Contributor:

That's not very intuitive; I'd expect `T` to be the ambient scalar type. Why not `true && typeof(P) <: Void`, as in the other place?

Contributor Author:

Yes, that's better.
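(Hypothetical sketch of the resolved default, in Julia 0.6 era syntax where `typeof(nothing) == Void`; `make_lbfgs` is an illustrative stand-in for the `LBFGS` constructor:)

```julia
# Scaling defaults to on only when no preconditioner is supplied;
# the user can still override it either way, as cortner requested.
function make_lbfgs(; P = nothing, scaleinvH0::Bool = typeof(P) <: Void)
    return P, scaleinvH0
end

make_lbfgs()                               # (nothing, true): scaling on
make_lbfgs(P = eye(4))                     # scaling off with a preconditioner
make_lbfgs(P = eye(4), scaleinvH0 = true)  # explicit opt-in override
```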

@anriseth (Contributor, Author):
> Just tested this on my problem, because the benchmarks you have converge pretty quickly (also, they have non-integer f/g calls?).

Thank you for running more tests :) It is definitely valuable.
Non-integer values: my plots report performance ratios, which are functions of the number of f-calls, iterations, etc., to normalize across a collection of test problems. Here's a paper that explains the rationale of this approach: https://arxiv.org/pdf/cs/0102001.pdf
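A minimal sketch of how such profiles are computed (Dolan and Moré performance profiles; `t` and `perf_profile` are illustrative names, current Julia syntax):

```julia
# t[p, s] = cost (e.g. f-calls) of solver s on problem p; Inf marks a failure.
function perf_profile(t::Matrix{Float64}, taus)
    np, ns = size(t)
    best = [minimum(t[p, :]) for p in 1:np]   # best solver per problem
    r = t ./ best                             # performance ratios r[p, s]
    # fraction of problems that solver s solves within a factor tau of the best
    return [count(r[:, s] .<= tau) / np for tau in taus, s in 1:ns]
end
```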

@anriseth (Contributor, Author):
This is ready to merge from my side.

@cortner (Contributor) commented Nov 28, 2017

I look forward to trying this out!

@pkofod (Member) commented Nov 28, 2017

> Just tested this on my problem

What is "your problem" (ELI5), just for reference?

I'll try it out in a package of mine (the optimization part consists of some maximum likelihood problems)

@pkofod merged commit 7cf790d into JuliaNLSolvers:master on Nov 28, 2017
@pkofod (Member) commented Nov 28, 2017

Would love to hear back from all of you @anriseth @cortner @antoine-levitt if you do experiment with this. I'll try it out myself for sure.

@antoine-levitt (Contributor) commented Nov 28, 2017

Essentially, minimizing a Dirichlet energy of a (complex, matrix-valued) function with some orthogonality constraints (so you see what those PRs were about!), the target application being the computation of Wannier functions in solids. The paper is in preparation, so the code is not public yet, but it should be pretty soon (and I can send it privately to anyone interested). The problem is not very well-conditioned (I'm not preconditioning, mainly because I'm lazy), and the difficulty is really the local convergence: I'm already starting in a good place, and just optimizing to get to that local minimum to high precision. I use LBFGS with a lot of history (100). In particular, for line searches I would expect the functional to be very close to quadratic for most of the optimization. In my previous tests I found that LBFGS with Hager-Zhang gave the best results, but with this patch it seems to be backtracking.

@anriseth deleted the lbfgsfixes branch on November 28, 2017
@anriseth (Contributor, Author):
> Would love to hear back from all of you @anriseth @cortner @antoine-levitt if you do experiment with this. I'll try it out myself for sure.

I've attached the performance profiles for the test problems described in this paper. Using `scaleinvH0 = true` helps for all the line searches.
[Figure: performance profiles for LBFGS with and without scaleinvH0, for each line search]

The tolerance is set to \|g(x)\| <= 1e-10 \|g(x_0)\|, and maxiter to 20,000.
The test set contains nonlinear least-squares problems of dimensions 100 to 100,000. Each (problem, dimension) pair is run 100 times with randomized initial conditions (1800 problem instances in total).
