L-BFGS: Use scaled initial guess for inverse Hessian #484

Merged: 10 commits into JuliaNLSolvers:master from anriseth:lbfgsfixes, Nov 28, 2017
Conversation

@anriseth (Contributor) commented Nov 1, 2017

If a preconditioner is not used, I propose that we default to the scaled initial guess for the inverse Hessian given in Nocedal and Wright (see eq. 6.21 or 7.20 in the second edition).
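For reference, a minimal standalone sketch of that scaling (eq. 7.20: the initial inverse Hessian is `gamma*I`, with `gamma` computed from the latest curvature pair; all names below are illustrative, current Julia syntax):

```julia
using LinearAlgebra

s_prev = [0.1, -0.2]   # latest step, x_k - x_{k-1}
y_prev = [0.3, 0.1]    # latest gradient change, g_k - g_{k-1}
gamma  = dot(s_prev, y_prev) / dot(y_prev, y_prev)

q = [1.0, 2.0]   # vector entering the two-loop forward pass
s = gamma .* q   # apply H0 = gamma*I instead of H0 = I
```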

I have also included a `resetalpha` setting for L-BFGS. It should probably be the default for all line searches that `alphaguess!` sets `state.alpha = 1`; none of the algorithms I know of uses the previous step length as an initial guess anyway. (Maybe in the case of Momentum and AcceleratedGradientDescent, but they need a revamp of `alphaguess!` to work properly anyway.)
And as the state is now, BackTracking will perform terribly, as it can only decrease step lengths (see the sketch below).
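(To illustrate the BackTracking point, a schematic sketch with a simplified decrease test; `backtrack` and `phi` are illustrative names, not the LineSearches API:)

```julia
# alpha only ever decreases, so without resetting state.alpha = 1 each
# iteration, the carried-over step length can never grow back.
function backtrack(phi, alpha0; rho = 0.5)
    phi0 = phi(0.0)
    alpha = alpha0
    while phi(alpha) > phi0   # plain decrease test; Armijo slope term omitted
        alpha *= rho
    end
    return alpha
end

backtrack(a -> (a - 1.0)^2, 1.0)    # fresh start: accepts the full step 1.0
backtrack(a -> (a - 1.0)^2, 0.01)   # carried-over tiny alpha stays tiny
```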

@anriseth (Contributor, Author) commented Nov 1, 2017

We should probably benchmark the scaling vs. no scaling.
It probably has the biggest impact on BackTracking, whilst HagerZhang and MoreThuente seem to be living their own lives quite often.

@codecov bot commented Nov 1, 2017

Codecov Report

Merging #484 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   90.03%   90.07%   +0.04%     
==========================================
  Files          35       35              
  Lines        1796     1804       +8     
==========================================
+ Hits         1617     1625       +8     
  Misses        179      179
Impacted Files Coverage Δ
src/multivariate/solvers/first_order/l_bfgs.jl 98.43% <100%> (+0.22%) ⬆️


sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
s .= scaling*q
Member:

`s .= scaling.*q` to avoid allocating a temporary for `scaling*q`

Contributor Author:

Good catch :)
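(Indicative illustration of the allocation difference, with throwaway data:)

```julia
q = rand(1000); s = similar(q); scaling = 0.5
s .= scaling * q    # scaling*q allocates a temporary vector first
s .= scaling .* q   # fused broadcast writes into s with no temporary
```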

extrapolation step used to determine the initial step-size for the line search
step. The `snap2one` tuple provides boundaries to an interval in which the
initial step-size should be mapped to the unit step-size.
Note that `extrapolate=true` has higher priority than `resetalpha=true` (where resetalpha
Member:

is "precedence" a more precise word than "priority"?

Contributor Author:

Yep

@pkofod (Member) commented Nov 2, 2017

Cool stuff. Curious to see if it makes a difference!

@anriseth (Contributor, Author) commented Nov 3, 2017

Good to go. I'm also curious to see whether it makes a big difference or not. We need to get the benchmarking functionality up and running :p

The documentation on L-BFGS will have to be changed when we implement JuliaNLSolvers/LineSearches.jl#70, as the options related to alphaguess will be removed.

end

# Copy q into s for forward pass
# apply preconditioner if precon != nothing
# (Note: preconditioner update was done outside of this function)
A_ldiv_B!(devec_fun(s), precon, devec_fun(q))

if scaleinvH == true && typeof(precon) <: Void && upper > 0
Contributor:

I never studied this re-scaling. Do I understand correctly that this possibly chooses a different scaling factor at each iteration?

What is the reason to not apply the scaling when you have a preconditioner?

Contributor Author:

> Do I understand correctly that this possibly chooses a different scaling factor at each iteration?

Yes, it may change at each iteration. There's some discussion about it in N+W around eqns 6.21 and 7.20 (2nd ed.). A thorough discussion of this, and other(?), scalings can be found in Gilbert and Lemaréchal, "Some numerical experiments with variable-storage quasi-Newton algorithms".

> What is the reason to not apply the scaling when you have a preconditioner?

If someone implements a preconditioner which is the same as this scaling, then you'd be applying it twice. I originally looked at implementing the scaling in the form of a preconditioner that uses I, but it was quite troublesome with the current API for updating preconditioners.

Contributor:

But suppose I have a preconditioner that gets the metric right, but not the scaling? Specifically, \kappa(P^{-1/2} H P^{-1/2}) = O(1) but \sigma(P^{-1} H) \approx s with s far from 1. Then I can play some tricks to rescale it, but if LBFGS does it for me, then I'll be very happy! (As a matter of fact I have just such a situation.)

So my point is, would it be possible to have scaling turned off by default if a preconditioner is used, but allow the user to turn it on? E.g., in the LBFGS type construction it could just be a kwarg that checks whether a preconditioner is set or not.

@anriseth (Contributor, Author) commented Nov 8, 2017 via email

@anriseth (Contributor, Author) left a review comment:

This is ready, modulo

  • my two questions on view and devec_fun, and
  • benchmarks to determine whether this should be default true or false.

@simd for j in 1:n
@inbounds q[j] -= alpha[i] * dg_history[j, i]
end
dgi = view(dg_history, :, i)
Contributor Author:

Is a view useful at all, or can I just do `q .-= alpha[i] .* dg_history[:,i]`?

Member:

Since we're accessing memory in the right order here, I don't think this should be a problem.

@anriseth (Contributor, Author), Nov 17, 2017:

So should I keep it as is, or go back to `q .-= alpha[i] .* dg_history[:,i]`?

Member:

It's a bit difficult to see on my phone, but I'll have a look later. I doubt this is going to be a bottleneck.

Contributor:

Yes, the `vecdot` is correct. `update_state!` and `twoloop!` only ever see incoming real vectors (the `real_to_complex` calls are there for communication with the rest of the world, e.g. the preconditioner that expects a complex array). You could merge this line with the next one? The `view` is correct in the sense that it will not allocate a new array (as opposed to indexing without `view`). Speed-wise I'm not sure, but it's supposed to improve in the future, so I'd say keep it.

Contributor Author:

Great, thank you for the input.

> You could merge this line with the next one?

True, however, I feel that it is more readable this way (I've changed them all to be consistent locally).
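(For reference, the behaviour under discussion, with throwaway data; a slice copies the column, a view does not:)

```julia
dg_history = rand(5, 3); q = rand(5); alpha = rand(3); i = 2
q .-= alpha[i] .* dg_history[:, i]        # slice allocates a column copy
q .-= alpha[i] .* view(dg_history, :, i)  # view reads the column in place
```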

sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
@. s = scaling*q
Contributor Author:

@antoine-levitt Do I need to do `devec_fun` on `s` and `q` here?

@simd for j in 1:n
@inbounds s[j] += dx_history[j, i] * (alpha[i] - beta)
end
dxi = view(dx_history, :, i)
Contributor Author:

Same question on view as above.

@anriseth (Contributor, Author) commented Nov 17, 2017

I've compared LBFGS with and without `scaleinvH0` for the different line searches, and overall it seems like `scaleinvH0 = true` is the best default choice. It also seems like we can get some benefit from defaulting to BT3, but let's make decisions on the default linesearch and alphaguess later.

The figure below shows scatter plots of different performance measures for LBFGS with each (`scaleinvH0`, `linesearch`) combination, where the objective is to reach the relative tolerance f(x) <= f_min + 1e-10 (f_0 - f_min). There are 16 test problems, taken from my new branch on OptimTestProblems. The dots at the very top of each plot correspond to failures.

[Figure: scatter plots of performance measures for LBFGS with each (scaleinvH0, linesearch) combination; failures shown at the top of each panel]
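(The success criterion above as a one-liner; `converged` is an illustrative name:)

```julia
converged(f_x, f_0, f_min; rtol = 1e-10) = f_x <= f_min + rtol * (f_0 - f_min)
```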

@cortner (Contributor) commented Nov 18, 2017

This is very useful - thank you!

@anriseth (Contributor, Author):
I made some small updates. This is ready from my side.

@antoine-levitt (Contributor):
Just tested this on my problem, because the benchmarks you have converge pretty quickly (also, they have non-integer f/g calls?). I require pretty tight tolerances in my application, 1e-6 on the gradient:

|--------+------------+--------+--------|
| method | scaleinvH0 | fcalls | gcalls |
|--------+------------+--------+--------|
| HZ     | true       |   1027 |   1027 |
| HZ     | false      |    866 |    866 |
| BT     | false      |   2103 |    628 |
| BT     | true       |    372 |    364 |
|--------+------------+--------+--------|

So slight degradation for HagerZhang, large improvement for backtracking. I'm happy to do more tests if you like.

@@ -79,31 +96,38 @@ alphaguess = LineSearches.InitialStatic(),
linesearch = LineSearches.HagerZhang(),
P=nothing,
precondprep = (P, x) -> nothing,
manifold = Flat())
manifold = Flat(),
scaleinvH0::Bool = true && (T <: Void)
Contributor:

What is this `T <: Void`?

Contributor Author:

Ah, good catch. It should be `P::T` above, i.e., we default to `true` if `P = nothing`.

Contributor:

That's not very intuitive; I'd expect `T` to be the ambient scalar type. Why not `true && typeof(P) <: Void`, as in the other place?

Contributor Author:

Yes, that's better.
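(Hypothetical sketch of the resolved default, in Julia 0.6 era syntax where `typeof(nothing) == Void`; `make_lbfgs` is an illustrative stand-in for the `LBFGS` constructor:)

```julia
# Scaling defaults to on only when no preconditioner is supplied;
# the user can still override it either way, as cortner requested.
function make_lbfgs(; P = nothing, scaleinvH0::Bool = typeof(P) <: Void)
    return P, scaleinvH0
end

make_lbfgs()                               # (nothing, true): scaling on
make_lbfgs(P = eye(4))                     # scaling off with a preconditioner
make_lbfgs(P = eye(4), scaleinvH0 = true)  # explicit opt-in override
```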

@anriseth (Contributor, Author):
> Just tested this on my problem, because the benchmarks you have converge pretty quickly (also, they have non-integer f/g calls?).

Thank you for running more tests :) It is definitely valuable.
Non-integer values: my plots report performance ratios, which are functions of the number of f-calls, iterations, etc., to normalize across a collection of test problems. Here's a paper that explains the rationale of this approach: https://arxiv.org/pdf/cs/0102001.pdf
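A minimal sketch of how such profiles are computed (Dolan and Moré performance profiles; `t` and `perf_profile` are illustrative names, current Julia syntax):

```julia
# t[p, s] = cost (e.g. f-calls) of solver s on problem p; Inf marks a failure.
function perf_profile(t::Matrix{Float64}, taus)
    np, ns = size(t)
    best = [minimum(t[p, :]) for p in 1:np]   # best solver per problem
    r = t ./ best                             # performance ratios r[p, s]
    # fraction of problems that solver s solves within a factor tau of the best
    return [count(r[:, s] .<= tau) / np for tau in taus, s in 1:ns]
end
```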

@anriseth (Contributor, Author):
This is ready to merge from my side.

@cortner (Contributor) commented Nov 28, 2017

I look forward to trying this out!

@pkofod (Member) commented Nov 28, 2017

> Just tested this on my problem

What is "your problem" (ELI5), just for reference?

I'll try it out in a package of mine (the optimization part consists of some maximum likelihood problems)

@pkofod merged commit 7cf790d into JuliaNLSolvers:master on Nov 28, 2017
@pkofod (Member) commented Nov 28, 2017

Would love to hear back from all of you @anriseth @cortner @antoine-levitt if you do experiment with this. I'll try it out myself for sure.

@antoine-levitt (Contributor) commented Nov 28, 2017

Essentially, minimizing a Dirichlet energy of a (complex, matrix-valued) function with some orthogonality constraints (so you see what those PRs were about!), the target application being the computation of Wannier functions in solids. The paper is in preparation, so the code is not public yet, but it should be pretty soon (and I can send it privately to anyone interested). The problem is not very well-conditioned (I'm not preconditioning, mainly because I'm lazy), and the difficulty is really the local convergence: I'm already starting in a good place, and just optimizing to get to that local minimum to high precision. I use LBFGS with a lot of history (100). In particular, for line searches I would expect the functional to be very close to quadratic for most of the optimization. In my previous tests I found that LBFGS with Hager-Zhang gave the best results, but with this patch it seems to be backtracking.

@anriseth deleted the lbfgsfixes branch on November 28, 2017
@anriseth (Contributor, Author):
> Would love to hear back from all of you @anriseth @cortner @antoine-levitt if you do experiment with this. I'll try it out myself for sure.

I've attached the performance profiles for the test problems described in this paper. Using `scaleinvH0 = true` helps for all the line searches.
[Figure: performance profiles for LBFGS with and without scaleinvH0, for each line search]

The tolerance is set to \|g(x)\| <= 1e-10 \|g(x_0)\|, and maxiter to 20,000.
The test set contains nonlinear least-squares problems of dimensions 100 to 100,000. Each (problem, dimension) pair is run 100 times with randomized initial conditions (1800 problem instances in total).
