L-BFGS: Use scaled initial guess for inverse Hessian #484
Conversation
We should probably benchmark the scaling vs. no scaling.
Codecov Report

@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   90.03%   90.07%   +0.04%
==========================================
  Files          35       35
  Lines        1796     1804       +8
==========================================
+ Hits         1617     1625       +8
  Misses        179      179
sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
s .= scaling*q
`s .= scaling.*q`
to avoid allocating a temporary for `scaling*q`
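A minimal sketch of the allocation difference (array sizes and names illustrative):

```julia
q = rand(100); s = similar(q); scaling = 0.5

s .= scaling * q    # scaling*q allocates a temporary vector, then copies it into s
s .= scaling .* q   # fused broadcast: writes directly into s, no temporary
```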
Good catch :)
extrapolation step used to determine the initial step-size for the line search
step. The `snap2one` tuple provides boundaries to an interval in which the
initial step-size should be mapped to the unit step-size.
Note that `extrapolate=true` has higher priority than `resetalpha=true` (where resetalpha |
is "precedence" a more precise word than "priority"?
Yep
Cool stuff. Curious to see if it makes a difference!
Good to go. I'm also curious to see whether it makes a big difference or not. We need to get the benchmarking functionality up and running :p The documentation on L-BFGS will have to be changed when we implement JuliaNLSolvers/LineSearches.jl#70, as the options related to `alphaguess` will be removed.
end

# Copy q into s for forward pass
# apply preconditioner if precon != nothing
# (Note: preconditioner update was done outside of this function)
A_ldiv_B!(devec_fun(s), precon, devec_fun(q))

if scaleinvH == true && typeof(precon) <: Void && upper > 0
I never studied this re-scaling. Do I understand correctly that this possibly chooses a different scaling factor at each iteration?
What is the reason to not apply the scaling when you have a preconditioner?
> Do I understand correctly that this possibly chooses a different scaling factor at each iteration?

Yes, it may change at each iteration. There's some discussion of it in N&W around eqns 6.21 and 7.20 (2nd ed.). A thorough discussion of this (and other?) scalings can be found in Gilbert and Lemaréchal, "Some numerical experiments with variable-storage quasi-Newton algorithms".

> What is the reason to not apply the scaling when you have a preconditioner?

If someone implements a preconditioner which is the same as this scaling, then you'd be applying it twice. I originally looked at implementing the scaling in the form of a preconditioner that uses `I`, but it was quite troublesome with the current API for updating preconditioners.
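For reference, a minimal sketch of that scaling, assuming the column-major history arrays from the diff above (`initial_invH_scaling` is an illustrative name, not a function in this PR):

```julia
using LinearAlgebra  # for dot (in Base on the Julia 0.6-era versions this PR targets)

# γ_k = s_{k-1}ᵀ y_{k-1} / (y_{k-1}ᵀ y_{k-1}), with H_k⁰ = γ_k I  (N&W 2nd ed., eq. 7.20)
function initial_invH_scaling(dx_history::AbstractMatrix, dg_history::AbstractMatrix, idx::Int)
    s = view(dx_history, :, idx)  # most recent step s_{k-1}
    y = view(dg_history, :, idx)  # most recent gradient difference y_{k-1}
    return dot(s, y) / sum(abs2, y)
end
```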
But suppose I have a preconditioner that gets the metric right, but not the scaling? Specifically, \kappa(P^{-1/2} H P^{-1/2}) = O(1) but \sigma(P^{-1} H) \approx s with s far from 1. Then I can play some tricks to rescale it, but if LBFGS does it for me, then I'll be very happy! (As a matter of fact, I have just such a situation.)

So my point is: would it be possible to have scaling turned off by default if a preconditioner is used, but allow the user to turn it on? E.g., in the LBFGS type construction it could just be a kwarg that checks whether a preconditioner is set or not.
Ah, yes good point. I'll make the change.
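A sketch of what that kwarg default could look like (`SketchLBFGS` is a simplified stand-in, not the actual constructor):

```julia
struct SketchLBFGS{Tp}
    P::Tp               # preconditioner, or nothing
    scaleinvH0::Bool    # scale the initial inverse Hessian guess?
end

# Scaling defaults to on only when no preconditioner is supplied,
# but the caller can override it either way.
SketchLBFGS(; P = nothing, scaleinvH0::Bool = P === nothing) = SketchLBFGS(P, scaleinvH0)
```

So `SketchLBFGS()` enables the scaling, `SketchLBFGS(P = some_precon)` disables it, and passing `scaleinvH0 = true` alongside a preconditioner opts back in.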
This is ready, modulo

- my two questions on `view` and `devec_fun`, and
- benchmarks to determine whether this should default to `true` or `false`.
@simd for j in 1:n
    @inbounds q[j] -= alpha[i] * dg_history[j, i]
end
dgi = view(dg_history, :, i)
Is a `view` useful at all, or can I just do `q .-= alpha[i] .* dg_history[:,i]`?
Since we're accessing memory in the right order here, I don't think this should be a problem.
So should I keep it as is, or go back to `q .-= alpha[i] .* dg_history[:,i]`?
It's a bit difficult to see on my phone, but I'll have a look later. I doubt this is going to be a bottleneck.
Yes, the `vecdot` is correct. `update_state!` and `twoloop!` only ever see incoming real vectors (the `real_to_complex` calls are there for communication with the rest of the world, e.g. the preconditioner, which expects a complex array). You could merge this line with the next one? The `view` is correct in the sense that it will not allocate a new array (as opposed to slicing without `view`). Speed-wise, I'm not sure, but `view` performance is supposed to improve in the future, so I'd say keep it.
Great, thank you for the input.
> You could merge this line with the next one?

True; however, I feel that it is more readable this way (I've changed them all to be consistent locally).
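To summarize the allocation behavior discussed here, a small sketch (sizes illustrative):

```julia
dg_history = rand(1000, 10); q = rand(1000); alpha = rand(10); i = 3

q .-= alpha[i] .* dg_history[:, i]  # slicing copies the column: one temporary allocation
dgi = view(dg_history, :, i)        # a view wraps the column without copying
q .-= alpha[i] .* dgi               # fused broadcast over the view: no temporary
```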
sidx = view(dx_history, :, idx) # TODO: should these use devec_fun?
yidx = view(dg_history, :, idx) # TODO: should these use devec_fun?
scaling = dot(sidx, yidx) / sum(abs2, yidx) # TODO: Should this be vecdot?
@. s = scaling*q
@antoine-levitt Do I need to do `devec_fun` on `s` and `q` here?
@simd for j in 1:n
    @inbounds s[j] += dx_history[j, i] * (alpha[i] - beta)
end
dxi = view(dx_history, :, i)
Same question on `view` as above.
I've compared LBFGS with and without the scaling. The figure below shows a scatter plot with different performance measures for LBFGS with combinations of the new settings. [scatter plot omitted]
This is very useful - thank you!
I made some small updates. This is ready from my side.
Just tested this on my problem, because the benchmarks you have converge pretty quickly (also, they have non-integer f/g calls?). I require pretty tight tolerances in my application, 1e-6 on the gradient. [timing results omitted]

So: slight degradation for HagerZhang, large improvement for backtracking. I'm happy to do more tests if you like.
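For context, a hedged sketch of that kind of run (objective and starting point illustrative; the `scaleinvH0` keyword follows the diff below):

```julia
using Optim, LineSearches

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
x0 = zeros(2)

res = optimize(rosenbrock, x0,
               LBFGS(linesearch = LineSearches.BackTracking(), scaleinvH0 = true),
               Optim.Options(g_tol = 1e-6))  # tight gradient tolerance, as above
```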
@@ -79,31 +96,38 @@
 alphaguess = LineSearches.InitialStatic(),
 linesearch = LineSearches.HagerZhang(),
 P = nothing,
 precondprep = (P, x) -> nothing,
-manifold = Flat())
+manifold = Flat(),
+scaleinvH0::Bool = true && (T <: Void)
What is this `T <: Void`?
Ah, good catch. It should be `P::T` above, i.e., we default to `true` if `P = nothing`.
That's not very intuitive; I'd expect `T` to be the ambient scalar type. Why not `true && typeof(P) <: Void`, as in the other place?
Yes, that's better.
Thank you for running more tests :) It is definitely valuable.
This is ready to merge from my side.
I look forward to trying this out!
what is "your problem" (ELI5) just for reference? I'll try it out in a package of mine (the optimization part consists of some maximum likelihood problems) |
Would love to hear back from all of you @anriseth @cortner @antoine-levitt if you do experiment with this. I'll try it out myself for sure.
Essentially, minimizing a Dirichlet energy of a (complex, matrix-valued) function with some orthogonality constraints (so you see what those PRs were about!), the target application being the computation of Wannier functions in solids. The paper is in preparation, so the code is not public yet, but it should be pretty soon (and I can send it privately to anyone interested). The problem is not very well-conditioned (I'm not preconditioning, mainly because I'm lazy) and the difficulty is really the local convergence (I'm already starting in a good place, and just optimizing to get to that local minimum to high precision). I use LBFGS with a lot of history (100). In particular, for line searches I would expect the functional to be very close to quadratic for most of the optimization. In my previous tests I found LBFGS with Hager-Zhang gave the best results, but with this patch it seems to be backtracking.
I've attached the performance profiles of the test problems described in this paper. [performance profiles omitted]
If a preconditioner is not used, I propose that we default to the scaled initial guess for the inverse Hessian as given in Nocedal and Wright (see eq. 6.21 or 7.20 in the second edition).
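For reference, the scaled initial guess being proposed:

```latex
H_k^0 = \gamma_k I, \qquad
\gamma_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}, \qquad
s_{k-1} = x_k - x_{k-1}, \quad y_{k-1} = \nabla f_k - \nabla f_{k-1}
```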
I have also included a `resetalpha` setting for L-BFGS. The default should probably be that `alphaguess!` sets `state.alpha = 1` for all line searches. None of the algorithms I know of uses the previous step length as an initial guess anyway. (Maybe in the case of Momentum and AcceleratedGradientDescent, but they need a revamp of `alphaguess!` to work properly anyway.) And as the state is now, BackTracking will perform terribly, as it can only decrease step lengths.
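A minimal sketch of the reset behavior described above (`initial_step` is an illustrative helper, not the actual `alphaguess!` code):

```julia
# With a reset, each line search starts from the unit step; otherwise the
# previously accepted step length is reused. BackTracking only ever shrinks
# its starting value, so without a reset it can never grow back toward 1.
initial_step(prev_alpha::T; resetalpha::Bool = true) where {T<:Real} =
    resetalpha ? one(T) : prev_alpha

initial_step(1e-3)                      # 1.0: fresh unit step for the next line search
initial_step(1e-3, resetalpha = false)  # 0.001: BackTracking can only shrink from here
```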