Apply optimizations for handling of condition checks #32123
Conversation
This makes sense to me as a tactical fix for the v1.3 series. Thanks for digging into this!
I left some notes inline about things that are more architectural, so I expect we'll ignore them for the purposes of merging this PR, but I'd like to consider them for follow-up improvements in the v1.4 branch, to hopefully leave a couple fewer potential hazards for future maintenance.
func (n *NodeAbstractResourceInstance) apply(
	ctx EvalContext,
	state *states.ResourceInstanceObject,
	change *plans.ResourceInstanceChange,
	applyConfig *configs.Resource,
	createBeforeDestroy bool) (*states.ResourceInstanceObject, instances.RepetitionData, tfdiags.Diagnostics) {
	keyData instances.RepetitionData,
Technically anything which has access to the EvalContext can get the "repetition data" like this:
exp := ctx.InstanceExpander()
keyData := exp.GetResourceInstanceRepetitionData(n.Addr())
In the original design for this "instance expander" thing it was intended to act as the source of record for all of this expansion-related information specifically so that we wouldn't need to explicitly pass all of this contextual data around.
I do feel in two minds about it because it feels like a violation of the Law of Demeter, but we already widely use ctx as a big bag of shared "global" context, including in this very function, where it uses ctx to do various other things that aren't strictly this function's business to do.
Yes, I left this as close to the current code as possible. I did forget that the rep data was directly available though, because the existing apply code already re-evaluated it anyway. Switching out that code however runs into a "no expansion has been registered" panic, so I think there's something else going on here too.
Hmm yes I suppose the fact that we've been passing it around through arguments meant that there was not previously any need to actually register it at the source of record. Unfortunate but not unexpected.
Hopefully in a future change we can find a suitable central point to evaluate this just once and register it in the expander, so it's clearer exactly whose responsibility it is to evaluate the for_each or count expressions.
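For reference, the registration side that's missing here would look roughly like the sketch below. The helper itself is hypothetical, but evaluateCountExpression, evaluateForEachExpression, and the expander's SetResource* methods are the existing pieces it would be built from, so treat the exact wiring as illustrative only:

// registerExpansion is a hypothetical central point that evaluates count or
// for_each exactly once and records the result in the instance expander, so
// that later calls to GetResourceInstanceRepetitionData don't panic with
// "no expansion has been registered".
func (n *NodeAbstractResource) registerExpansion(ctx EvalContext, moduleAddr addrs.ModuleInstance) tfdiags.Diagnostics {
	var diags tfdiags.Diagnostics
	expander := ctx.InstanceExpander()

	switch {
	case n.Config.Count != nil:
		count, countDiags := evaluateCountExpression(n.Config.Count, ctx)
		diags = diags.Append(countDiags)
		expander.SetResourceCount(moduleAddr, n.Addr.Resource, count)
	case n.Config.ForEach != nil:
		forEach, forEachDiags := evaluateForEachExpression(n.Config.ForEach, ctx)
		diags = diags.Append(forEachDiags)
		expander.SetResourceForEach(moduleAddr, n.Addr.Resource, forEach)
	default:
		expander.SetResourceSingle(moduleAddr, n.Addr.Resource)
	}
	return diags
}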
@@ -69,7 +93,7 @@ func (t *DiffTransformer) Transform(g *Graph) error {
 			// run any condition checks associated with the object, to
 			// make sure that they still hold when considering the
 			// results of other changes.
-			update = true
+			update = t.hasConfigConditions(addr)
I feel weird about making the DiffTransformer specifically aware of the preconditions in the config; this feels like some action at a distance that's likely to trip us up under future maintenance, just as how not having no-op actions in the graph caused us to ship a subtly broken postconditions feature in the first place. 😖
This very direct approach does seem reasonable as a small, surgical fix for backporting into the v1.3 branch, but I wonder if there's a different way we could arrange this in the v1.4 branch (soon after this is merged) so that this will hopefully be less of a maintenance hazard.
One starting idea, though I've not thought deeply about it yet:
- Add yet another optional interface that graph nodes can implement: type GraphNodeDuringApply interface { HasApplyTimeActions() bool }
- In a new transformer after we've done all the ones that insert new nodes, prune from the graph any node which implements this interface and whose HasApplyTimeActions() returns false.
- If even inserting them in the first place is problematic then we could optionally optimize this by having DiffTransformer pre-check for the interface itself and not even insert the node in the first place. But my hope is that this wouldn't be necessary, because we could do the pruning early enough that it's done by the time we get into all of the expensive transformers that do edge analysis.
Then this "are there any conditions?" logic could live in GraphNodeApplyableResourceInstance (and similar), closer to the Execute function that it's acting as a gate for. It would presumably return false unless the action is something other than NoOp or there's at least one condition to check.
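To make that idea a bit more concrete, here's a rough sketch of what the pruning pass might look like; the interface and transformer names are hypothetical per the suggestion above, and only the Vertices/Remove graph helpers are existing API:

// GraphNodeDuringApply is the hypothetical optional interface suggested
// above: a node reports whether it still has any work to do during apply.
type GraphNodeDuringApply interface {
	HasApplyTimeActions() bool
}

// pruneNoApplyActionsTransformer is a sketch of a transformer that would run
// after all of the node-inserting transformers and drop any node that
// declares it has nothing to do at apply time (e.g. a NoOp change with no
// conditions to evaluate).
type pruneNoApplyActionsTransformer struct{}

func (t *pruneNoApplyActionsTransformer) Transform(g *Graph) error {
	for _, v := range g.Vertices() {
		if n, ok := v.(GraphNodeDuringApply); ok && !n.HasApplyTimeActions() {
			g.Remove(v)
		}
	}
	return nil
}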
Since I'm explicitly not suggesting we do this refactoring as a blocker for this PR, happy to discuss this more in a more appropriate place after this is merged. 😀
I tried a few different approaches here, and settled on this config reference, which is only meant as a stop-gap. I think since we do store the plan check results, the results of those could be included in the "diff" that we're handling here. I was running into problems with inconsistent handling of the results, however, because those have separate codepaths from normal changes, and this turned out to be a safer and more reliable failsafe for now. Since unknown checks from planning would also indicate "HasApplyTimeActions", it may be worth seeing if we can incorporate the checks more closely with the plan object so they are always in sync.
If there are no changes, then there is no reason to create an apply graph, since all objects are known. We do however need the walk to match the expected state structure. This is probably only cleanup of empty nested modules and outputs, but some investigation is needed before making the full change. For now we can store the checks from the plan directly into the new state, since the apply walk was overwriting the results we already had.
Ensure that empty check results are normalized in state serialization to prevent unexpected state changes from being written. Because there is no consistent empty, null, and omitempty usage for state structs, there's no good way to create a test which will fail for future additions.
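The shape of that normalization is roughly the following; the type and helper names here are hypothetical stand-ins rather than the PR's actual code, but the idea is that a nil collection and an empty one should serialize identically, so a round-trip doesn't look like a change:

// checkResultSnapshot is a hypothetical stand-in for a per-object check
// result as it appears in the serialized state.
type checkResultSnapshot struct {
	ObjectAddr string
	Status     string
}

// normalizeCheckResults maps a nil slice to an empty one so that a state
// with "no check results" always serializes the same way, regardless of
// whether the in-memory value happened to be nil or empty.
func normalizeCheckResults(results []checkResultSnapshot) []checkResultSnapshot {
	if results == nil {
		return []checkResultSnapshot{}
	}
	return results
}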
Only add NoOp changes to the apply graph if they have conditions which need to be evaluated.
We need to avoid re-writing the state for every NoOp apply. We may still be evaluating the instance to account for any side-effects in the condition checks, however the state of the instance has not changed. Re-writing the state is not a concurrent operation, and it may require encoding a fairly large instance state and re-serializing the entire state blob, so it is best avoided if possible.
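In spirit, the guard is along these lines (a simplified sketch, not the PR's actual code; plans.NoOp is the real action constant, while the helper name is illustrative):

// shouldWriteInstanceState reports whether an instance's state needs to be
// persisted after apply. A NoOp change means the object itself is unchanged;
// the node may still have been evaluated to run its condition checks, but
// there is nothing new to write, so the expensive state update is skipped.
func shouldWriteInstanceState(action plans.Action) bool {
	return action != plans.NoOp
}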
Force-pushed from 3bbf3d7 to efd7715
The earlier branches were no longer needed for this patch, so I rebased this on main in order to merge it for the release.
Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
The handling of conditions during apply can have adverse effects on very large configs, due to the handling of all existing instances even when there are no changes. With extremely large configurations, the execution time can go from a few minutes to many hours for only minor changes.
The first step is to ensure empty CheckResults in the state are normalized to prevent the state from being re-written when there are no semantic differences. Empty and nil slices are not handled consistently overall, but we will use an empty slice in the state zero value to match the other top-level structures.

Next we can skip adding NoOp changes to the apply graph when there are no conditions at all to evaluate from the configuration. The nodes themselves can have significant processing overhead when assembling the graph, especially for steps like recalculating individual instance dependencies.
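Conceptually, the check the DiffTransformer uses for this is a simple config lookup. The sketch below is only an approximation of that helper (the PR's actual implementation may differ), built on the existing DescendentForInstance and ResourceByAddr lookups and the resource's precondition/postcondition lists:

// hasConfigConditions is an approximate sketch: it reports whether the
// configuration declares any preconditions or postconditions for the given
// resource instance, which is the only reason to keep a NoOp change in the
// apply graph.
func (t *DiffTransformer) hasConfigConditions(addr addrs.AbsResourceInstance) bool {
	modCfg := t.Config.DescendentForInstance(addr.Module)
	if modCfg == nil {
		return false
	}
	res := modCfg.Module.ResourceByAddr(addr.Resource.Resource)
	if res == nil {
		return false
	}
	return len(res.Preconditions) > 0 || len(res.Postconditions) > 0
}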
Finally we need to avoid re-planning each instance and re-writing the state for every NoOp change during apply. The whole process has gotten more convoluted with the addition of condition checks, as preconditions are done during the NodeAbstractResourceInstance.plan method, but postconditions are handled outside of the abstract instance in the NodeApplyableResourceInstance.managedResourceExecute method, and repetition data needs to be shared between all these. I pulled the repetition data out of apply to simplify the early return in the short term, but there is definitely some more refactoring to be done in this area.

Fixes #32060
Fixes #32071