core: Don't try to register separate checkable resources for each module instance #31860

Merged (3 commits) on Sep 26, 2022

Conversation

@apparentlymart (Contributor) commented on Sep 23, 2022

Over in #31846 we learned that the logic for registering the "checkable object instances" for resources with custom condition checks was running once per containing module instance, rather than once total to cover all instances of a resource across all module instances.

The checks infrastructure expects to be told the instances for each checkable object exactly once so that we can ensure it always either has complete information or no information. This is part of how we guarantee that we'll either always return the full set of instances for a checkable object or mark the check as "error" so that check-presenting UIs can produce a consistent set of instances on each run.

Unfortunately our previous approach to handling the two-level expansion of resources inside modules was to use two levels of DynamicExpand, which is a special inversion-of-control technique we use as part of the concurrent graph walk. That meant that there was no place in the code which had access to all of the instances of a particular resource at once.

It turns out that the double-DynamicExpand isn't really necessary here, and was largely just a convenience when we were implementing multi-instance modules to avoid disrupting the existing resource expansion logic too much. The module expansion logic is all just local computation anyway, and it was holding a lock on the state for its duration, so it was getting no material benefit from being called concurrently by the graph walker.

To correct the bug, this reworks the logic to just be one big DynamicExpand that uses normal straight-through code to gather all of the instances of a resource together into a dynamic subgraph, and register them with the checks system. This makes the control flow clearer while also ensuring that by the end of this function we know exactly which set of resource instances we're expecting.
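For concreteness, here is a small, self-contained Go sketch of that idea (the types and names below are illustrative stand-ins, not Terraform's actual internals): every instance of a resource across every module instance is gathered in one pass, and the checks system is told about the complete set in a single call.

```go
package main

import "fmt"

// Toy sketch of the fix: collect all instances across all module instances
// first, then report the complete set to the checks system exactly once.

type instanceAddr struct {
	module   string
	resource string
	index    int
}

type checkState struct {
	reported map[string][]instanceAddr
}

// ReportCheckableObjects must be called once per checkable resource with the
// complete instance set, so a check report is always either complete or absent.
func (c *checkState) ReportCheckableObjects(resource string, insts []instanceAddr) {
	if _, ok := c.reported[resource]; ok {
		// This is the situation the old per-module-instance registration hit.
		panic("duplicate checkable objects report for " + resource)
	}
	c.reported[resource] = insts
}

func main() {
	moduleInstances := []string{"module.a[0]", "module.a[1]"} // count = 2 on the module
	resourceCount := 2                                        // count = 2 on the resource

	checks := &checkState{reported: map[string][]instanceAddr{}}

	// Single expansion pass: expand module instances and resource instances
	// together, then report once, rather than once per module instance.
	var all []instanceAddr
	for _, mod := range moduleInstances {
		for i := 0; i < resourceCount; i++ {
			all = append(all, instanceAddr{module: mod, resource: "null_resource.example", index: i})
		}
	}
	checks.ReportCheckableObjects("null_resource.example", all)

	fmt.Printf("registered %d instances in one call\n", len(all))
}
```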


I had originally intended to find a less invasive fix suitable for backporting into the v1.3 branch, but the graph walker's inversion of control meant that there was previously no reasonable, correct place to call ReportCheckableObjects from.

To minimize the risk of such a refactoring, I made an effort to modify the actual logic as little as possible: what were previously the Execute and DynamicExpand methods of NodePlannableResource are now nodeExpandPlannableResource.expandResourceInstances and nodeExpandPlannableResource.resourceInstanceSubgraph respectively, chained together as direct method calls rather than indirectly via an intermediate dynamic subgraph.

Moving some logic that was previously in an Execute method to instead be in a DynamicExpand method exposed a logic error in the graph walker's handling of DynamicExpand which I corrected here in the initial commit.

This refactoring did not regress any existing tests except the ones that were directly testing NodePlannableResource, which are now removed altogether because that graph node type no longer exists. I added a new test to cover the resource-related portion of #31846.

Issue #31846 actually seems to capture two separate but related bugs; the other is about output values. This PR does not attempt to address the output value bug, since its root cause seems unrelated, and I intend to tackle it in a separate PR.

We were previously _trying_ to handle diagnostics here but were not quite
doing it right because we were testing whether the resulting error was
nil rather than appending it to the diagnostics and then seeing if the
result has errors.

The difference here is important because it allows DynamicExpand to return
warnings without associated errors when needed. Previously the graph
walker would treat a warnings-only result as if it were an error.

Ideally we'd change DynamicExpand to return diagnostics directly, but we
previously decided against that because there were so many implementors
to update, and my intent for this change is to be surgical in the update
so we minimize risk of backporting the change into patch releases.
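The difference is easier to see in a small, self-contained sketch (the diagnostics types below are simplified stand-ins for a tfdiags-style API, not the real implementation): a warnings-only result still surfaces as a non-nil error, so the call site has to fold it back into the accumulated diagnostics and check HasErrors, rather than aborting on any non-nil error.

```go
package main

import "fmt"

type diag struct {
	isError bool
	summary string
}

type diagnostics []diag

// warningsError wraps a warnings-only result, standing in for what an
// ErrWithWarnings-style helper would return.
type warningsError struct{ diags diagnostics }

func (e warningsError) Error() string { return fmt.Sprintf("%d warning(s)", len(e.diags)) }

// Append folds an error back into the diagnostics, preserving warning
// severity when the error only wraps warnings.
func (d diagnostics) Append(err error) diagnostics {
	if err == nil {
		return d
	}
	if we, ok := err.(warningsError); ok {
		return append(d, we.diags...)
	}
	return append(d, diag{isError: true, summary: err.Error()})
}

func (d diagnostics) HasErrors() bool {
	for _, item := range d {
		if item.isError {
			return true
		}
	}
	return false
}

// dynamicExpand stands in for a DynamicExpand implementation whose subgraph
// was built successfully but which wants to surface a warning.
func dynamicExpand() (string, error) {
	warn := diagnostics{{isError: false, summary: "deprecated argument"}}
	return "expanded subgraph", warningsError{diags: warn}
}

func main() {
	g, err := dynamicExpand()

	// Old handling (buggy): any non-nil error aborts the walk, so a
	// warnings-only result was treated like a failure.
	abortedOld := err != nil

	// New handling: append the error to the accumulated diagnostics, then
	// abort only if the result actually contains errors.
	var diags diagnostics
	diags = diags.Append(err)
	abortedNew := diags.HasErrors()

	fmt.Printf("graph %q: old abort=%v, new abort=%v\n", g, abortedOld, abortedNew)
}
```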
@apparentlymart added the bug, core, and 1.3-backport labels on Sep 23, 2022 (1.3-backport: if added to a PR before merging, backport-assistant will open a new PR once merged)
@apparentlymart added this to the v1.3.1 milestone on Sep 23, 2022
@apparentlymart self-assigned this on Sep 23, 2022
We previously did two levels of DynamicExpand to go from ConfigResource to
AbsResource and then from AbsResource to AbsResourceInstance.

We'll now do the full expansion from ConfigResource to AbsResourceInstance
in a single DynamicExpand step inside nodeExpandPlannableResource.

The new approach is functionally equivalent to the old, except
that it fixes a bug in the previous implementation: we will now call
checkState.ReportCheckableObjects only once for the entire set of
instances for a particular resource, which is what the checkable objects
infrastructure expects so that it can always mention all of the checkable
objects in the check report even if we bail out partway through due to
a downstream error.

This is essentially the same code, but now turned into additional methods
on nodeExpandPlannableResource instead of having the extra graph node
type. This has the further advantage that the logic is now straight-through
code with standard control flow, instead of the unusual inversion of
control we had before, bouncing in and out of different Execute and
DynamicExpand implementations to get this done.

Diff context:

func (n *NodePlannableResource) Name() string {
	return n.Addr.String()
}

return &g, diags.ErrWithWarnings()

Member:

Because the process here combines multiple distinct graphs, the resulting graph often won't have a single root. The single root probably doesn't matter in practice (it's vestigial from very early Terraform), but it does show that we're not actually validating the graph returned from DynamicExpand. These have always been very simple and had no way to introduce cycles, but since you're already touching the DynamicExpand call site it might be worth mentioning.

Contributor Author:

Hmm indeed... I typically think of these DynamicExpand graphs as just an over-engineered way to iterate over the multiple expanded items, but you're right that the surrounding infrastructure at least thinks it's dealing with single-rooted acyclic graphs even though these graphs never contain any real dependency edges in practice.

I'm going to first see if there's a straightforward way to add an additional root node just to maintain the contract, but if that isn't easy and low-risk then I'll find a suitable place to add a comment describing this hazard.

Member:

Yeah, these have definitely been reduced to an obfuscated iteration, and in other cases we've removed the graph builder entirely. It does still have to get fed into the graph walker for concurrent operations, but maybe in the future we get rid of these last few specialized DynamicExpand graph transformers.

Contributor Author:

While investigating this I noticed two interesting things:

  • Because the test case I added here has only one instance of one resource in the module, the root transformer didn't add an extra root node anyway because that single resource instance was already the root.
  • After adding another resource to force the addition of a root, I noticed that we do actually still end up with only one root because all of the root nodes are equal to one another as currently implemented: the node type is an empty struct and we use a value of that type (rather than a pointer) as the node. Therefore it naturally coalesces together into a single node anyway.

With those two observations in mind, I'm going to add a comment to the RootTransformer that records that all root node values being equal to each other is part of its contract and leave it at that for now. If we want to do something more involved later on then of course we can, but the current code is generating a valid graph already and I don't want to add more complexity to it just for a bug fix.
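A minimal, self-contained sketch of that coalescing behavior (using a hypothetical graph type, not Terraform's internal/dag package): vertices are deduplicated by value equality, so repeated empty-struct root values collapse into a single vertex, whereas pointer-typed nodes stay distinct per allocation.

```go
package main

import "fmt"

type rootNode struct{} // value-typed, empty: all values compare equal

type resourceNode struct{ addr string } // pointer-typed nodes are distinct per allocation

type graph struct {
	vertices map[interface{}]struct{}
}

func (g *graph) Add(v interface{}) {
	if g.vertices == nil {
		g.vertices = map[interface{}]struct{}{}
	}
	g.vertices[v] = struct{}{} // equal values coalesce into a single map entry
}

func main() {
	var g graph

	// Subsuming two per-module subgraphs, each of which added its own root:
	// the second Add is a no-op because rootNode{} == rootNode{}.
	g.Add(rootNode{})
	g.Add(rootNode{})

	// By contrast, two pointer nodes with identical contents remain separate.
	g.Add(&resourceNode{addr: "null_resource.a[0]"})
	g.Add(&resourceNode{addr: "null_resource.a[0]"})

	fmt.Println("vertices:", len(g.vertices)) // 3: one coalesced root, two pointer nodes
}
```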

jbardin (Member), Sep 26, 2022:

If there are multiple levels of expansion, you still end up with multiple roots (I verified by calling Validate on the expanded subgraphs, which would error with "multiple roots"). I'm not saying this is necessarily something which needs to be fixed here, but we definitely don't always get a single root.

Contributor Author:

Hmm interesting... I wonder why that isn't reproducing on my end. It's working okay for me with the test case updated to generate multiple instances in each module. 🤔

I'll poke at it some more and see if I can get it to fail in the way you're describing or explain why it doesn't before merging.

Contributor Author:

Ahh, I see what the problem is: it's actually related to the first observation I made above.

If there is only one resource instance in a particular module instance then RootTransformer won't insert a new root node at all, because there's already only one node in the graph. But that node isn't a root node, so when we subsume the graphs all together we end up with just a pile of resource instances and no single shared root, which therefore fails validation.

I managed to defeat my own reproduction: I had already changed the test to generate multiple resource instances per resource before I tried calling .Validate on the result, which masked the problem. 🤦

It does seem to work regardless, so I'm going to go ahead with this change as a minimal v1.3 backport, but I'll open a follow-up PR (intended only for the main branch, not for backporting) to actually validate the DynamicExpand results and make sure it does indeed produce valid graphs in all cases.
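For reference, a small, self-contained Go sketch of the hazard described here (hypothetical types, not Terraform's internal/dag package): a single-node subgraph gets no explicit root added, so once several such subgraphs are subsumed together the merged graph can end up with more than one root and would fail a Validate-style check.

```go
package main

import "fmt"

type graph struct {
	nodes    map[string]bool
	incoming map[string]int // number of edges pointing at each node
}

func newGraph() *graph {
	return &graph{nodes: map[string]bool{}, incoming: map[string]int{}}
}

func (g *graph) add(n string) { g.nodes[n] = true }

func (g *graph) connect(from, to string) {
	g.add(from)
	g.add(to)
	g.incoming[to]++
}

// addRoot mimics a RootTransformer that skips adding a root node when the
// graph already contains exactly one node (the behavior discussed above).
func (g *graph) addRoot() {
	if len(g.nodes) == 1 {
		return
	}
	var existing []string
	for n := range g.nodes {
		existing = append(existing, n)
	}
	for _, n := range existing {
		g.connect("root", n)
	}
}

// subsume copies another graph's nodes and edges into g.
func (g *graph) subsume(other *graph) {
	for n := range other.nodes {
		g.add(n)
	}
	for n, count := range other.incoming {
		g.incoming[n] += count
	}
}

// roots returns every node that nothing points at; a valid merged graph
// should have exactly one.
func (g *graph) roots() []string {
	var out []string
	for n := range g.nodes {
		if g.incoming[n] == 0 {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// A module instance containing a single resource instance: no root added.
	sub1 := newGraph()
	sub1.add("module.a.null_resource.x[0]")
	sub1.addRoot()

	// A module instance containing two resource instances: a root is added.
	sub2 := newGraph()
	sub2.add("module.b.null_resource.x[0]")
	sub2.add("module.b.null_resource.x[1]")
	sub2.addRoot()

	merged := newGraph()
	merged.subsume(sub1)
	merged.subsume(sub2)

	// Two roots remain: the lone instance from sub1 and the root from sub2.
	fmt.Println("roots in merged graph:", merged.roots())
}
```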

We use a non-pointer value for this particular node, which means that
there can never be two root nodes in the same graph: the graph
implementation will just coalesce them together when a second one is added.

Our resource expansion code is relying on that coalescing so that it can
subsume together multiple graphs for different module instances into a
single mega-graph with all instances across all module instances, with
any root nodes coalescing together to produce a single root.

This also updates one of the context tests that exercises resource
expansion so that it will generate multiple resource instance nodes per
module and thus potentially have multiple roots to coalesce together.
However, we aren't currently explicitly validating the return values from
DynamicExpand and so this test doesn't actually fail if the coalescing
doesn't happen. We may choose to validate the DynamicExpand result in a
later commit in order to make it more obvious if future modifications fail
to uphold this invariant.
@eczy-tempus:
This branch is confirmed working in several locations where we were previously seeing duplicate checkable objects report... errors for resources. Thank you for addressing this so quickly!

@github-actions:

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

mariuskiessling added a commit to metro-digital/terraform-google-cf-projectcfg that referenced this pull request Sep 29, 2022
This change is required because Terraform v1.3.0 was affected by a bug
that caused the module to crash Terraform (see
hashicorp/terraform#31860 for more details).

In addition, we now finally have a version constraint for the bucket
module.
@github-actions:

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators on Oct 27, 2022