Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transit: Release locks using defer statements #25336

Merged

Conversation

stevendpclark
Copy link
Contributor

  • Leverage defer statements to Unlock the fetched policy to avoid issues with forgetting to manually Unlock during each return statement

Fixes: #25325

 - Leverage defer statements to Unlock the fetched policy
   to avoid issues with forgetting to manually Unlock during
   each return statement
@stevendpclark stevendpclark added bug Used to indicate a potential bug secret/transit backport/1.13.x labels Feb 9, 2024
@stevendpclark stevendpclark added this to the 1.13.14 milestone Feb 9, 2024
@stevendpclark stevendpclark self-assigned this Feb 9, 2024
@stevendpclark stevendpclark requested a review from a team as a code owner February 9, 2024 18:23
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Feb 9, 2024
Copy link

github-actions bot commented Feb 9, 2024

Build Results:
All builds succeeded! ✅

Copy link

github-actions bot commented Feb 9, 2024

CI Results:
All Go tests succeeded! ✅

Copy link
Collaborator

@sgmiller sgmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems correct

@mju
Copy link

mju commented Feb 9, 2024

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

@stevendpclark
Copy link
Contributor Author

stevendpclark commented Feb 9, 2024

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

Hi @mju, yes, the fix could have been to just add the missing .Unlock() in the spot that you had identified, thanks by the way for all the investigation efforts! The issue with that fix is it would be easy in the future to introduce a similar bug here or in the other endpoints, forgetting to call Unlock(). If you look at the PR that I had introduced the issue, I just moved the check location and missed the newly required call.

This is a little larger of a fix to prevent a similar issue from occurring in the future across different endpoints of forgetting to explicitly call Unlock().

@mju
Copy link

mju commented Feb 11, 2024

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

Hi @mju, yes, the fix could have been to just add the missing .Unlock() in the spot that you had identified, thanks by the way for all the investigation efforts! The issue with that fix is it would be easy in the future to introduce a similar bug here or in the other endpoints, forgetting to call Unlock(). If you look at the PR that I had introduced the issue, I just moved the check location and missed the newly required call.

This is a little larger of a fix to prevent a similar issue from occurring in the future across different endpoints of forgetting to explicitly call Unlock().

I've been thinking about how to phrase my concerns about how read-write locks are used in the transit code and how to make it better over the past few days. Not as a criticism, but I do think it's better if you can spend days or weeks re-organizing the code. It's way too easy for people to make a mistake modifying this code. It's very difficult to have tests that will warn you about the potential mistakes.

The problem, I think, is that the resource management is not owned by one single piece of code and one single thread at runtime. For example, you now have the following in multiple files.

        p, _, err := b.GetPolicy(...)
        ...
	if !b.System().CachingDisabled() {
		p.Lock(false)
	}
	defer p.Unlock()

GetPolicy() can conditionally return with a write lock acquired. and the caller can then conditionally obtain a read lock on its own. As a result, now the caller needs to defer Unlock unconditionally. This leads to some problems.

  1. The code is hard to trace, which leads to brittleness and maintenance problems
  2. The condition in GetPolicy() and the condition in the caller for the write and read locks need to be mutually-exclusive and cover 100% of the cases so the unconditional Unlock() won't cause problem. And there are multiple pieces of the same client code.

It's very easy for a developer to come in and add one single line to cause a deadlock again down the road.

I am not sure if I am qualified to suggest a fix here. I however feel it'd be way easier if in the code
you isolate the access to the keys to a single piece of code. For example if you have a function to which
all the clients who need to access the keys for reads and writes can offload their work, all the locking
operations will be done in that function.

You can also use a queue (a channel would do?) to take the read and write operations, and make sure there
is only one worker goroutine that works on the operations. When there is a write operation, all other tasks wait.

You do kind of have this code structure already as in lock_manager.go. It's just the "queuing" part leaked
to the client code in a very awkward way in the form of locking and unlocking operation. Taking that control
back, the code will become easier to manage, less error-prone.

@stevendpclark
Copy link
Contributor Author

The problem, I think, is that the resource management is not owned by one single piece of code and one single thread at runtime. For example, you now have the following in multiple files.

@mju Completely agreed on the points you have raised. We do have an issue raised for this internally to evaluate how to simplify this code. Obviously for a bugfix, that was going to be backported across stable releases, I didn't want to refactor the code to that degree.

At this point though I can't say when we would get to it, but it certainly is on our radar.

Monkeychip pushed a commit that referenced this pull request Feb 12, 2024
* Transit: Release locks using defer statements

 - Leverage defer statements to Unlock the fetched policy
   to avoid issues with forgetting to manually Unlock during
   each return statement

* Add cl
Monkeychip pushed a commit that referenced this pull request Feb 12, 2024
* Transit: Release locks using defer statements

 - Leverage defer statements to Unlock the fetched policy
   to avoid issues with forgetting to manually Unlock during
   each return statement

* Add cl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed secret/transit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Deadlock] Hanging when reading a transit key
3 participants