Transit: Release locks using defer statements #25336

stevendpclark · 2024-02-09T18:23:25Z

Leverage defer statements to Unlock the fetched policy to avoid issues with forgetting to manually Unlock during each return statement

- Leverage defer statements to Unlock the fetched policy to avoid issues with forgetting to manually Unlock during each return statement

github-actions · 2024-02-09T18:25:08Z

Build Results:
All builds succeeded! ✅

github-actions · 2024-02-09T18:32:34Z

CI Results:
All Go tests succeeded! ✅

sgmiller

Seems correct

mju · 2024-02-09T22:02:21Z

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

stevendpclark · 2024-02-09T22:09:13Z

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

Hi @mju, yes, the fix could have been to just add the missing .Unlock() in the spot that you had identified, thanks by the way for all the investigation efforts! The issue with that fix is it would be easy in the future to introduce a similar bug here or in the other endpoints, forgetting to call Unlock(). If you look at the PR that I had introduced the issue, I just moved the check location and missed the newly required call.

This is a little larger of a fix to prevent a similar issue from occurring in the future across different endpoints of forgetting to explicitly call Unlock().

mju · 2024-02-11T17:09:09Z

I assume only the changes to builtin/logical/transit/path_sign_verify.go are for fixing the bug and other changes are only for refactoring?

Hi @mju, yes, the fix could have been to just add the missing .Unlock() in the spot that you had identified, thanks by the way for all the investigation efforts! The issue with that fix is it would be easy in the future to introduce a similar bug here or in the other endpoints, forgetting to call Unlock(). If you look at the PR that I had introduced the issue, I just moved the check location and missed the newly required call.

This is a little larger of a fix to prevent a similar issue from occurring in the future across different endpoints of forgetting to explicitly call Unlock().

I've been thinking about how to phrase my concerns about how read-write locks are used in the transit code and how to make it better over the past few days. Not as a criticism, but I do think it's better if you can spend days or weeks re-organizing the code. It's way too easy for people to make a mistake modifying this code. It's very difficult to have tests that will warn you about the potential mistakes.

The problem, I think, is that the resource management is not owned by one single piece of code and one single thread at runtime. For example, you now have the following in multiple files.

        p, _, err := b.GetPolicy(...)
        ...
	if !b.System().CachingDisabled() {
		p.Lock(false)
	}
	defer p.Unlock()

GetPolicy() can conditionally return with a write lock acquired. and the caller can then conditionally obtain a read lock on its own. As a result, now the caller needs to defer Unlock unconditionally. This leads to some problems.

The code is hard to trace, which leads to brittleness and maintenance problems
The condition in GetPolicy() and the condition in the caller for the write and read locks need to be mutually-exclusive and cover 100% of the cases so the unconditional Unlock() won't cause problem. And there are multiple pieces of the same client code.

It's very easy for a developer to come in and add one single line to cause a deadlock again down the road.

I am not sure if I am qualified to suggest a fix here. I however feel it'd be way easier if in the code
you isolate the access to the keys to a single piece of code. For example if you have a function to which
all the clients who need to access the keys for reads and writes can offload their work, all the locking
operations will be done in that function.

You can also use a queue (a channel would do?) to take the read and write operations, and make sure there
is only one worker goroutine that works on the operations. When there is a write operation, all other tasks wait.

You do kind of have this code structure already as in lock_manager.go. It's just the "queuing" part leaked
to the client code in a very awkward way in the form of locking and unlocking operation. Taking that control
back, the code will become easier to manage, less error-prone.

stevendpclark · 2024-02-12T14:39:21Z

The problem, I think, is that the resource management is not owned by one single piece of code and one single thread at runtime. For example, you now have the following in multiple files.

@mju Completely agreed on the points you have raised. We do have an issue raised for this internally to evaluate how to simplify this code. Obviously for a bugfix, that was going to be backported across stable releases, I didn't want to refactor the code to that degree.

At this point though I can't say when we would get to it, but it certainly is on our radar.

* Transit: Release locks using defer statements - Leverage defer statements to Unlock the fetched policy to avoid issues with forgetting to manually Unlock during each return statement * Add cl

Transit: Release locks using defer statements

6043121

- Leverage defer statements to Unlock the fetched policy to avoid issues with forgetting to manually Unlock during each return statement

stevendpclark added bug Used to indicate a potential bug secret/transit backport/1.13.x labels Feb 9, 2024

stevendpclark added this to the 1.13.14 milestone Feb 9, 2024

stevendpclark self-assigned this Feb 9, 2024

stevendpclark requested a review from a team as a code owner February 9, 2024 18:23

github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Feb 9, 2024

Add cl

40330b2

sgmiller approved these changes Feb 9, 2024

View reviewed changes

stevendpclark merged commit 7463055 into main Feb 9, 2024
31 checks passed

stevendpclark deleted the stevendpclark/vault-23960-transit-error-deadlock branch February 9, 2024 19:06

stevendpclark mentioned this pull request Jul 3, 2024

High CPU usage after upgrade to version 1.15.5 #26036

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transit: Release locks using defer statements #25336

Transit: Release locks using defer statements #25336

stevendpclark commented Feb 9, 2024

github-actions bot commented Feb 9, 2024

github-actions bot commented Feb 9, 2024

sgmiller left a comment

mju commented Feb 9, 2024

stevendpclark commented Feb 9, 2024 •

edited

Loading

mju commented Feb 11, 2024

stevendpclark commented Feb 12, 2024

Transit: Release locks using defer statements #25336

Transit: Release locks using defer statements #25336

Conversation

stevendpclark commented Feb 9, 2024

github-actions bot commented Feb 9, 2024

github-actions bot commented Feb 9, 2024

sgmiller left a comment

Choose a reason for hiding this comment

mju commented Feb 9, 2024

stevendpclark commented Feb 9, 2024 • edited Loading

mju commented Feb 11, 2024

stevendpclark commented Feb 12, 2024

stevendpclark commented Feb 9, 2024 •

edited

Loading