Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add log and metrics to workflow termination events #6146

Conversation

fimanishi
Copy link
Member

@fimanishi fimanishi commented Jun 24, 2024

What changed?
Added logs to workflow termination events, containing the reason for the termination, tagged with the domainName, workflowID, terminationReason, and runID.
Added metrics to workflow termination events, using a counter per domain WorkflowTerminateCounterPerDomain under the HistoryTerminateWorkflowExecutionScope scope.

Why?
Improve workflow termination visibility, allowing Cadence and clients to easily find terminated workflows. This is particularly important to provide better information for workflows terminated during failovers.

How did you test it?
Unit tests.

Potential risks
The risks are associated with the changes in functions parameters being passed. Need to ensure that the parameters are correct and that they do not contain nil values.

Release notes

Documentation Changes

@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch 3 times, most recently from 197331d to f3a7f5d Compare June 25, 2024 05:13
@coveralls
Copy link

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 01904dd0-8aa3-4316-a934-ce3147335fd0

Details

  • 63 of 66 (95.45%) changed or added relevant lines in 12 files are covered.
  • 39 unchanged lines in 11 files lost coverage.
  • Overall coverage decreased (-0.04%) to 71.512%

Changes Missing Coverage Covered Lines Changed/Added Lines %
service/history/task/transfer_active_task_executor.go 0 3 0.0%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 2 88.06%
common/peerprovider/ringpopprovider/config.go 2 81.58%
common/task/fifo_task_scheduler.go 2 83.51%
common/util.go 2 91.84%
service/history/queue/transfer_queue_processor.go 2 56.49%
common/persistence/nosql/nosql_task_store.go 3 85.52%
service/history/task/transfer_active_task_executor.go 3 72.63%
common/persistence/statsComputer.go 3 98.21%
service/history/task/transfer_standby_task_executor.go 6 86.94%
service/history/task/fetcher.go 6 85.05%
Totals Coverage Status
Change from base Build 01904bcc-3b56-4d05-8b48-d9bec7a9b7ea: -0.04%
Covered Lines: 106766
Relevant Lines: 149299

💛 - Coveralls

@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch 3 times, most recently from 5329e75 to 08460ec Compare June 25, 2024 18:35
@coveralls
Copy link

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 019050ae-cea2-4903-9de4-b1001335191c

Details

  • 65 of 68 (95.59%) changed or added relevant lines in 12 files are covered.
  • 54 unchanged lines in 14 files lost coverage.
  • Overall coverage decreased (-0.03%) to 71.509%

Changes Missing Coverage Covered Lines Changed/Added Lines %
service/history/task/transfer_active_task_executor.go 0 3 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/task/transfer_active_task_executor.go 1 72.76%
common/task/weighted_round_robin_task_scheduler.go 2 89.05%
service/matching/tasklist/db.go 2 73.23%
common/task/fifo_task_scheduler.go 2 85.57%
common/util.go 2 91.84%
service/history/task/task.go 3 84.81%
common/log/tag/tags.go 3 50.46%
common/persistence/nosql/nosql_task_store.go 3 85.52%
service/history/task/timer_standby_task_executor.go 3 85.63%
service/history/task/fetcher.go 3 86.6%
Totals Coverage Status
Change from base Build 0190508e-3f7e-4c0d-b546-4eaa2a2045f1: -0.03%
Covered Lines: 106763
Relevant Lines: 149301

💛 - Coveralls

@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch from 08460ec to 2f2f9c7 Compare June 28, 2024 22:33
@coveralls
Copy link

coveralls commented Jun 28, 2024

Pull Request Test Coverage Report for Build 019060fb-a2f5-46ac-a87c-29c4b9ea3cef

Details

  • 64 of 67 (95.52%) changed or added relevant lines in 13 files are covered.
  • 2406 unchanged lines in 53 files lost coverage.
  • Overall coverage decreased (-0.02%) to 71.507%

Changes Missing Coverage Covered Lines Changed/Added Lines %
service/history/task/transfer_active_task_executor.go 0 3 0.0%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
common/persistence/sql/sqlplugin/postgres/task.go 2 73.4%
service/matching/tasklist/db.go 2 73.23%
common/dynamicconfig/constants.go 2 99.05%
service/matching/tasklist/task_list_manager.go 2 76.65%
common/dynamicconfig/config.go 2 98.92%
common/util.go 2 91.84%
common/constants.go 2 0.0%
service/history/task/task.go 3 84.81%
common/log/tag/tags.go 3 50.73%
Totals Coverage Status
Change from base Build 019060e3-e556-4061-8a61-c080801be36f: -0.02%
Covered Lines: 106760
Relevant Lines: 149300

💛 - Coveralls

@@ -47,5 +54,19 @@ func TerminateWorkflow(
terminateDetails,
terminateIdentity,
)

domainName := mutableState.GetDomainEntry().GetInfo().Name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving these metrics to AddWorkflowExecutionTerminatedEvent. It will simplify things since mutableStateBuilder already has metrics.Client and logger. Then you don't need to change the signature of this method

tag.WorkflowDomainName(domainName),
tag.WorkflowID(r.mutableState.GetExecutionInfo().WorkflowID),
tag.WorkflowRunID(r.mutableState.GetExecutionInfo().RunID),
tag.WorkflowTerminationReason(WorkflowTerminationReason),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this code path only called for conflict based terminations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend passing the reason as a parameter because this looks like a helper function that may get called from other places in the future. Having a reason param will force the future changes to pass the actual reason which is desired compared to logging the invalid reason for such cases.

@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch from 8e519a7 to 7aa89ca Compare July 1, 2024 19:24
tag.WorkflowTerminationReason(reason),
)

scopeWithDomainTag := e.metricsClient.Scope(metrics.HistoryTerminateWorkflowExecutionScope).Tagged(metrics.DomainTag(domainName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can add the reason dimension to this?

@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch 3 times, most recently from c034d9b to 460c452 Compare July 5, 2024 23:19
**What changed?**
Added logs to workflow termination events, containing the reason for the termination, tagged with the `domainName`, `workflowID`, `terminationReason`, and `runID`.
Added metrics to workflow termination events, using a counter per domain `WorkflowTerminateCounterPerDomain` under the `HistoryTerminateWorkflowExecutionScope` scope, with `WorkflowTerminationReasonTag`

**Why?**
Improve workflow termination visibility, allowing Cadence and clients to easily find terminated workflows. This is particularly important to provide better information for workflows terminated during failovers.

**How did you test it?**
Unit tests.

**Potential risks**
The risks are associated with the changes in functions parameters being passed. Need to ensure that the parameters are correct and that they do not contain `nil` values.

**Release notes**

**Documentation Changes**
@fimanishi fimanishi force-pushed the metrics-for-workflows-affected-by-conflict-resolution branch from 460c452 to e3becf7 Compare July 5, 2024 23:25
Copy link

codecov bot commented Jul 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.67%. Comparing base (731cec6) to head (f796114).
Report is 6 commits behind head on master.

Additional details and impacted files
Files Coverage Δ
service/history/execution/mutable_state_builder.go 61.67% <100.00%> (+0.40%) ⬆️
service/history/execution/workflow.go 65.81% <100.00%> (ø)

... and 14 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc05a78...f796114. Read the comment docs.

tag.WorkflowTerminationReason(reason),
)

re := regexp.MustCompile(`[^a-zA-Z0-9]`)
Copy link
Member

@davidporter-id-au davidporter-id-au Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to be annoying, but lets move this outside the function invocation, there's no need to recompile it each time.

By convention, go programs put contstants, type declarations up the top, so perhaps around the top of the metrics/tags.go we can define the regex and then just use it inside the metrics.WorkflowTerminationReasonTag function, thefore we don't need to remember to sanitize inline like this for every use?

The syntax outside function closure is slightly different, so either


var safeAlphaNumericStringRE = regexp.MustCompile(`[^a-zA-Z0-9]`)

// or, within a var block like:
var (
   safeAlphaNumericStringRE = regexp.MustCompile(`[^a-zA-Z0-9]`)
)

and that will make it easier such that we reference the already compiled regex

// WorkflowTerminationReasonTag reports the reason for workflow termination
func WorkflowTerminationReasonTag(value string) Tag {
        value = safeAlphaNumericStringRE.ReplaceAllString(value, "_")
	return simpleMetric{key: workflowTerminationReason, value: value}
}

@fimanishi fimanishi merged commit 3248455 into cadence-workflow:master Jul 11, 2024
20 checks passed
@fimanishi fimanishi deleted the metrics-for-workflows-affected-by-conflict-resolution branch July 11, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants