Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAMZA-2444: JobModel save in CoordinatorStreamStore resulting flush for each message #1259

Merged
merged 7 commits into from
Feb 5, 2020

Conversation

alnzng
Copy link
Contributor

@alnzng alnzng commented Jan 23, 2020

Symptom

When Samza's job creates lots of tasks/partitions, it can take over a long time for AM to save the metadata in a run which may cause timeout exception.
We observed if the Samza's job has over 37k tasks/partitions, it takes over 10 min for AM to save it in a run.

Cause

JobModelManager uses CoordinatorStreamStore.put() to save job metadata information which does flush for each message, and the flush operation is heavy especially when the remote server suffering the performance issues.

Changes

  • Separate flush from put/putAll/delete functions in CoordinatorStreamStore.
  • Explicity call flush after call put/putAll/delete in related classes
  • Batch write task partition assignments information to metadata store.
  • Batch write task container information to metadata store.

Tests

  • All unit tests and integration tests are passed

API Changes

  • Replace writeTaskPartitionAssignment with new batch write method writeTaskPartitionAssignments in TaskPartitionAssignmentManager
  • Replace writeTaskContainerMapping with new batch write method writeTaskContainerMappings in TaskAssignmentManager.

Upgrade Instructions

None

Usage Instructions

None

@prateekm
Copy link
Contributor

@bharathkk and @lakshmi-manasa-g can you take a look?

@xinyuiscool
Copy link
Contributor

Now I looked at the coodinator stream impl for metadatstore, it seems we shouldn't couple flush with put/putall/delete/deleteall. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

Now I looked at the coodinator stream impl for metadatstore, it seems we shouldn't couple flush with put/putall/delete/deleteall. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

@xinyuiscool Do you mean we should remove flush calls from put/putall/delete/deleteall? And let the callers of those functions decide when to do flush operation?

@xinyuiscool
Copy link
Contributor

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

@lakshmi-manasa-g Thanks for the review. Had one discussion with @xinyuiscool and @prateekm , we decided to change with another solution which will move the flush out of all update methods in CoordinatorStreamStore. So above codes with your comments will be removed. I will let you know when the new changes are done.

@prateekm
Copy link
Contributor

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

@prateekm
Just checked the PR: #1125. I believe it is solving exactly the same performance issue with this one.
But it is doing with batch updates way, and current PR is doing the way to decouple flush and update methods.

If we solve the issue with this PR, later we can close it.

1. Batch write task partition assignments information to metadata store.
2. Batch write task container information to metadata store.

Signed-off-by: Alan Zhang <[email protected]>
Signed-off-by: Alan Zhang <[email protected]>
Signed-off-by: Alan Zhang <[email protected]>
@alnzng
Copy link
Contributor Author

alnzng commented Jan 28, 2020

The changes are done and updated the details in the PR's description, please help review, thanks!
@xinyuiscool @prateekm @lakshmi-manasa-g

@mynameborat
Copy link
Contributor

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

@alnzng here is the PR for context - #1112
+1 to splitting flush and the API. Refer to the comments in the above PR regarding separating out flush and put APIs

Copy link
Contributor

@lakshmi-manasa-g lakshmi-manasa-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for addressing the comments.

@alnzng
Copy link
Contributor Author

alnzng commented Feb 4, 2020

@xinyuiscool @prateekm
Can any of you help check and merge the PR if no new comments? Thanks.

Copy link
Contributor

@xinyuiscool xinyuiscool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the quick fix!

@xinyuiscool xinyuiscool merged commit ca641dc into apache:master Feb 5, 2020
@alnzng alnzng deleted the SAMZA-2444 branch February 5, 2020 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants