
As a gitcoin admin, I'd like the mailchimp syncing to be refactored, so the load on the database is lowered and the site runs faster #4784

Closed
danlipert opened this issue Jul 12, 2019 · 6 comments


@danlipert
Contributor

User Story

As a gitcoin admin, I'd like the mailchimp syncing to be refactored, so the load on the database is lowered and the site runs faster

Why Is this Needed

Summary: While the sync_mail management command runs, it slows down the site significantly.

Description

Type: Bug

Current Behavior

The mailchimp lists sync bi-directionally every 2 hours.

Expected Behavior

The mailchimp lists should sync at a lower rate, and the sync from mailchimp to gitcoin should be done less often since it puts a heavy load on the database.

Definition of Done

The current sync_mail cronjob and management command are split into two separate jobs: the sync from mailchimp to gitcoin runs only once a day, while the sync from gitcoin to mailchimp runs twice a day. One possible shape for the split is sketched below.
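A rough sketch of the split as two Django management commands (file and command names here are hypothetical; pull_to_db() is the function referenced later in this issue, while push_to_mailchimp() is a stand-in name for the reverse direction, and imports of the sync helpers are omitted since their module path isn't shown here):

    # hypothetical file: marketing/management/commands/pull_mailchimp.py
    # scheduled once a day
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = 'pull subscriber data from mailchimp into the gitcoin db'

        def handle(self, *args, **options):
            pull_to_db()  # the heavy, db-intensive direction

    # hypothetical file: marketing/management/commands/push_mailchimp.py
    # scheduled twice a day
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = 'push gitcoin subscriber data up to mailchimp'

        def handle(self, *args, **options):
            push_to_mailchimp()  # hypothetical name for the reverse direction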

Data Requirements

While the mailchimp to gitcoin sync is running, database queries often take up to 0.25 seconds longer per request, resulting in a major performance hit.

Additional Information

Reported by @owocki and discovered by profiling the running jobs on the cronbox

@owocki
Contributor

owocki commented Jul 12, 2019

specifically i think that the pull_to_db() section is slow. it'll do

- 20k db reads (for profiles)
- 5k db reads for matches
- 20k for the mailchimp list
- 1.5k for tips
- 5k for bounties
- 100 for whitepaper access requests

PLUS all of the above x2 for all the calls to get_or_save_email_subscriber()
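the x2 presumably comes from each call to get_or_save_email_subscriber() doing its own lookup per email; a guess at the shape of that function (not the actual implementation):

    from marketing.models import EmailSubscriber  # assumed model location

    def get_or_save_email_subscriber(email, source):
        # one extra read per call to check for an existing subscriber...
        es = EmailSubscriber.objects.filter(email__iexact=email).first()
        if not es:
            # ...plus a write when the row doesn't exist yet
            es = EmailSubscriber.objects.create(email=email, source=source)
        return es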

@owocki
Contributor

owocki commented Jul 12, 2019

two recommendations for how to make this job more efficient @danlipert (these are both low lifts):

1. consolidate the DB reads down to one per object type. a la, instead of

        from dashboard.models import Subscription
        for sub in Subscription.objects.all():
            email = sub.email
            process_email(email, 'dashboard_subscription')

   do a

        from dashboard.models import Subscription
        for email in Subscription.objects.all().values_list('email', flat=True):
            process_email(email, 'dashboard_subscription')

2. limit the reads in pull_to_db() to only objects created in the last n hours, where n = the number of hours since the job last ran (a minimal sketch follows below)
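for 2, a minimal sketch of the filter, assuming the models have a created_on timestamp field (the window constant is a placeholder):

    from datetime import timedelta

    from django.utils import timezone

    from dashboard.models import Subscription

    # placeholder: this value should come from a single shared setting rather
    # than being hard-coded both here and in the crontab (see discussion below)
    SYNC_WINDOW_HOURS = 12

    # only pull rows created since the job last ran
    cutoff = timezone.now() - timedelta(hours=SYNC_WINDOW_HOURS)
    for email in Subscription.objects.filter(created_on__gt=cutoff).values_list('email', flat=True):
        process_email(email, 'dashboard_subscription')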

@danlipert
Contributor Author

Those look good - I think #1 will give a small improvement by grabbing one column instead of the whole row. #2 is good in general for sure - I think it'd also be good to have finer-grained scheduling on these by splitting out the different steps as described in the issue. I wonder how we can avoid having a "magic number" for the n hours in both the python code and the crontab config - having the magic number led to problems with the mailchimp list getting desynced previously (it was 2 hours in the python script but 6 hours in the crontab).

@owocki
Contributor

owocki commented Jul 12, 2019

what's a "magic number"?

@danlipert
Contributor Author

@owocki https://en.wikipedia.org/wiki/Magic_number_(programming) - in this case there was a value of 2 hours that needed to be set in two different places for the sync to work properly, instead of using an environment variable or something similar that each file would import from. A sketch of that approach is below.
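a sketch of that kind of fix, with one env-driven value as the single source of truth (setting name hypothetical):

    # app/settings.py: single source of truth, overridable via the environment
    import os

    MAILCHIMP_SYNC_HOURS = int(os.environ.get('MAILCHIMP_SYNC_HOURS', 2))

    # inside the management command
    from datetime import timedelta

    from django.conf import settings
    from django.utils import timezone

    cutoff = timezone.now() - timedelta(hours=settings.MAILCHIMP_SYNC_HOURS)

the crontab entry would then be generated from (or read) the same MAILCHIMP_SYNC_HOURS environment variable instead of hard-coding its own interval.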

@owocki
Contributor

owocki commented Jul 15, 2019

PR is in at #4798
