Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

API for monthly_active_users table #3633

Merged
merged 29 commits into from
Aug 8, 2018
Merged

API for monthly_active_users table #3633

merged 29 commits into from
Aug 8, 2018

Conversation

neilisfragile
Copy link
Contributor

Doesn't actually do anything other than ensure that table purges old values.

Will form basis for extending #3630

@neilisfragile neilisfragile changed the base branch from master to develop July 31, 2018 15:39
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but I think this needs to be hooked up so that we write to it, otherwise its hard to see whether this is the right way of doing things.

But broadly looks good so far!

* limitations under the License.
*/

-- a table of users who have requested that their details be erased
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LIES AND SLANDER

sql = """
SELECT COALESCE(count(*), 0) FROM (
SELECT user_id FROM monthly_active_users
) u
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be SELECT COALESCE(count(*), 0) FROM monthly_active_users?

)

def clean_out_monthly_active_users(self):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is spurious


defer.returnValue(
result
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who's jeff?

I'd probably write it/reformat to simply be:

defer.returnValue(bool(user_present))

"timestamp": int(self._clock.time_msec()),
},
lock=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upserts without locks are dangerous as they're racey. Upserts with locks are bad and should be avoided.

I'd quite like to see how this is actually hooked up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have reworked nits, and left upserts until we figure out a better way to solve this

@defer.inlineCallbacks
def _populate_monthly_active_users(self, user_id):
store = self.hs.get_datastore()
print "entering _populate_monthly_active_users"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No prints. Please use logger if you want to have that

else:
count = yield store.get_monthly_active_count()
print "count is %d" % count
if count < self.hs.config.max_mau_value:
Copy link
Contributor

@krombel krombel Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to return somehow if this condition is false.

@neilisfragile
Copy link
Contributor Author

monthly_active_users now fully backing log in and register tests.
table is reaped, and read requests cached

Especially keen to get comment on adding @defer.inlineCallbacks to insert_client_ip
Also, the caching.

user_id(str): the user_id to query
"""

store = self.hs.get_datastore()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store is the same as self here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good point

Cleans out monthly active user table to ensure that no stale
entries exist.
Return:
Defered()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred with two r's. Also, Returns with an s, and a newline before it. (This is so that IDEs can read it to get type hints)

sql = """
DELETE FROM monthly_active_users
ORDER BY timestamp desc
LIMIT -1 OFFSET ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to include a LIMIT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need it for Postgres, but best I can tell I do need it for sqlite https://sqlite.org/lang_select.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that's true. But it doesn't look like postgres supports negative limits. You can instead probably structure the query as:

DELETE FROM monthly_active_users
WHERE user_id NOT IN (
   SELECT user_id FROM monthly_active_users
   ORDER BY timestamp DESC
   LIMIT ?
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groans, I guess the table never gets too large

# Have resolved to invalidate the whole cache for now and do
# something about it if and when the perf becomes significant
self.is_user_monthly_active.invalidate_all()
self.get_monthly_active_count.invalidate_all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These invalidations should only happen after the transaction has committed, which can be done by yielding on the self.runInteraction

such as monthly active user limiting or global disable flag
Args:
error (Error): The error that should be raised if user is to be
blocked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems odd. Either this should raise an AuthError and callers can catch AuthErrors and re-raise as a more appropriate exception, or it should just return a boolean.

Generates current count of monthly active users.abs
Return:
Defered(int): Number of current monthly active users
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know its a bit nitty, but consistency is quite important for readability:

"""Generates current count of monthly active users.abs

Returns:
    Defered[int]: Number of current monthly active users
""""

(Indentation, returns and square brackets for the type Deferred returns)

@@ -93,6 +95,25 @@ def insert_client_ip(self, user_id, access_token, ip, user_agent, device_id,

self._batch_row_update[key] = (user_agent, device_id, now)

@defer.inlineCallbacks
def _populate_monthly_active_users(self, user_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it should be in monthly_active_users.py

if self.hs.config.limit_usage_by_mau:
is_user_monthly_active = yield store.is_user_monthly_active(user_id)
if is_user_monthly_active:
yield store.upsert_monthly_active_user(user_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid having to do an update on every request. One way of avoiding this is to have the is_user_monthly_active return the timestamp and use that to only insert if the difference is greater than, say, a couple of hours. Since that DB function is cached we avoid doing queries most of the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

lock=False,
)
self.is_user_monthly_active.invalidate((user_id,))
self.get_monthly_active_count.invalidate(())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only invalidate if we actually added a new user to the table (rather than just updating), helpfully _simple_upsert returns True on insert and False on update.

@erikjohnston
Copy link
Member

Especially keen to get comment on adding @defer.inlineCallbacks to insert_client_ip

I think that's fine, so long as we don't actually perform a query in the common case.

)
timestamp = None
if len(result) > 0:
timestamp = result[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a _simple_select_one_onecol ftr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handy, done

if count < self.hs.config.max_mau_value:
yield self.upsert_monthly_active_user(user_id)
elif now - last_seen_timestamp > LAST_SEEN_GRANULARITY:
yield self.upsert_monthly_active_user(user_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you quickly add a comment to say why we're not always upserting?

-- a table of monthly active users, for use where blocking based on mau limits
CREATE TABLE monthly_active_users (
user_id TEXT NOT NULL,
timestamp BIGINT NOT NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a quick comment saying what timestamp is? In particular, its probably worth calling out that it isn't necessarily accurate.

# If MAU user count still exceeds the MAU threshold, then delete on
# a least recently active basis.
# Note it is not possible to write this query using OFFSET due to
# incompatibilities in how sqlite an postgres support the feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/an/and/


@cached(num_args=0)
def get_monthly_active_count(self):
"""Generates current count of monthly active users.abs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, editor autocomplete

user_id TEXT NOT NULL,
-- Last time we saw the user. Not guaranteed to be accurate due to rate limiting
-- on updates, Granularity of updates governed by
-- syanpse.storage.monthly_active_users.LAST_SEEN_GRANULARITY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/syanpse/synapse/

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo spelling mistakes, thanks @krombel

@neilisfragile neilisfragile merged commit 990fe9f into develop Aug 8, 2018
@neilisfragile neilisfragile deleted the neilj/mau_tracker branch August 8, 2018 12:44
richvdh added a commit that referenced this pull request Aug 22, 2018
Features
--------

- Add support for the SNI extension to federation TLS connections. Thanks to @vojeroen! ([\#3439](#3439))
- Add /_media/r0/config ([\#3184](#3184))
- speed up /members API and add `at` and `membership` params as per MSC1227 ([\#3568](#3568))
- implement `summary` block in /sync response as per MSC688 ([\#3574](#3574))
- Add lazy-loading support to /messages as per MSC1227 ([\#3589](#3589))
- Add ability to limit number of monthly active users on the server ([\#3633](#3633))
- Support more federation endpoints on workers ([\#3653](#3653))
- Basic support for room versioning ([\#3654](#3654))
- Ability to disable client/server Synapse via conf toggle ([\#3655](#3655))
- Ability to whitelist specific threepids against monthly active user limiting ([\#3662](#3662))
- Add some metrics for the appservice and federation event sending loops ([\#3664](#3664))
- Where server is disabled, block ability for locked out users to read new messages ([\#3670](#3670))
- set admin uri via config, to be used in error messages where the user should contact the administrator ([\#3687](#3687))
- Synapse's presence functionality can now be disabled with the "use_presence" configuration option. ([\#3694](#3694))
- For resource limit blocked users, prevent writing into rooms ([\#3708](#3708))

Bugfixes
--------

- Fix occasional glitches in the synapse_event_persisted_position metric ([\#3658](#3658))
- Fix bug on deleting 3pid when using identity servers that don't support unbind API ([\#3661](#3661))
- Make the tests pass on Twisted < 18.7.0 ([\#3676](#3676))
- Don’t ship recaptcha_ajax.js, use it directly from Google ([\#3677](#3677))
- Fixes test_reap_monthly_active_users so it passes under postgres ([\#3681](#3681))
- Fix mau blocking calulation bug on login ([\#3689](#3689))
- Fix missing yield in synapse.storage.monthly_active_users.initialise_reserved_users ([\#3692](#3692))
- Improve HTTP request logging to include all requests ([\#3700](#3700))
- Avoid timing out requests while we are streaming back the response ([\#3701](#3701))
- Support more federation endpoints on workers ([\#3705](#3705), [\#3713](#3713))
- Fix "Starting db txn 'get_all_updated_receipts' from sentinel context" warning ([\#3710](#3710))
- Fix bug where `state_cache` cache factor ignored environment variables ([\#3719](#3719))
- Fix bug in v0.33.3rc1 which caused infinite loops and OOMs ([\#3723](#3723))
- Fix bug introduced in v0.33.3rc1 which made the ToS give a 500 error ([\#3732](#3732))

Deprecations and Removals
-------------------------

- The Shared-Secret registration method of the legacy v1/register REST endpoint has been removed. For a replacement, please see [the admin/register API documentation](https://github.com/matrix-org/synapse/blob/master/docs/admin_api/register_api.rst). ([\#3703](#3703))

Internal Changes
----------------

- The test suite now can run under PostgreSQL. ([\#3423](#3423))
- Refactor HTTP replication endpoints to reduce code duplication ([\#3632](#3632))
- Tests now correctly execute on Python 3. ([\#3647](#3647))
- Sytests can now be run inside a Docker container. ([\#3660](#3660))
- Port over enough to Python 3 to allow the sytests to start. ([\#3668](#3668))
- Update docker base image from alpine 3.7 to 3.8. ([\#3669](#3669))
- Rename synapse.util.async to synapse.util.async_helpers to mitigate async becoming a keyword on Python 3.7. ([\#3678](#3678))
- Synapse's tests are now formatted with the black autoformatter. ([\#3679](#3679))
- Implemented a new testing base class to reduce test boilerplate. ([\#3684](#3684))
- Rename MAU prometheus metrics ([\#3690](#3690))
- add new error type ResourceLimit ([\#3707](#3707))
- Logcontexts for replication command handlers ([\#3709](#3709))
- Update admin register API documentation to reference a real user ID. ([\#3712](#3712))
-- on updates, Granularity of updates governed by
-- synapse.storage.monthly_active_users.LAST_SEEN_GRANULARITY
-- Measured in ms since epoch.
timestamp BIGINT NOT NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really surprised this is called timestamp, which is a reserved keyword in postgres and most SQL (it's a data type), as you can from the syntax highlighting. It also doesn't describe what the column is used for - surely it should be last_active or similar?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(it's also a problem because you can't rename columns in sqlite, so we really want to get it right the first time).

);

CREATE UNIQUE INDEX monthly_active_users_users ON monthly_active_users(user_id);
CREATE INDEX monthly_active_users_time_stamp ON monthly_active_users(timestamp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be monthly_active_users_timestamp to be consistent, as timestamp is one word

@richvdh
Copy link
Member

richvdh commented Aug 22, 2018

since we're in the business of finding problems in PRs after they have landed:

  • why was this in synapse/storage/schema/delta/51 rather than synapse/storage/schema/delta/50 ? Doing so has made it impossible for people to roll back from synapse 0.33.3 to 0.33.2 without manual database work.
  • Please can we use CREATE TABLE IF NOT EXISTS and CREATE INDEX IF NOT EXISTS. There are occasions when a delta fails half-way through, and the IF NOT EXISTS makes the delta idempotent.

@erikjohnston
Copy link
Member

We usually create a new delta file after a release? Anyway, do we really re run schema files even if they've been applied before? But agreed should try and remember IF NOT EXISTS, though CREATE INDEX doesn't support it on postgres 9.4

@richvdh
Copy link
Member

richvdh commented Aug 22, 2018

We usually create a new delta file after a release?

A new delta file, sure. A new directory? Well, I didn't get the memo, and I question the value of doing so vs the problems it presents for anyone trying to roll back.

Anyway, do we really re run schema files even if they've been applied before?

we do not, but the update code does get upset if either (a) the database schema number is higher than it knows about or (b) it finds records of delta files from version X already having been applied when trying to upgrade from X-1 to X.

CREATE INDEX doesn't support it on postgres 9.4

gahhhh

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants