
frequencies: Fix pivot logic to always include the end date #1121

Merged
merged 7 commits into from
Jan 19, 2023

Conversation

huddlej
Contributor

@huddlej huddlej commented Jan 8, 2023

Description of proposed changes

Background

Augur estimates frequencies of clades and other tip-based attributes by creating a grid of time points (internally called "pivots") and evaluating the frequency of clades, etc. at each time point in the grid. The logic to implement pivots has historically been complicated by the tension between representing dates as floating point values and wanting to support meaningful human dates. The current implementation of this logic in get_pivots uses the pandas date_range function to generate monthly or weekly intervals between a given start and end date. I originally selected this pandas function because it appeared to implement the human-meaningful monthly interval logic we wanted for monthly pivots.

Details

Unfortunately, the current get_pivots logic does not behave exactly the way we want it to. Although it supports monthly and weekly pivots between a given start and end date, it can produce pivots such that the final pivot is substantially earlier than the given end date. Since the "end date" in an analysis is typically the current date, such that the final pivot tells us the frequency of clades at the current time, the earlier placement of the final pivot causes us to ignore the most recent tips in an analysis and underestimate clade frequencies. This bug emerged while debugging a separate augur frequencies error in seasonal flu builds.

This PR fixes the previous, flawed pivot logic that allowed pivots to miss the end date by a substantial margin (for example: given an end date of January 6, 2023 and a 3-month interval, pivots ended at November 1, 2022). The new implementation uses dateutil, an existing transitive dependency of Augur's that supports month-based time deltas in addition to the week-based deltas supported by most other Python date libraries. This new implementation ensures that the end date always appears as the last entry in the pivot output (as we always wanted it to), works backward through time to create pivots from that end date minus the given time delta interval, and stops when the next pivot occurs before the start date. With respect to the start date, this behavior is similar to the prior implementation, which never guaranteed that the start date would be included as a pivot unless the pivot interval allowed it.
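The backward-iteration approach can be sketched as follows. This is a simplified illustration rather than the actual get_pivots implementation: the hypothetical month_pivots helper works directly on date objects instead of Augur's numeric dates.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def month_pivots(start, end, months):
    """Walk backward from the end date in month-sized steps, so the end
    date is always the final pivot (simplified sketch of the approach)."""
    pivots = []
    current = end
    while current >= start:
        pivots.append(current)
        current -= relativedelta(months=months)
    return sorted(pivots)

# Given an end date of January 6, 2023 and a 3-month interval, the last
# pivot is exactly the end date instead of November 1, 2022.
print(month_pivots(date(2022, 1, 6), date(2023, 1, 6), 3))
```

As in the real implementation, the start date only appears as a pivot when the interval happens to land on it exactly.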

A key technical detail of this new implementation is that the internal and intermediate pivot datetime instances are converted to floating point values with the numeric_date function instead of the timestamp_to_float function. The first of these calculates decimal values from the fraction of days in a year, while the second calculates decimal values from the fraction of months in a year. The day-based fractions from numeric_date are already used throughout Augur and TreeTime, while timestamp_to_float has only been used for frequency and distance calculations. Since other functions like float_to_datestring also interpret decimals as fractions of days in a year, the change to pivot logic here makes the output of get_pivots more consistent with the rest of Augur's date ecosystem.
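The difference between the two conventions can be illustrated with simplified reimplementations. These hypothetical helpers mimic the day-fraction and month-fraction behavior described above; they are not the actual numeric_date or timestamp_to_float code.

```python
from datetime import date

def day_fraction_year(d):
    """Decimal year from the fraction of days elapsed (numeric_date-style)."""
    year_start = date(d.year, 1, 1)
    year_length = (date(d.year + 1, 1, 1) - year_start).days
    return d.year + (d - year_start).days / year_length

def month_fraction_year(d):
    """Decimal year from the fraction of months elapsed
    (timestamp_to_float-style), ignoring the day of the month."""
    return d.year + (d.month - 1) / 12.0

d = date(2023, 3, 6)
print(day_fraction_year(d))    # 2023 + 64/365
print(month_fraction_year(d))  # 2023 + 2/12
```

Round-tripping dates between these two conventions is the source of the small rounding discrepancies mentioned in the unit test changes below.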

Considerations for datetime libraries to use for month-based time delta logic

The logic we need for month-based pivots requires a datetime library that supports month-based time deltas. Most Python datetime libraries do not support month deltas, since the underlying logic is complicated and messy compared to other standard delta offsets like hour, day, week, or year.

Libraries that do not support month-based time deltas include:

  • datetime.timedelta: does not support month deltas
  • pandas: does not support month deltas and produces unexpected date ranges with month frequencies (the root cause of the issue that this PR resolves)
  • numpy timedelta: technically implements a month-based time delta but does not support applying month time deltas to specific dates. For example, we want to run the following code, but it produces an error:
import numpy as np
np.datetime64('2023-01-06') - np.timedelta64(3, 'M')
# UFuncTypeError: Cannot cast ufunc 'subtract' input 1 from dtype('<m8[M]') to dtype('<m8[D]') with casting rule 'same_kind'
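The surprising pandas behavior can also be demonstrated directly. With a month-start frequency, date_range snaps pivots to the first of each month and drops any remainder before the end date; this is an illustration of the behavior rather than the exact get_pivots call.

```python
import pandas as pd

# A 3-month, month-start frequency between Jan 6, 2022 and Jan 6, 2023
# produces pivots aligned to the first of the month, ending well before
# the requested end date.
pivots = pd.date_range("2022-01-06", "2023-01-06", freq="3MS")
print(list(pivots))
```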

Libraries that do support month-based time deltas include:

  • isodate supports month deltas through its parse_duration function. For example, we can run date(2023, 1, 6) - isodate.parse_duration("P3M") to get the date three months prior to Jan 6, 2023. Thanks to @victorlin for pointing this one out!
  • dateutil supports this logic with its relativedelta function. For example, we can run date(2023, 1, 6) - relativedelta(months=3) to get the date three months prior to Jan 6, 2023. This library is stable, well-maintained, and an existing dependency of Augur's through pandas.
  • Pendulum supports this logic through the pendulum.duration function. For example, we can run pendulum.parse("2023-01-06") - pendulum.duration(months=3) to get the date three months prior to Jan 6, 2023. Note that Pendulum development has stalled a bit in the last couple of years and its creator is looking for maintainers.
  • Arrow supports this logic through its shift method on date instances. For example, we can run arrow.get("2023-01-06").shift(months=-3) to get the same date as the Pendulum call above. Arrow development similarly stalled for a while and was looking for maintainers, although they now appear to have at least 2 regular maintainers.
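As a small demonstration of the calendar-aware month arithmetic these libraries provide, dateutil's relativedelta preserves the day of month where possible and clamps it when the target month is shorter:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Three months before January 6, 2023 is October 6, 2022.
print(date(2023, 1, 6) - relativedelta(months=3))

# When the target month is shorter, the day of month is clamped:
# one month before March 31, 2023 is February 28 (non-leap year).
print(date(2023, 3, 31) - relativedelta(months=1))
```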

All of these libraries produce the same results for month- and week-based pivots. All of these libraries also have a large number of users and relatively stable APIs. Pendulum users have complained about breaking changes in the 2.X releases and unresolved new bugs from that series, although there has been more movement in the last year to step up maintenance.

dateutil was initially the most appealing choice, since it produces the exact output we want, interacts with and produces native Python date objects, and is an existing dependency of Augur via pandas.

As Victor pointed out in review, isodate is even better since it is already an explicit dependency and also produces the same output as the dateutil method. Using isodate here also paves the way to expose an ISO-8601 duration interface for the pivot calculations, so users could define arbitrary pivot length and units in a single argument (e.g., daily pivots with --pivot-interval "P1D").
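A single --pivot-interval duration argument could be parsed roughly like this. This is a stdlib-only sketch of the idea: the hypothetical parse_pivot_interval helper handles only the single-unit forms discussed here, whereas a real implementation would delegate to isodate.parse_duration.

```python
import re

def parse_pivot_interval(duration):
    """Parse a minimal ISO-8601 duration like "P1D", "P2W", or "P3M"
    into a (value, unit) pair. Sketch only: real durations are richer."""
    match = re.fullmatch(r"P(\d+)([DWM])", duration)
    if match is None:
        raise ValueError(f"unsupported pivot interval: {duration!r}")
    return int(match.group(1)), match.group(2)

print(parse_pivot_interval("P1D"))  # daily pivots
print(parse_pivot_interval("P3M"))  # e.g. 3-month pivots
```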

Breaking changes

This PR changes the behavior of an entire subcommand (augur frequencies) and defines a new explicit dependency from a transitive dependency, so we will need to release this as part of a major release to reflect these changes.

Impacts on scientific results

The inaccurate pivot logic for augur frequencies has the potential to affect scientific results that depend on these frequencies when estimated at monthly intervals. The biggest impact would be on estimates made at the end of a given month. For example, estimates at a 1-month interval on the last day of the month would produce a last pivot at the beginning of that month. In the worst case, with a longer interval such as 3 months, the last pivot could be up to three months before the end date.

The most extensive use of these monthly frequencies is in our seasonal influenza analyses including:

  1. Huddleston et al. 2020. These analyses used 6-month frequency intervals between start and end dates based on October 1 (the beginning of a month). Because these analyses rely on the first day of a month, they are not affected by the issue in this PR.
  2. WHO VCM reports. These reports have historically used either 1-month or 2-month intervals to estimate current frequencies. For example, the most recent VCM report used 2-month intervals, so frequencies estimated in the middle of September only reflected data available through August 1. However, almost all prior reports used 1-month intervals, where estimates made at any time during a given month would only reflect data up to the beginning of that month.
  3. Public seasonal influenza forecasts on nextstrain.org. We only produce forecasts for 2y H3N2 HA trees and these trees used 1-month frequency intervals. We produce these forecasts weekly, so some forecasts will be made at the end of a month. These forecasts will be affected by the issue in this PR in that samples collected since the beginning of the month will not be included in the forecasts.

The estimates affected above are clearly not ideal in that the last two cases do not always reflect all data that were available at the time the estimates were made. However, in the case of seasonal influenza genomes, the average delay between sample collection and submission to GISAID is about 2 months. Given this delay, the 1-month interval estimates that only reflect data up to the beginning of the current month are minimally affected by the issue described in this PR.

The worst case of those described above is for the VCM report on September 2022 that failed to reflect data collected after August 1. Inspection of GISAID data collected between August 1, 2022 and September 18, 2022 and submitted before or on September 18, 2022 shows that our forecasts missed 162 records. By comparison, another 455 records from this same collection time period were submitted after the forecasts were made (in other words, only ~25% of the data that would become available for that time period was available when we made our forecasts). Of these 162 records, 110 (68%) were collected in Spain and 142 (88%) were collected in Europe. A substantial portion of these would have been eliminated by our regional subsampling (an approach to minimize regional and temporal sequencing bias in our phylogenies). However, 77 of these missing sequences map to the emerging 2a.2/50K clade, indicating that we potentially underestimated the frequency of this growing clade at the time of forecasts.

Our other prominent use of augur frequencies is in the ncov workflow. Since we use 1-week intervals to estimate SARS-CoV-2 frequencies, these analyses are not affected by the issue in this PR.

Related issue(s)

Fixes #963

Testing

  • Add new unit tests for expected behavior
  • Correct previous unit and functional tests to reflect desired behavior
  • Test by CI

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@huddlej huddlej force-pushed the fix-monthly-pivots branch 3 times, most recently from 09aeeee to fe02380 Compare January 9, 2023 23:22
@codecov

codecov bot commented Jan 9, 2023

Codecov Report

Base: 68.04% // Head: 68.08% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (0591500) compared to base (5e1babb).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1121      +/-   ##
==========================================
+ Coverage   68.04%   68.08%   +0.04%     
==========================================
  Files          62       62              
  Lines        6681     6690       +9     
  Branches     1640     1641       +1     
==========================================
+ Hits         4546     4555       +9     
  Misses       1829     1829              
  Partials      306      306              
Impacted Files Coverage Δ
augur/frequency_estimators.py 36.16% <100.00%> (+1.07%) ⬆️


@huddlej huddlej marked this pull request as ready for review January 10, 2023 00:36
@huddlej huddlej requested review from rneher and victorlin January 10, 2023 00:36
@huddlej
Contributor Author

huddlej commented Jan 10, 2023

@victorlin I thought you might also be interested in the datetime implementation details in this PR, since you've thought about dates in Augur a lot. I'm especially curious if you have comments about the choice of datetime library or other considerations that I might have missed...

Member

@victorlin victorlin left a comment


Will take a look at the date stuff after lab meeting.

@@ -86,7 +86,7 @@ Build a time tree from the existing tree topology, the multiple sequence alignme
> --date-confidence \
> --date-inference marginal \
> --clock-filter-iqd 4 \
> --seed 314159 > /dev/null
> --seed 314159 &> /dev/null
Member

I pulled this out as a separate PR #1123, since it's independent and we could benefit from it separately. Should've done this sooner!

# Construct standard Python date instances from numeric dates via ISO-8601
# dates and the corresponding delta time for the interval between pivots.
start = datetime.datetime.strptime(float_to_datestring(pivot_start), "%Y-%m-%d")
end = datetime.datetime.strptime(float_to_datestring(pivot_end), "%Y-%m-%d")
Member

Above, pivot_end is defined as a floating point (not changed in this PR). But doesn't this defeat the entire purpose of doing month/week increments to maintain human readable dates? And if the last data point is on Jan 2nd and the interval is 3 months, the pivot_end is 3 months into the year.

What if we instead did something like this:

last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')

if pivot_interval_unit == 'month':
    while not last_data_point.is_month_end:
        last_data_point += pd.Timedelta(days=1)
elif pivot_interval_unit == 'week':
    while last_data_point.weekday() != 6:
        last_data_point += pd.Timedelta(days=1)

pivot_end = last_data_point

Member

This would put the end point to the next full month or week after the last data point.

Contributor Author

@huddlej huddlej Jan 11, 2023

Above, pivot_end is defined as a floating point (not changed in this PR). but doesn't this defeat the entire purpose of doing month/week increments to maintain human readable dates?

The floating point representation of pivot_start and pivot_end is just an intermediate state for date calculations. The user can provide a human-readable (ISO-8601) date from the command line; it gets turned into a floating point value for historical reasons, but then it gets converted back to ISO-8601 before performing date math. pivot_end and pivot_start could be defined as human-readable (ISO-8601) dates from the start and only lead to floating point values at the end of the function. The floating point values returned by get_pivots can be converted back to the expected ISO-8601 dates with float_to_datestring. The benefit of the ISO-8601 representation is that we can perform standard date math like:

from augur.frequency_estimators import get_pivots, float_to_datestring
from augur.dates import numeric_date
from datetime import date
from dateutil.relativedelta import relativedelta

# Find one month prior to March 6, 2023 with standard date logic.
# This should be February 6, 2023.
>>> date(2023, 3, 6) - relativedelta(months=1)
datetime.date(2023, 2, 6)

# Find one month prior to March 6, 2023 with floating point logic.
# What is "one month" in floating point values? If it is 1 / 12, we get the following result.
>>> float_to_datestring(numeric_date("2023-03-06") - (1 / 12))
'2023-02-04'

This is obviously only a problem for months since they don't have the same number of days. Week intervals are consistent between standard date math and floating point math.

And if the last data point is on Jan 2nd and the interval is 3 months, the pivot_end is 3 months into the year.

Ah, that's right and not great. With a single observation from that date, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3)]
['2023-01-02', '2023-04-02']

We usually run augur frequencies with an end date of the current day, though. In the current PR, if you had a single observation from Jan 2 and an end date of today, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-11']

But in the current implementation in master, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-01']

and throw out your single observation. For a monthly interval, your code example above would produce:

>>> import datetime
>>> import numpy as np
>>> import pandas as pd
>>> from treetime.utils import datetime_from_numeric
>>> from augur.dates import numeric_date
>>> end_date = None
>>> observations = [numeric_date("2023-01-02")]
>>> last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')
>>> while not last_data_point.is_month_end:
...     last_data_point += pd.Timedelta(days=1)
>>> last_data_point
Timestamp('2023-01-31 00:00:00')

Since the pivots slice through a specific time in the KDE distributions instead of summing over a time interval, is there a benefit to moving the last pivot any later than we have data instead of using the latest observation date as the end date?

Contributor Author

Also, I didn't realize TreeTime already had datetime_from_numeric and datestring_from_numeric. We could just use the latter instead of Augur's float_to_datestring.

Member

Thanks, John. And you are right, the end_date solves this problem in most cases. But it also means that our monthly pivots are not aligned with the beginning or end of a month. If that's ok, we might just do month=30days and week=7days.

Member

Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data. They would be too heavily influenced by fluctuations in that case. This is less of an issue for the diffusion frequencies.

Contributor Author

But it also means that our monthly pivots are not aligned with the beginning or end of a month.

That alignment to the beginning of a month was the original goal of the pandas-based implementation, but it is a pretty arbitrary goal. Even representing intervals by month feels arbitrary when "month" doesn't have a fixed value. The simpler approach of using weeks is appealing without needing to cast "month" to a fixed number of days. I would be happy to switch flu pivots to every 4 weeks, for instance.

Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data.

This also makes a lot of sense. Estimating KDE frequencies at a date beyond the most recent data is an error in the opposite direction of the original bug addressed in this PR. By that logic, we should change the following logic from:

pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.ceil(np.max(observations) / pivot_frequency) * pivot_frequency

to always use a min/max date based on the earliest and latest sample, instead of expanding the range of pivots:

pivot_start = start_date if start_date else np.min(observations)
pivot_end = end_date if end_date else np.max(observations)

Our tree can't plot any dates beyond the most recent tip, so it makes sense that frequencies shouldn't plot beyond the most recent tip either. The code I propose above doesn't account for our most common use case, where we do provide an "end date" that is usually today even though the most recent data are never collected on the current day itself. Providing an "end date" only makes sense in the context of retrospective work like the flu forecasting project, where we want to estimate frequencies in well-defined windows across existing data.

Providing an earlier "start date" makes sense, though, since it allows the pivots to include all available data. The code above could produce pivots that don't include the earliest observation, depending on the pivot interval and end date. So, we probably want something like this that keeps the earlier start date than min observation but uses the max observation as the end date.

pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.max(observations)

The existing changes in this PR will ensure that the end date is always the last pivot.

Contributor Author

There is more work we could do in this area, but I think we should leave treatment of observed data (without start and end dates) to another PR. This PR fixes a bug that immediately affects our flu work, so I'd like to merge and release it soon. I documented this conversation as a new issue.

huddlej and others added 5 commits January 19, 2023 12:53
Corrects existing tests of pivot logic to check the specific dates in
the monthly tests and to use the correct number of expected weekly
pivots from a 1-year interval (53 instead of 52). Adds a test of
realistic start and end dates for monthly pivots that we know fails
under the current implementation, as revealed by a crash in a recent
seasonal flu build.
Fixes the previous, flawed pivot logic that allowed pivots to _not_
include the end date by a substantial margin (for example: given an end
date of January 6, 2023 and a 3-month interval, pivots ended at November
1, 2022). The new implementation uses a third-party date library,
dateutil, that supports month-based time deltas in addition to the
week-based deltas supported by most other Python date libraries. This
new implementation ensures that the end date always appears as the last
entry in the pivot outputs (as we always wanted it to), works
backward through time to create pivots from that end date minus the
given time delta interval, and stops when the next pivot occurs before
the start date. With respect to the start date, this behavior is similar
to the prior implementation which never guaranteed that the start date
would be included as a pivot unless the pivot interval allowed it.

A key technical detail of this new implementation is that the internal
and intermediate pivot datetime instances are converted to floating
point values with the `numeric_date` function instead of the
`timestamp_to_float` function. The first of these calculates decimal
values from the fraction of days in a year, while the second calculates
decimal values from the fraction of months in a year. The day-based
fractions from `numeric_date` are already used throughout Augur and
TreeTime, while `timestamp_to_float` has only been used for frequency
and distance calculations. Since other functions like
`float_to_datestring` also interpret decimals as fractions of days in a
year, the change to pivot logic here makes the output of `get_pivots`
more consistent with the rest of Augur's date ecosystem.
Corrects unit tests that only used to work because of their anticipation
of rounding errors when converting between floating point dates that
interpret decimal values as fractions of days in a year vs. fractions of
months in a year.
Updates the specific values of expected frequencies for functional
tests, since the corrected pivot calculation logic produces new pivots
and correspondingly new frequency values per pivot. One nice side effect
of the new pivot calculations is that they produce the correct number of
pivots in our functional test of relative min and max date values that
had previously failed in a flaky manner. The corrected pivot logic
reveals that the "flaky" behavior of that test was actually the correct
behavior: we should always get 3 pivots when the max date is 6 months
ago, the min date is 12 months ago, and the pivot interval is 3 months.
That we previously only got 2 pivots from these relative dates most
times of the year reflects the invalid pivot calculation logic from a
different perspective. This commit reactivates that previously
commented-out test.

Fixes #963
@huddlej huddlej force-pushed the fix-monthly-pivots branch from 2e571e1 to dcd256f Compare January 19, 2023 20:54
Thanks to @victorlin's improvements to `numeric_date`, we can pass a
date object directly without first converting it to a date string.
@huddlej huddlej merged commit 85d2edd into master Jan 19, 2023
@huddlej huddlej deleted the fix-monthly-pivots branch January 19, 2023 23:01

Successfully merging this pull request may close these issues.

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive