frequencies: Fix pivot logic to always include the end date #1121
Conversation
Codecov Report — Base: 68.04% // Head: 68.08% // Increases project coverage by +0.04%.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #1121      +/-   ##
==========================================
+ Coverage   68.04%   68.08%   +0.04%
==========================================
  Files          62       62
  Lines        6681     6690       +9
  Branches     1640     1641       +1
==========================================
+ Hits         4546     4555       +9
  Misses       1829     1829
  Partials      306      306
```

☔ View full report at Codecov.
@victorlin I thought you might also be interested in the datetime implementation details in this PR, since you've thought about dates in Augur a lot. I'm especially curious if you have comments about the choice of datetime library or other considerations that I might have missed...
Will take a look at the date stuff after lab meeting.
tests/builds/zika.t (Outdated)
```diff
@@ -86,7 +86,7 @@ Build a time tree from the existing tree topology, the multiple sequence alignment
   > --date-confidence \
   > --date-inference marginal \
   > --clock-filter-iqd 4 \
-  > --seed 314159 > /dev/null
+  > --seed 314159 &> /dev/null
```
I pulled this out as a separate PR #1123, since it's independent and we could benefit from it separately. Should've done this sooner!
```python
# Construct standard Python date instances from numeric dates via ISO-8601
# dates and the corresponding delta time for the interval between pivots.
start = datetime.datetime.strptime(float_to_datestring(pivot_start), "%Y-%m-%d")
end = datetime.datetime.strptime(float_to_datestring(pivot_end), "%Y-%m-%d")
```
Above, `pivot_end` is defined as a floating point (not changed in this PR). But doesn't this defeat the entire purpose of doing month/week increments to maintain human-readable dates? And if the last data point is on Jan 2nd and the interval is 3 months, the `pivot_end` is 3 months into the year.
What if we instead did something like this:
```python
last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')
if pivot_interval_unit == 'month':
    while not last_data_point.is_month_end:
        last_data_point += pd.Timedelta(days=1)
elif pivot_interval_unit == 'weeks':
    # Advance to the next Sunday (weekday 6).
    while last_data_point.weekday() != 6:
        last_data_point += pd.Timedelta(days=1)
pivot_end = last_data_point
```
This would put the end point to the next full month or week after the last data point.
> Above, `pivot_end` is defined as a floating point (not changed in this PR). But doesn't this defeat the entire purpose of doing month/week increments to maintain human-readable dates?

The floating point representation of `pivot_start` and `pivot_end` is just an intermediate state as far as date calculations go. The user can provide a human-readable (ISO-8601) date from the command line; it gets turned into a floating point value for historical reasons, but then it gets converted back to ISO-8601 before performing date math. `pivot_end` and `pivot_start` could be defined as human-readable (ISO-8601) dates from the start and only lead to floating point values at the end of the function. The floating point values returned by `get_pivots` can be converted back to the expected ISO-8601 dates with `float_to_datestring`. The benefit of the ISO-8601 representation is that we can perform standard date math like:
```python
>>> from datetime import date
>>> from dateutil.relativedelta import relativedelta
>>> from augur.frequency_estimators import get_pivots, float_to_datestring
>>> from augur.dates import numeric_date

# Find one month prior to March 6, 2023 with standard date logic.
# This should be February 6, 2023.
>>> date(2023, 3, 6) - relativedelta(months=1)
datetime.date(2023, 2, 6)

# Find one month prior to March 6, 2023 with floating point logic.
# What is "one month" in floating point values? If it is 1 / 12, we get the following result.
>>> float_to_datestring(numeric_date("2023-03-06") - (1 / 12))
'2023-02-04'
```
This is obviously only a problem for months since they don't have the same number of days. Week intervals are consistent between standard date math and floating point math.
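As a rough stdlib-only illustration of why weeks are safe (the round trip through Augur's `numeric_date` and `float_to_datestring` should agree up to rounding at day boundaries):

```python
from datetime import date, timedelta

d = date(2023, 3, 6)

# Standard date math: one week earlier is always exactly 7 days earlier.
print(d - timedelta(weeks=1))  # 2023-02-27

# Day-based floating point math: subtracting 7/365 of a (non-leap) year
# also moves back exactly 7 calendar days, so the two conventions agree.
day_of_year = d.timetuple().tm_yday  # 65
print(date(d.year, 1, 1) + timedelta(days=day_of_year - 1 - 7))  # 2023-02-27
```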
> And if the last data point is on Jan 2nd and the interval is 3 months, the `pivot_end` is 3 months into the year.
Ah, that's right and not great. With a single observation from that date, you'd get:
```python
>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3)]
['2023-01-02', '2023-04-02']
```
We usually run `augur frequencies` with an end date of the current day, though. In the current PR, if you had a single observation from Jan 2 and an end date of today, you'd get:

```python
>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-11']
```

But in the current implementation in `master`, you'd get:

```python
>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-01']
```
and throw out your single observation. For a monthly interval, your code example above would produce:
```python
>>> import datetime
>>> import numpy as np
>>> import pandas as pd
>>> from treetime.utils import datetime_from_numeric
>>> from augur.dates import numeric_date
>>> end_date = None
>>> observations = [numeric_date("2023-01-02")]
>>> last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')
>>> while not last_data_point.is_month_end:
...     last_data_point += pd.Timedelta(days=1)
...
>>> last_data_point
Timestamp('2023-01-31 00:00:00')
```
Since the pivots slice through a specific time in the KDE distributions instead of summing over a time interval, is there a benefit to moving the last pivot any later than we have data instead of using the latest observation date as the end date?
Also, I didn't realize TreeTime already had `datetime_from_numeric` and `datestring_from_numeric`. We could just use the latter instead of Augur's `float_to_datestring`.
Thanks, John. And you are right, the `end_date` solves this problem in most cases. But it also means that our monthly pivots are not aligned with the beginning or end of a month. If that's ok, we might just do month = 30 days and week = 7 days.
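As a sketch of what those fixed-size intervals would look like with standard date math (illustrative only, not code from this PR):

```python
from datetime import date, timedelta

# Treat a "month" as 30 days and a "week" as 7 days: float-friendly,
# but no longer aligned with calendar month boundaries.
print(date(2023, 3, 6) - timedelta(days=30))  # 2023-02-04, close to the 1/12-year float result above
print(date(2023, 3, 6) - timedelta(days=7))   # 2023-02-27
```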
Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data. They would be too heavily influenced by fluctuations in that case. This is less of an issue for the diffusion frequencies.
> But it also means that our monthly pivots are not aligned with the beginning or end of a month.
That alignment to the beginning of a month was the original goal of the pandas-based implementation, but it is a pretty arbitrary goal. Even representing intervals by month feels arbitrary when "month" doesn't have a fixed value. The simpler approach of using weeks is appealing without needing to cast "month" to a fixed number of days. I would be happy to switch flu pivots to every 4 weeks, for instance.
> Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data.
This also makes a lot of sense. Estimating KDE frequencies at a date beyond the most recent data is an error in the opposite direction of the original bug addressed in this PR. By that logic, we should change the following logic from:
```python
pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.ceil(np.max(observations) / pivot_frequency) * pivot_frequency
```
to always use a min/max date based on the earliest and latest sample, instead of expanding the range of pivots:
```python
pivot_start = start_date if start_date else np.min(observations)
pivot_end = end_date if end_date else np.max(observations)
```
Our tree can't plot any dates beyond the most recent tip, so it makes sense that frequencies shouldn't plot beyond the most recent tip either. However, the code I propose above doesn't account for our most common use case, where we do provide an "end date" that is usually today even though the most recent data were not collected today. Providing an "end date" only makes sense in the context of retrospective work like the flu forecasting project, where we want to estimate frequencies in well-defined windows across existing data.
Providing an earlier "start date" makes sense, though, since it allows the pivots to include all available data. The code above could produce pivots that don't include the earliest observation, depending on the pivot interval and end date. So, we probably want something like the following, which keeps a start date earlier than the minimum observation but uses the maximum observation as the end date:
```python
pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.max(observations)
```
The existing changes in this PR will ensure that the end date is always the last pivot.
There is more work we could do in this area, but I think we should leave treatment of observed data (without start and end dates) to another PR. This PR fixes a bug that immediately affects our flu work, so I'd like to merge and release it soon. I documented this conversation as a new issue.
Corrects existing tests of pivot logic to check the specific dates in the monthly tests and to use the correct number of expected weekly pivots from a 1-year interval (53 instead of 52). Adds a test of realistic start and end dates for monthly pivots that we know fails under the current implementation, as revealed by a crash in a recent seasonal flu build.
Fixes the previous, flawed pivot logic that allowed pivots to _not_ include the end date by a substantial margin (for example: given an end date of January 6, 2023 and a 3-month interval, pivots ended at November 1, 2022). The new implementation uses a third-party date library, dateutil, that supports month-based time deltas in addition to the week-based deltas supported by most other Python date libraries. This new implementation ensures that the end date always appears as the last entry in the pivot outputs (as we always wanted it to), works backward through time to create pivots from that end date minus the given time delta interval, and stops when the next pivot occurs before the start date. With respect to the start date, this behavior is similar to the prior implementation, which never guaranteed that the start date would be included as a pivot unless the pivot interval allowed it. A key technical detail of this new implementation is that the internal and intermediate pivot datetime instances are converted to floating point values with the `numeric_date` function instead of the `timestamp_to_float` function. The first of these calculates decimal values from the fraction of days in a year, while the second calculates decimal values from the fraction of months in a year. The day-based fractions from `numeric_date` are already used throughout Augur and TreeTime, while `timestamp_to_float` has only been used for frequency and distance calculations. Since other functions like `float_to_datestring` also interpret decimals as fractions of days in a year, the change to pivot logic here makes the output of `get_pivots` more consistent with the rest of Augur's date ecosystem.
Corrects unit tests that previously only passed because they anticipated rounding errors when converting between floating point dates that interpret decimal values as fractions of days in a year vs. fractions of months in a year.
Updates the specific values of expected frequencies for functional tests, since the corrected pivot calculation logic produces new pivots and correspondingly new frequency values per pivot. One nice side effect of the new pivot calculations is that they produce the correct number of pivots in our functional test of relative min and max date values that had previously failed in a flaky manner. The corrected pivot logic reveals that the "flaky" behavior of that test was actually the correct behavior: we should always get 3 pivots when the max date is 6 months ago, the min date is 12 months ago, and the pivot interval is 3 months. That we previously only got 2 pivots from these relative dates most times of the year reflects the invalid pivot calculation logic from a different perspective. This commit reactivates that previously commented-out test. Fixes #963
Thanks to @victorlin's improvements to `numeric_date`, we can pass a date object directly without first converting it to a date string.
Description of proposed changes
Background
Augur estimates frequencies of clades and other tip-based attributes by creating a grid of time points (internally called "pivots") and evaluating the frequency of clades, etc. at each time point in the grid. The logic to implement pivots has been historically complicated by the tension between representing dates as floating point values and wanting to support meaningful human dates. The current implementation of this logic in `get_pivots` uses pandas' `date_range` function to generate monthly or weekly intervals between a given start and end date. I selected this pandas function to calculate pivots because it appeared to implement the human-meaningful monthly interval logic we wanted for monthly pivots.
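For example, a sketch of the general pandas approach (the exact arguments used in `get_pivots` may differ):

```python
import pandas as pd

# Monthly pivots between a start and end date, snapped to month starts.
# Note that the end date itself (2023-01-06) is not included.
print(pd.date_range("2022-10-01", "2023-01-06", freq="MS").tolist())
# [Timestamp('2022-10-01 00:00:00'), Timestamp('2022-11-01 00:00:00'),
#  Timestamp('2022-12-01 00:00:00'), Timestamp('2023-01-01 00:00:00')]
```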
Details
Unfortunately, the current `get_pivots` logic does not behave exactly the way we want it to. Although it supports monthly and weekly pivots between a given start and end date, it can produce pivots such that the final pivot is substantially earlier than the given end date. Since the "end date" in an analysis is typically the current date, such that the final pivot tells us the frequency of clades at the current time, the earlier placement of the final pivot causes us to ignore the most recent tips in an analysis and underestimate clade frequencies. This bug emerged from debugging a separate augur frequencies error in seasonal flu builds.

This PR fixes the previous, flawed pivot logic that allowed pivots to not include the end date by a substantial margin (for example: given an end date of January 6, 2023 and a 3-month interval, pivots ended at November 1, 2022). The new implementation uses dateutil, an existing transitive dependency of Augur's that supports month-based time deltas in addition to the week-based deltas supported by most other Python date libraries. This new implementation ensures that the end date always appears as the last entry in the pivot outputs (as we always wanted it to), works backward through time to create pivots from that end date minus the given time delta interval, and stops when the next pivot occurs before the start date. With respect to the start date, this behavior is similar to the prior implementation, which never guaranteed that the start date would be included as a pivot unless the pivot interval allowed it.
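In outline, the backward-walking logic looks something like this (a simplified sketch with illustrative names, not the actual `get_pivots` implementation, which also converts between dates and floating point values):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def backward_pivots(start, end, months):
    """Walk backward from the end date so it is always the last pivot."""
    pivots = []
    pivot = end
    while pivot >= start:
        pivots.append(pivot)
        pivot -= relativedelta(months=months)
    return sorted(pivots)

# With an end date of January 6, 2023 and a 3-month interval, the end date
# is now the last pivot instead of a date weeks or months earlier.
print(backward_pivots(date(2022, 1, 6), date(2023, 1, 6), 3))
# [datetime.date(2022, 1, 6), ..., datetime.date(2022, 10, 6), datetime.date(2023, 1, 6)]
```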
A key technical detail of this new implementation is that the internal and intermediate pivot datetime instances are converted to floating point values with the `numeric_date` function instead of the `timestamp_to_float` function. The first of these calculates decimal values from the fraction of days in a year, while the second calculates decimal values from the fraction of months in a year. The day-based fractions from `numeric_date` are already used throughout Augur and TreeTime, while `timestamp_to_float` has only been used for frequency and distance calculations. Since other functions like `float_to_datestring` also interpret decimals as fractions of days in a year, the change to pivot logic here makes the output of `get_pivots` more consistent with the rest of Augur's date ecosystem.
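To make the difference concrete, here are simplified stand-ins for the two conventions (the real `numeric_date` and `timestamp_to_float` handle more edge cases than these hypothetical helpers):

```python
from datetime import date

def day_fraction(d):
    # numeric_date-style: decimal part is the fraction of days elapsed in the year.
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    return d.year + (d - start).days / days_in_year

def month_fraction(d):
    # timestamp_to_float-style: decimal part is the fraction of months elapsed.
    return d.year + (d.month - 1) / 12

d = date(2023, 3, 6)
print(round(day_fraction(d), 3))    # 2023.175 (64 of 365 days elapsed)
print(round(month_fraction(d), 3))  # 2023.167 (2 of 12 months elapsed)
```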
Considerations for datetime libraries to use for month-based time delta logic
The logic we need for month-based pivots requires a datetime library that supports month-based time deltas. Most Python datetime libraries do not support month deltas, since the underlying logic is complicated and messy compared to other standard delta offsets like hour, day, week, or year.
Libraries that do not support month-based time deltas include:
Libraries that do support month-based time deltas include:
- isodate's `parse_duration` function. For example, we can run `date(2023, 1, 6) - isodate.parse_duration("P3M")` to get three months prior to Jan 6. Thanks to @victorlin for pointing this one out!
- dateutil's `relativedelta` function. For example, we can run `date(2023, 1, 6) - relativedelta(months=3)` to get the date three months prior to Jan 6, 2023. This library is stable, well-maintained, and an existing dependency of Augur's through pandas.
- Pendulum's `pendulum.duration` function. For example, we can run `pendulum.parse("2023-01-06") - pendulum.duration(months=3)` to get the date three months prior to Jan 6, 2023. Note that Pendulum development has stalled a bit in the last couple of years and its creator is looking for maintainers.
- Arrow's `shift` method on date instances. For example, we can run `arrow.get("2023-01-06").shift(months=-3)` to get the same date as the Pendulum call above. Arrow development similarly stalled for a while and was looking for maintainers, although they now appear to have at least 2 regular maintainers.

All of these libraries produce the same results for month- and week-based pivots. All of these libraries also have a large number of users and relatively stable APIs. Pendulum users have complained about breaking changes in the 2.X releases and new bugs from that series that have not been resolved, although there has been more movement in the last year to step up maintenance.
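As a quick consistency check of the two candidates most relevant to Augur (assuming both packages are installed):

```python
from datetime import date

import isodate
from dateutil.relativedelta import relativedelta

# Both month-aware deltas land on the same calendar date.
assert date(2023, 1, 6) - isodate.parse_duration("P3M") == date(2022, 10, 6)
assert date(2023, 1, 6) - relativedelta(months=3) == date(2022, 10, 6)
```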
dateutil was initially the most appealing choice, since it produces the exact output we want, interacts with and produces native Python date objects, and is an existing dependency of Augur via pandas.
As Victor pointed out in review, `isodate` is even better since it is already an explicit dependency and also produces the same output as the dateutil method. Using `isodate` here also paves the way to expose an ISO-8601 duration interface for the pivot calculations, so users could define arbitrary pivot length and units in a single argument (e.g., daily pivots with `--pivot-interval "P1D"`).
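For instance, isodate already parses the full range of ISO-8601 durations, so a duration-based flag could map directly onto it (the `--pivot-interval "P1D"` interface is hypothetical at this point):

```python
import isodate

print(isodate.parse_duration("P1D"))  # a timedelta of 1 day
print(isodate.parse_duration("P2W"))  # a timedelta of 14 days
print(isodate.parse_duration("P3M"))  # an isodate Duration of 3 months
```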
This PR changes the behavior of an entire subcommand (augur frequencies) and defines a new explicit dependency from a transitive dependency, so we will need to release this as part of a major release to reflect these changes.
Impacts on scientific results
The inaccurate pivot logic for augur frequencies has the potential to impact scientific results that depend on these frequencies when estimated at monthly intervals. The biggest impact would be for estimates made at the end of a given month. For example, estimates at a 1-month interval on the last day of the month would produce a last pivot at the beginning of the current month. In the worst case where the interval is higher like a 3-month interval, the last pivot could be three months before the end date.
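A sketch of that worst case using pandas month-start ranges (illustrative of the failure mode, not the exact call the old `get_pivots` made):

```python
import pandas as pd

# With a 3-month interval and an end date at the end of March, the last
# month-start pivot lags the end date by a full quarter.
print(pd.date_range("2022-01-01", "2023-03-31", freq="3MS")[-1])
# Timestamp('2023-01-01 00:00:00')
```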
The most extensive use of these monthly frequencies is in our seasonal influenza analyses including:
The estimates affected above are clearly not ideal in that the last two cases do not always reflect all data that were available at the time the estimates were made. However, in the case of seasonal influenza genomes, the average delay between sample collection and submission to GISAID is about 2 months. Given this delay, the 1-month interval estimates that only reflect data up to the beginning of the current month are minimally affected by the issue described in this PR.
The worst case of those described above is for the VCM report on September 2022 that failed to reflect data collected after August 1. Inspection of GISAID data collected between August 1, 2022 and September 18, 2022 and submitted before or on September 18, 2022 shows that our forecasts missed 162 records. By comparison, another 455 records from this same collection time period were submitted after the forecasts were made (in other words, only ~25% of the data that would become available for that time period was available when we made our forecasts). Of these 162 records, 110 (68%) were collected in Spain and 142 (88%) were collected in Europe. A substantial portion of these would have been eliminated by our regional subsampling (an approach to minimize regional and temporal sequencing bias in our phylogenies). However, 77 of these missing sequences map to the emerging 2a.2/50K clade, indicating that we potentially underestimated the frequency of this growing clade at the time of forecasts.
Our other prominent use of augur frequencies is in the ncov workflow. Since we use 1-week intervals to estimate SARS-CoV-2 frequencies, these analyses are not affected by the issue in this PR.
Related issue(s)
Fixes #963
Testing
Checklist