-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to subset cohort based on 'during target cohort' #175
Comments
Using your example from above, wouldn't this work whereby we'd want to find the subset cohort that starts on or after the target cohort start and also ends on after the target cohort start: .priorEpilepsy <- CohortGenerator::createCohortSubset(
name = "Prior Epilepsy",
cohortIds = 12403, #use epilepsy cohort
startWindow = CohortGenerator::createSubsetCohortWindow(
startDay = 0,
endDay = 9999,
targetAnchor = "cohortStart"
),
endWindow = CohortGenerator::createSubsetCohortWindow(
startDay = 0,
endDay = 9999,
targetAnchor = "cohortStart"
),
cohortCombinationOperator = "all",
negate = F
) Let me know if I've misunderstood the use-case here. |
No, your example is 'The target cohort starts anytime after the subset cohort starts and the target cohort starts anytime after subset cohort ends. So this would pass:
This is what we want to identify:
The statement: The subset starts between 0 and 999 days after Target start and the subset starts between -999 and 0 of the target end. The key is being able to make 2 statements about the subset cohort start, but the way the API works, your startWindow talks about the subset start date, and the endWindow talks about the subset end date. Of course this isn't possible because startWindow can only be specified once, but the point is I need to say that the subset cohort's start is after the target's start and before the target's end. (2 statements about subset cohort start date, not one statement about the start and one statement about the end). In the above example, it's a little confusing to say "Prior Epilepsy" but if we said 'Epilepsy During' it might make more sense: Epilepsy During means that you saw the start of the epilepsy after your target's start date, and you also saw the epilepsy before the target's end. Note, with the current API, you could do this:
But that would only match on this:
Where the subset starts after T, but also ends before T's end...but that's not what we really want here. That's because this would not match (from the first example):
|
Thanks @chrisknoll for this detail it helps to illustrate the problem. You are right that we cannot model the 'subset cohort starts during the target cohort' with the current API. We can do one of:
OR
But not both in the same cohort subset definition. Here is a reprex "sandbox" that shows the 2nd case here whereby the subset overlaps the target cohort end: connectionDetails <- Eunomia::getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
sql <- "
DELETE FROM main.cohort;
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 1, 1, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 1, 2, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 1, 3, strftime('%s', '2000-01-01'), strftime('%s', '2001-12-31');
-- Starts in target but ends after
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 2, 1, strftime('%s', '2001-01-01'), strftime('%s', '2002-12-31');
-- No overlap with target
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 2, 2, strftime('%s', '2003-01-01'), strftime('%s', '2004-12-31');
-- Subsumed by target
INSERT INTO main.cohort (COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE)
SELECT 2, 3, strftime('%s', '2000-02-01'), strftime('%s', '2001-01-01');
"
DatabaseConnector::executeSql(
connection = conn,
sql = sql
)
#> Executing SQL took 0.15 secs
sql <- "SELECT * FROM main.cohort ORDER BY subject_id, cohort_definition_id;"
rs <- DatabaseConnector::querySql(
connection = conn,
sql = sql
)
rs
#> COHORT_DEFINITION_ID SUBJECT_ID COHORT_START_DATE COHORT_END_DATE
#> 1 1 1 2000-01-01 2001-12-31
#> 2 2 1 2001-01-01 2002-12-31
#> 3 1 2 2000-01-01 2001-12-31
#> 4 2 2 2003-01-01 2004-12-31
#> 5 1 3 2000-01-01 2001-12-31
#> 6 2 3 2000-02-01 2001-01-01
cs <- CohortGenerator::createCohortSubsetDefinition(
name = "Test Cohort Subset",
definitionId = 1,
subsetOperators = list(
CohortGenerator::createCohortSubset(
name = "Overlap End Date",
cohortIds = 2,
startWindow = CohortGenerator::createSubsetCohortWindow(
startDay = -99999,
endDay = 0,
targetAnchor = "cohortEnd"
),
endWindow = CohortGenerator::createSubsetCohortWindow(
startDay = 0,
endDay = 99999,
targetAnchor = "cohortEnd"
),
cohortCombinationOperator = "all",
negate = F
)
)
)
sql <- cs$getSubsetQuery(targetOutputPair = c(1, 1001))
sql <- SqlRender::render(
sql = sql,
cohort_database_schema = "main",
cohort_table = "cohort")
sql <- SqlRender::translate(
sql = sql,
targetDialect = "sqlite"
)
DatabaseConnector::executeSql(
connection = conn,
sql = sql
)
#> Executing SQL took 0.02 secs
# Write SQL to file to view
# SqlRender::writeSql(
# sql = sql,
# targetFile = "subsetQuery.sql"
# )
sql <- "SELECT * FROM main.cohort ORDER BY subject_id, cohort_definition_id;"
rs <- DatabaseConnector::querySql(
connection = conn,
sql = sql
)
rs
#> COHORT_DEFINITION_ID SUBJECT_ID COHORT_START_DATE COHORT_END_DATE
#> 1 1 1 2000-01-01 2001-12-31
#> 2 1001 1 2000-01-01 2001-12-31
#> 3 2 1 2001-01-01 2002-12-31
#> 4 1 2 2000-01-01 2001-12-31
#> 5 2 2 2003-01-01 2004-12-31
#> 6 1 3 2000-01-01 2001-12-31
#> 7 2 3 2000-02-01 2001-01-01
DatabaseConnector::disconnect(conn) Created on 2024-08-06 with reprex v2.1.0 As shown above, We'd have to break the API to allow for specifying the subset start/end as part of this cohort subset definition, best I can tell, unless you have a suggestion on how to change this and maintain backwards compatibility? Tagging @azimov for awareness. |
Yes, in order to keep backwards compatabiilty, we need to leave the existing stuff alone and let new stuff come in without disturbing the existing. here's a thought: As part of the subset operator, we add a field called
So the list of elements in windows gets converted to the date diff + start/end date of the target and subset cohort. The chagne would be createSubsetCohortWindow should take an additional parameter for specifying the subsetCohort's anchor. I think this could be an easy modification where int he cases where you're reading the window from startWindow and endWindow, you force the 'anchor' to be cohrotStart or cohrotEnd (to keep the current behavior). Else just use the windows specified in the |
This didn't actually seem too hard to implement so I made a PR see #197 @chrisknoll I would greatly appreciate you input here as this is pretty much based off your specification. This should be 100% backwards compatible with old designs and maintains API consistency with a bit of a hack around "start window" and "end window" because the cohort window functions now need to store if they are based on the subset cohorts start or end. |
The current cohort subset logic provides a time window logic like:
This says the subset cohort starts within 365d of target cohort start date, and the subset cohort ends any time relative to the target's cohort start.
But how would we implment the logic: 'The subset cohort starts between 0 and 9999 days of the target cohort start , and the subset cohort starts between -9999 and 0 days of the target cohort end'?
The issue is that the start window talks about the subset cohort start date, the end window talks about the subset cohort end date, but how do you talk about the subset cohort start date twice (once about the cohort start and once about the cohort end).
The limitation seems to prevent us from doing any rule that says 'subset cohort starts during the target cohort'.
The text was updated successfully, but these errors were encountered: