Support altering the sites calibrated using EMOS #1706
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1706 +/- ##
==========================================
+ Coverage 98.14% 98.23% +0.08%
==========================================
Files 111 115 +4
Lines 10244 10764 +520
==========================================
+ Hits 10054 10574 +520
Misses 190 190
Continue to review full report at Codecov.
Force-pushed from 32cf229 to be684b7
It would be useful if someone from BoM could review the changes in dataframe_utilities.py to ensure that this is compatible with your work. Other changes in this PR (e.g. in calibration/utilities.py) don't require a BoM review, unless you'd like to.
The logic looks ok to me, and I think the changes are consistent with BoM's intended use of the station_id column. I have made a few small suggestions for code clarity and performance.
self.expected_period_forecast.data[:, :2, -1] = np.nan
self.expected_period_truth.data[:2, -1] = np.nan
Will modifying properties of self cause problems if the data is used in other tests (if not now, maybe for tests added in future)? Should this be a copy instead?
Edit: actually, after some googling I see that a new instance of the class is created for each test, so modifying its properties does not affect other tests. In that case, why do the tests make copies of the dataframe properties before modifying them? (e.g. in the line just below this, df = self.forecast_df.copy())
I think that the copying is out of an abundance of caution about tests conflicting. I think you're right that quite a lot of the copying isn't really necessary, so I've removed most of this in 0026589.
This looks ok to me, but I'm not an expert on test frameworks, so possibly there is some good reason for the copies that I'm not aware of. Would be good if another reviewer can comment.
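For reference, the per-test-instance behaviour the reviewers are discussing can be demonstrated with a minimal sketch (hypothetical class and attribute names, not IMPROVER code): unittest constructs a fresh TestCase instance for each test method, so attributes set in setUp are rebuilt before every test and mutations do not leak between tests.

```python
import unittest


class MutationDemo(unittest.TestCase):
    """Sketch: each test method runs on its own freshly constructed instance."""

    def setUp(self):
        # Rebuilt for every test, because unittest instantiates a new
        # MutationDemo object per test method before calling setUp.
        self.data = [1, 2, 3]

    def test_mutate_without_copy(self):
        # Mutating the attribute in place is safe here...
        self.data.append(4)
        self.assertEqual(self.data, [1, 2, 3, 4])

    def test_sees_fresh_state(self):
        # ...because this test gets a new instance with a fresh self.data.
        self.assertEqual(self.data, [1, 2, 3])
```

Running this with `python -m unittest` passes both tests regardless of execution order, which supports the conclusion that the defensive `.copy()` calls are not strictly required.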
Force-pushed from 985c4d2 to 338fd03
Force-pushed from 338fd03 to 0026589
Thanks a lot for the review comments @btrotta-bom. I think that I've now responded to these.
Thanks @gavinevans, changes look fine.
Force-pushed from 8333c63 to eb6ecf0
Ran unit and acceptance tests.
Just a couple of questions and comments.
Can't really contribute to the copy question as I follow the convention set in IMPROVER; maybe someone in the green team could help?
Thanks
Happy to approve.
Note: I have not addressed the unit test copy question.
Happy with the changes made in this PR; the tests run properly.
Thanks for your previous review on this PR, @btrotta-bom. I've recently pushed up a couple more commits to this PR following some further testing. These commits relate to handling the station_id column when new sites are added to the parquet files. Would you be able to, or interested in, reviewing these latest commits?
Looks fine to me, I made a couple of suggestions for clarity.
# Add wmo_id as a static column, if station ID is present in both the
# forecast and truth DataFrames.
static_cols.append("wmo_id")
elif ("station_id" in forecast_df.columns) and not include_station_id:
I think you can remove the second part of the condition and just use elif "station_id" in forecast_df.columns: (and similarly for truth_df below).
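As an illustration of why the clause is redundant, here is a simplified, hypothetical reconstruction of the branching (not the actual IMPROVER code): if the preceding branch already tests include_station_id, the elif can only be reached when include_station_id is False, so repeating "and not include_station_id" adds nothing.

```python
import pandas as pd


def choose_static_cols(forecast_df: pd.DataFrame, include_station_id: bool) -> list:
    """Hypothetical sketch of the static-column selection logic under review."""
    static_cols = []
    if include_station_id:
        # station_id requested: wmo_id is treated as a static column.
        static_cols.append("wmo_id")
    elif "station_id" in forecast_df.columns:
        # "and not include_station_id" would be redundant here: this branch
        # is only reachable when include_station_id is False.
        static_cols.append("station_id")
    return static_cols


df = pd.DataFrame({"wmo_id": ["03002"], "station_id": ["030020"]})
print(choose_static_cols(df, include_station_id=True))   # ['wmo_id']
print(choose_static_cols(df, include_station_id=False))  # ['station_id']
```

The two calls exercise both branches and behave identically with or without the extra clause.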
Thanks. Yes, you're correct that the second part of this condition isn't actually required here.
self.forecast_df_multi_station_id["station_id"] = (
    self.forecast_df_multi_station_id["wmo_id"] + "0"
)
It looks like this station_id column is added in all the tests that use self.forecast_df_multi_station_id, and it is always defined the same way, so maybe this could be done in the setup instead. (Similarly for self.truth_df_multi_station_id.)
I've centralised this as suggested.
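The suggestion amounts to something like the following sketch (hypothetical data and class names, assuming a standard unittest layout): the per-test column assignment moves into setUp so every test sees the same derived station_id column without repeating the definition.

```python
import unittest

import pandas as pd


class TestMultiStationId(unittest.TestCase):
    """Sketch: derive the station_id column once, in setUp."""

    def setUp(self):
        self.forecast_df_multi_station_id = pd.DataFrame(
            {"wmo_id": ["03002", "03005"]}
        )
        # Previously repeated in each test that used this dataframe;
        # defined once here instead, as suggested in review.
        self.forecast_df_multi_station_id["station_id"] = (
            self.forecast_df_multi_station_id["wmo_id"] + "0"
        )

    def test_station_id_derived(self):
        self.assertEqual(
            list(self.forecast_df_multi_station_id["station_id"]),
            ["030020", "030050"],
        )
```

Each test still receives a fresh dataframe because setUp runs on a new instance per test method, so centralising the assignment does not couple the tests together.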
Thanks for the review comments, @btrotta-bom. Could you take another look at this please?
Thanks @gavinevans, this looks fine.
* Initial commit to add code with the aim of handling instances where only a single forecast and truth exist for a site.
* Add functionality for checking whether there are sufficient historic forecast and truth data pairs for calibration with EMOS.
* Remove commented out code.
* Minor formatting edits.
* Add additional unit test.
* Use representation_type, rather than percentile.
* Modifications related to handling the possible presence of a station_id column when trying to add and remove new sites.
* Minor edits following review.
* Remove copy statements.
* Switch to raise a warning, rather than an error, if station_id is present only on one of the forecast or truth dataframes.
* Minor edits following review comments.
* Minor edit to ensure that the station_id column is filled correctly when filling missing entries.
* Modifications following further testing to support the presence of a station_id column when new sites are added.
* Minor edits following review.
* Add comments for clarification.
* Minor edits to comments.
Addresses https://github.com/metoppv/mo-blue-team/issues/239
Description
The aim of this PR is to support adding new sites for use in estimating EMOS coefficients for site calibration. The changes in this PR primarily consist of:
* A fill_missing_entries function to pad a dataframe, if a site is only present at some validity times. This then allows the construction of a cube with consistent dimensions at different validity times.
* A check_data_sufficiency function to consider whether there is sufficient valid data, rather than NaNs, following the padding.
Following #1698, this PR has been updated to support the possible presence of a station_id column. The primary amendments related to this are:
* A warning is raised if the station_id column is present only on one of the forecast or truth dataframes.
* Checks involving the station_id column have been moved into a single check to avoid any unforeseen circumstances from having different checks at different points in the code.
A few other minor changes have also been made, mainly to correct the ordering of the dimension coordinates within unit tests to match the expectation that the forecast representation coordinate will always be the first dimension coordinate.
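The padding idea described above can be sketched with plain pandas (a hypothetical stand-in for fill_missing_entries, not the IMPROVER implementation): reindex over the full cross-product of sites and validity times, so that every site appears at every time, with NaN filling the entries for which no data exists.

```python
import pandas as pd


def pad_missing_entries(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: ensure every (wmo_id, time) pair has a row."""
    full_index = pd.MultiIndex.from_product(
        [df["wmo_id"].unique(), df["time"].unique()],
        names=["wmo_id", "time"],
    )
    # Rows absent from the original dataframe become NaN-filled entries,
    # giving consistent dimensions at every validity time.
    return (
        df.set_index(["wmo_id", "time"])
        .reindex(full_index)
        .reset_index()
    )


# Site "03005" is missing at the second validity time.
df = pd.DataFrame(
    {
        "wmo_id": ["03002", "03002", "03005"],
        "time": ["2023-01-01", "2023-01-02", "2023-01-01"],
        "temperature": [280.1, 281.3, 279.5],
    }
)
padded = pad_missing_entries(df)
print(len(padded))  # 4 rows: 2 sites x 2 validity times
```

After this padding, a data-sufficiency check along the lines of check_data_sufficiency would then count the valid (non-NaN) entries per site rather than the raw row count.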
Testing: