Use of "where" in cell_methods #173
Comments
Hi Karl, Interesting examples! In general, I think that non-standardised comments are all we currently have at our disposal for cases such as these, but I may have missed a trick. In that light, here are some suggestions for your four cases (but I make no claim that these are the best options): In case 1, as far as I understand it, whether or not a calculation (such as a mean) was weighted is, by default, unspecified. Even though using "where" might suggest that each contributing element represents a different area, this is true in general for cells taken in their entirety. So to indicate that weights were used we would indeed need In case 2, I think that "where" refers to a portion of the grid cell defined by the "name", and "name" has to refer to spatially defined cells (because the only valid "where"s are area_types). So I wonder if the best we can currently do is another comment: In case 3, similarly to 2: In case 4, The obvious question is "Do we want/need a more standardised feature to express information about weights?". Thinking ... |
Thanks @davidhassell . I think your parenthetical descriptions make it clear how the means have been calculated except in case 4. To specify how the time-means are weighted (as you did in distinguishing case 3 from case 2), I think one would need: Regarding your last remark, it might be worth thinking about adding a construct similar to but more specific than "where areatype". Perhaps something like:
for example for a variable E (and notation as in #173 (comment) Not all standard_names that would be commonly needed exist (e.g.,
where
For the last option listed (stand-alone The following examples illustrate the variety of weighting accommodated by this more general approach:
The above examples are different weightings used in producing CMIP6 variables, but the cell_methods assigned to those variables do not, in some cases, adequately indicate the weighting. So there is a real need to do something. Whether it needs to be done in a standardized way or through a parenthetical comment is what we should first decide. |
Hi Karl, I agree that for case 4, On the broader question, I like your ideas, and wonder if it would be good to not roll the "where <type>" into the weights description, i.e.
where <weight_type> is a CV of If where is also set then it would act as a modifier to <weight_type>. E.g. for This seems to fit in with existing use quite nicely, and removes the need for new standard names. E.g.:
|
My first reaction is that your approach, @davidhassell, is better. Thanks for thinking of it. Will give it some more thought and consider the implications for the "over type2" modifier that might also be included in a cell_methods. |
Dear Karl @taylor13 and @davidhassell Thanks for these questions and the discussion. Are these all actual use-cases, or is this anticipating a need? I would like to suggest that we already have the syntax for these cases, if we clarify or generalise the interpretation a bit. The text of section 7.3.3 on "Statistics applying to portions of cells" says
This syntax with "over" is thus a generalisation of "mean where type", which could also be expressed as "mean where type over type". It is calculated by summing over the type portion of the cell and dividing by the area of the type portion. Perhaps we ought to have said "integrating" rather than "summing". It must mean "integrating" because nothing else would make sense. If we are going to sum quantity X over an area and divide by an area, and we want the quotient to have the units of X, the "sum" must have units of X times units of area i.e. it's an area-integral. Hence, I infer that "mean where type" is the area-weighted mean over type. I believe that is what we had in mind and how it's been interpreted up to now, but if I'm right it could be clarified in the text. Therefore
is "area: mean where sea_ice". This is consistent with the text above, except that the text speaks of "cells". We ought to rephrase it somehow e.g. with "region", in case the area-mean is aggregating more than one cell, as in Karl's use-case. For case 2, the current text of 7.3.3 is too restrictive at the outset in saying "the statistical method indicated by The remainder of 7.3.3 talks only about
is "area: mean where sea-ice time: mean where sea-ice". First we compute the quantity in the sea-ice portion of the cell, which I suppose might give missing data when there is no sea ice in the cell, then we compute the time-mean of the epochs when there is sea ice present. We can express both
as "area: time: mean where sea-ice". The difference between this and the previous case is that the mean is done over both dimensions at once. We compute the double integral ∫∫ X H(X) dA dt over area and time, and divide it by ∫∫ H(X) dA dt, where H(X) is the function that is 1 if the type exists and 0 if it does not. The numerator has units of X times square metres times seconds, the denominator has units of square metres times seconds, so the quotient has units of X as required. With the double integral, the time-epochs are weighted according to the area of sea-ice at each time, instead of being equally weighted in the time-mean. Case 4 is just the same. Karl's use-case is for Best wishes Jonathan |
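The difference between the sequential form ("area: mean where sea-ice time: mean where sea-ice") and the joint form ("area: time: mean where sea-ice") described above can be illustrated numerically. The following is only a sketch: the toy values, the variable names and the assumption of equal-length time steps are hypothetical and are not taken from CMIP6 or from the CF text.

```python
# Hypothetical toy data: 2 grid cells, 3 equally long time samples.
area = [1.0, 3.0]                                   # cell areas A_i
ice_frac = [[0.5, 0.0], [0.5, 1.0], [0.0, 1.0]]     # sea-ice fraction s[n][i]
age = [[2.0, 0.0], [4.0, 1.0], [0.0, 5.0]]          # X[n][i]; placeholder 0.0 where s == 0


def area_mean_where_ice(n):
    """Area mean over the sea-ice portion at time n, or None if there is no ice."""
    weights = [s * a for s, a in zip(ice_frac[n], area)]
    total = sum(weights)
    if total == 0.0:
        return None  # treated as missing data
    return sum(w * x for w, x in zip(weights, age[n])) / total


def sequential_mean():
    """Area mean at each time, then an equally weighted mean of the times with ice."""
    per_time = [area_mean_where_ice(n) for n in range(len(age))]
    present = [m for m in per_time if m is not None]
    return sum(present) / len(present)


def joint_mean():
    """One double sum over area and time, so each time is weighted by its ice area."""
    num = 0.0
    den = 0.0
    for s_row, x_row in zip(ice_frac, age):
        for s, a, x in zip(s_row, area, x_row):
            num += s * a * x
            den += s * a
    return num / den


print(sequential_mean())
print(joint_mean())
```

With these toy numbers the sequential mean is about 2.81 and the joint mean is 3.0, because the joint form gives more weight to the times with a larger sea-ice area.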
Dear Jonathan, This is very interesting! This "... where ... over ..." text has, I presume erroneously, been formatted as an example description rather than main-body text since CF-1.6 - and that's the excuse I'm giving for the fact that I don't recall reading it :) Do you agree that it should be re-instated? If so, I'll raise it over at https://github.com/cf-convention/cf-conventions/issues.
Works for me.
I agree
I agree
Works for me.
Works for me. |
Dear @davidhassell Yes, I agree with you that the text after Ex 7.7 should be "unindented". It is main text, not part of the example. I hadn't noticed. That is a defect which we should correct. The other points on which we agree are perhaps enhancements, or arguably also defects because the intent of the convention is not clear. Best wishes Jonathan |
Simply being more explicit (and eliminating misinterpretation) as to what the "where" and "over" directives mean, and generalizing them to cover non-spatial dimensions, may be all that is necessary. First, to answer some questions raised: Raising this issue was motivated by re-examining the CMIP6 output specifications, so it is an existing "use-case". It seemed to me that someone preparing CMIP6 output must have had trouble deciding exactly how to compute reported values with the current guidance provided by the CF standards document. I think the cases originally enumerated above, if clarified, would make interpretation straightforward for CMIP6 output (perhaps with a few exceptions). Note that it is not only the variable "age of sea ice" that needs to be clarified in CMIP6; it was simply used as an example. Jonathan asked: "For a quantity like that [i.e., one that is only defined where the area_type exists], I am not sure we really need "where sea-ice" in case 1, do we?" Yes, I think that if we make clear how one should calculate the statistic in this case, then the "where" might in some cases become unnecessary. Consider a calculation of the mean_age_of_snow (F) on sea ice when cell_methods is specified as "where sea_ice". Isn't there a danger that a user would calculate this as:
mean = sum(over cells) [ s_i * A_i * F_i ] / sum(over cells) [ s_i * A_i ]
What is wanted is:
mean = sum(over cells) [ s_i * A_i * H(s_i) * F_i ] / sum(over cells) [ s_i * A_i * H(s_i) ]
I think this should be made clear in the standards document. |
Dear Karl @taylor13 Yes, I agree, this should be clarified. The danger you mention arises because the data-writer is unclear what to assume for a quantity which is only defined for a certain area type in those areas where it's not defined. The data-writer might simply omit them from the mean, as if they were missing data, which is what you want. On the other hand, they would get an underestimate if they assume a value of zero. For age of snow on sea ice I don't think it would make sense to assume zero where there is no sea ice, but it might be done. For depth of snow on sea ice it would arguably be reasonable to assume zero where no sea ice. Certainly we need to be clearer about this. We could insert a clarification as a new paragraph before Example 7.7. For instance,
Would that be sufficient and clear? Cheers Jonathan |
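The underestimate mentioned above, which arises when undefined values are assumed to be zero rather than omitted, can be shown with a small sketch. The numbers and function names here are hypothetical; `None` stands in for "not defined because there is no sea ice in the cell", and the weights are taken to be plain cell areas for simplicity.

```python
# Hypothetical toy data: three cells, the middle one has no sea ice.
areas = [2.0, 1.0, 1.0]           # cell areas used as weights
snow_age = [10.0, None, 6.0]      # None where the quantity is undefined


def mean_omitting_undefined(values, weights):
    """Weighted mean that skips undefined cells, as if they were missing data."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


def mean_assuming_zero(values, weights):
    """Weighted mean that treats undefined cells as contributing a value of zero."""
    filled = [0.0 if v is None else v for v in values]
    return sum(v * w for v, w in zip(filled, weights)) / sum(weights)


omitted = mean_omitting_undefined(snow_age, areas)
zero_filled = mean_assuming_zero(snow_age, areas)
print(omitted, zero_filled)
```

In this example the zero-filled mean (6.5) is lower than the mean that omits the ice-free cell (about 8.67), which is the ambiguity a clarifying paragraph would need to remove.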
Dear Jonathan, Yes, I think the suggested text would be very helpful, and the recommendation should be followed by anyone who wants to guard against data being misinterpreted. I remembered why I thought we might need to add a new qualifier ("weighted_by") to the cell_methods: to distinguish between datasets already written where the weighting may be ambiguous and datasets that will be written under the new, more explicit rules we're now considering for cell_methods. I think in the past, a mean, for example, could have been written with each mean computed from equally weighted samples (rather than weighted by area, as we now propose should be done). A data user won't know (without looking at the conventions attribute, if one is provided) whether "area: mean" implies unambiguously "weighted by area" or not, even though under our present proposal it should by default mean "area-weighted". Moreover, what if under the new scheme we don't want samples to be area-weighted? Consider a very sparse observational network used to sample some quantity like precipitation rate, where the measurements are known to be statistically independent. Suppose these measurements are reported on a grid (of cells of unequal area). To estimate the mean value for the region, one would likely simply weight each sample equally, without regard to the area of the cell. I think the current wording of the convention would permit this, and a careful data writer would include a cell_methods = "area: mean (with each observational site weighted equally)". This would differ from a mean computed from a full-coverage simulated field of precipitation, where the mean might more accurately be calculated with cell_methods = "area: mean", which under the new rules would be unambiguously interpreted as an area-weighted mean. Does the "clarification of cell_methods" we're discussing provide for a mean that is not area-weighted?
There are other weightings possible (such as those described in #173 (comment)). In particular, suppose we have a 3-d field reported on an atmospheric grid with altitude as the vertical coordinate. How should a mass-weighted mean be indicated by the cell_methods? For example, in computing the mean water vapor mixing ratio, each sample should be weighted by the mass of air in the cell. Should this be indicated in a parenthetical statement or should we indicate it in a more standard way? cheers, |
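To make the point about alternative weightings concrete, here is a minimal sketch of one mean computed with equal weights, with area weights and with mass weights. The helper name `weighted_mean` and all the numbers are hypothetical; the sketch only shows that the same samples give different "means" depending on a choice that a bare "area: mean" does not record.

```python
def weighted_mean(values, weights=None):
    """Mean of values; every sample gets the same weight when weights is None."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical samples on two cells of unequal size.
samples = [1.0, 3.0]
cell_area = [1.0, 3.0]     # assumed cell areas
air_mass = [4.0, 1.0]      # assumed mass of air in each cell

equal = weighted_mean(samples)               # each sample weighted equally
by_area = weighted_mean(samples, cell_area)  # area-weighted mean
by_mass = weighted_mean(samples, air_mass)   # mass-weighted mean
print(equal, by_area, by_mass)
```

The three results differ (2.0, 2.5 and 1.4 for these toy values), so a data user needs either a stated default or an explicit indication of which weighting was applied.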
By the way, I support extending "where" to mean "where or when". |
Dear Karl I agree that it is not clear in cell methods whether weighting of any kind has been applied. This doesn't apply just to means and areal statistics, but is a general point. At the moment, weighting is mentioned only in passing, in sect 7.3.2, where we say, "For instance, an area-weighted mean over latitude could be indicated as I don't think that we should introduce any new assumption about weighting, but maybe we should make a statement about it near the start of 7.3, and refer to 7.3.2 for the syntax of recording a comment about the weighting. We could recommend that such a comment is included if it might be important information for the user of the data. What guidance would you give? Best wishes Jonathan |
Hello, I find it confusing that it is unspecified whether or not area-weighting was applied for I also support extending "where" to mean "where or when". |
I think you have got it right, and I agree it's confusing. I believe this reflects the unstated assumption we've always made that means are area-weighted, which we ought to state. Nonetheless alternatives are possible. For example, if several grid-cells are included in the region, |
What are the correct cell_methods specifications for the following four cases for characterizing the "age_of_sea_ice" [Let E represent age, A the grid cell area, and s the fraction of the area covered by sea ice. Let i be the grid-cell index and n be the time-sample index for N samples.]:
sum(over cells) [ s_i * A_i * E_i ] / sum(over cells) [ s_i * A_i ]
Should cell_methods be "area: mean where sea ice"? Is the weighting by sea-ice area (s_i * A_i) assumed or does a comment need to be included?
We want to compute a time-mean of the sea ice age in a single grid cell, weighted equally across all time-samples:
sum(over time samples) [delta_n * E_n ] / sum(over time samples) [delta_n]
where delta_n is a function set to 1 if s_n > 0 and set to 0 if s_n=0.
Should cell_methods be "time: mean where sea ice"? How should omission of the sea-ice free samples be indicated? Is the "where" directive reserved for use only for spatial dimensions?
sum(over time samples) [ s_n * E_n ] / sum(over time samples) [ s_n ]
Should cell_methods be "time: mean where sea ice"? Is the weighting by sea-ice area (s_n) assumed or does a comment need to be included? Is the "where" directive reserved for use only for spatial dimensions?
sum (over i & n) [ s_i,n * A_i * E_i,n ] / sum (over i & n) [ s_i,n * A_i ]
Should cell_methods be "area: time: mean where sea ice" or "area: mean where sea ice time: mean" or "area: mean where sea ice time: mean where sea ice" or "area: mean time: mean where sea ice" or what? Is the weighting by sea-ice area (s_i,n * A_i) assumed or does a comment need to be included?
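The four formulas above can be written out as a short sketch, which may help when comparing candidate cell_methods strings against what was actually computed. The function names and toy values are hypothetical, time samples are assumed to be equally long, and E is given a placeholder of 0.0 wherever there is no sea ice (a real dataset would more likely carry missing values there).

```python
def case1(s, A, E):
    """One time, many cells: sum_i s_i*A_i*E_i / sum_i s_i*A_i."""
    return sum(si * ai * ei for si, ai, ei in zip(s, A, E)) / sum(
        si * ai for si, ai in zip(s, A))


def case2(s_t, E_t):
    """One cell, many times: ice-free samples dropped, the rest weighted equally."""
    delta = [1.0 if sn > 0 else 0.0 for sn in s_t]
    return sum(d * en for d, en in zip(delta, E_t)) / sum(delta)


def case3(s_t, E_t):
    """One cell, many times: each sample weighted by its sea-ice fraction s_n."""
    return sum(sn * en for sn, en in zip(s_t, E_t)) / sum(s_t)


def case4(s, A, E):
    """Many cells and times: double sum weighted by s[n][i] * A[i]."""
    num = sum(sni * ai * eni
              for s_row, e_row in zip(s, E)
              for sni, ai, eni in zip(s_row, A, e_row))
    den = sum(sni * ai for s_row in s for sni, ai in zip(s_row, A))
    return num / den


# Hypothetical toy data.
A = [1.0, 3.0]
r1 = case1([0.5, 1.0], A, [4.0, 1.0])
r2 = case2([0.5, 0.0, 1.0], [2.0, 0.0, 5.0])
r3 = case3([0.5, 0.0, 1.0], [2.0, 0.0, 5.0])
r4 = case4([[0.5, 0.0], [0.5, 1.0], [0.0, 1.0]], A,
           [[2.0, 0.0], [4.0, 1.0], [0.0, 5.0]])
print(r1, r2, r3, r4)
```

Cases 2 and 3 use the same single-cell time series but give different results (3.5 and 4.0 here), which is why the weighting needs to be recoverable from the metadata.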