Introducing a CF domain variable #301

Closed
davidhassell opened this issue Sep 22, 2020 · 38 comments · Fixed by #302

@davidhassell
Contributor

davidhassell commented Sep 22, 2020

Introducing a CF domain variable

Moderator

@dblodgett-usgs

Moderator Status Review [last updated: 2020-10-15]

The proposal has been submitted and preliminarily reviewed by the moderator. Attention should be called to the potential for this proposal to subtly but fundamentally alter how CF-NetCDF data fields and domains are treated. Review from authors of CF-NetCDF client software is necessary here.

  1. By and large, the community is supportive of the proposal.
  2. There has been discussion of how to identify a domain variable: by cf_role: domain or by the presence of a dimensions: "X Y Z ..." attribute. Presence of a dimensions attribute has won out for its lack of redundancy.
  3. The title of section 5 will now be: "Coordinate systems and domain"
  4. There is some nuanced discussion of domain constructs for scalar (single-valued dimensionless / degenerate) coordinate variables. No major issues have been noted.
  5. There is discussion of multiple domain variables for a single domain. No major issues have been noted.
  6. The idea of adding a domain: domain_variable attribute on a data variable was suggested. Adding it would introduce redundancy and seems to be the wrong path.

As of 10-15-2020, discussion is slow but ongoing. I will check back in around the beginning of November.

Requirement Summary

The concept of a domain that describes data locations and cell properties is not currently mentioned in the CF conventions, because it does not correspond to any single entity in the netCDF file. Instead, the domain is stored implicitly in a number of other variables and attributes that are linked to the data variable in various ways defined by the conventions.

The domain is, however, well defined in the CF data model as an abstract concept (as opposed to a data model construct) that provides the linkage between the field construct and the metadata constructs that describe the relevant data locations and cell properties. There is currently no "domain construct" in the data model, since there is no corresponding CF-netCDF entity.

There is a need to be able to describe a domain independently of any data variables, which is currently not possible. Use cases include:

  • Curated data streaming services for which it is impractical to send very large domain descriptions with every file.

  • Storing time-dependent coordinates from remote sensing applications.

  • Storing geometries without any timeseries data.

For such use cases, it is not satisfactory to try to locate an appropriate multidimensional data variable that describes the required domain, nor to create a dummy data variable for this purpose, which has no physical meaning.

Therefore, the inclusion of CF-netCDF domain variables that can encode a domain independently of any data, and a corresponding data model domain construct, will enhance CF by meeting these use cases.

Technical Proposal Summary

NetCDF encoding

A new "domain variable" will be introduced that is of arbitrary type since it contains no data. This variable will act as a container to bind together other variables that collectively define a domain, in a similar manner to how a data variable performs the same task.

It will support the same CF attributes as are allowed on the data variable for describing a domain, with exactly the same meanings and syntaxes: cell_measures, coordinates, geometry, and grid_mapping. These will be indicated as domain variable attributes by the additional "Do" indicator (short for Domain) in the "Use" column of Appendix A: Attributes.

Any future CF attributes that a data variable may use to describe its domain will be similarly transferred to the domain variable, meaning that keeping the domain variable up to date with other enhancements will be a well defined and easy task.

There is no mechanism for referencing a domain variable from a data variable, i.e. a data variable must still encode its domain in the current, implicit manner. This is to preserve backwards compatibility with all existing software libraries that understand the current structure of a data variable; and to reduce redundancy or incompatibility issues that may arise if a data variable encodes its own domain and references a domain variable.

A domain variable may exist in a file with or without other data variables.
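
As a rough illustration of the point that the domain-describing attributes mean the same thing on both kinds of variable, the sketch below filters a variable's attributes down to the four named above. Variables are modelled as plain Python dictionaries; the helper name `domain_attributes` and the sample attribute values are illustrative assumptions, not part of the proposal or of any library.

```python
# Attributes the proposal lists as describing a domain on either
# a data variable or a domain variable.
DOMAIN_ATTRIBUTES = ("cell_measures", "coordinates", "geometry", "grid_mapping")


def domain_attributes(attrs):
    """Return the subset of `attrs` that describes the domain.

    `attrs` is a plain dict of attribute name -> value, standing in for
    the attributes of a netCDF variable (hypothetical helper).
    """
    return {name: attrs[name] for name in DOMAIN_ATTRIBUTES if name in attrs}


# Hypothetical attribute sets for a data variable and a domain variable
# that describe the same domain. Only the attributes named above are
# shown on the domain variable.
data_var_attrs = {
    "standard_name": "air_temperature",
    "units": "K",
    "coordinates": "lat lon",
    "grid_mapping": "crs",
}
domain_var_attrs = {
    "coordinates": "lat lon",
    "grid_mapping": "crs",
}

# The same filter works on both, and the results can be compared directly.
same_domain_metadata = (
    domain_attributes(data_var_attrs) == domain_attributes(domain_var_attrs)
)
print(same_domain_metadata)
```

Because the filter is driven by a single tuple of attribute names, a future attribute that describes the domain would only need to be added in one place, which mirrors the maintenance argument made above.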

Data model

The domain in the data model will be transformed from an abstract concept into a "top-level" construct, i.e. one that can exist in the absence of any other constructs. Currently, the field construct (corresponding to a CF-netCDF data variable) is the only top-level construct.

The new domain construct will replace the current domain concept, replicating it in every way except that it will be related to the field construct via an aggregation relationship, rather than by the current composition relationship of the abstract domain concept. This makes it clear that the domain construct can exist independently of the field construct.

It is of no consequence to the data model that a CF-netCDF data variable will not be able to explicitly reference a CF-netCDF domain variable. That is an encoding choice that does not affect the logical structure.

Location in the conventions document

  • The domain variable will be described in a new section: 5.8 Domain Variables

  • The following appendices will be updated:

  • Appendix A: Attributes

  • Appendix I: The CF data model

  • CF Conformance Requirements and Recommendations

Benefits

All those who have the use cases described in the Requirement Summary will benefit from the new domain variable.

Status Quo

At present, a domain can only be encoded implicitly via a data variable, leading to ambiguities when retrieving a domain from a dataset.

Associated pull request

#302

Detailed Proposal

Conventions text has been proposed for chapter 5, appendices A and I, and the conformance document in pull request #302.

@davidhassell davidhassell added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Sep 22, 2020
@dblodgett-usgs
Contributor

@davidhassell -- I'm in support of this in concept and would be willing to moderate the discussion. I will review the PR in detail soon.

Others, please review. Comments on detailed aspects of the PR can be in line, but please put all general discussion here.

@davidhassell
Contributor Author

@dblodgett-usgs Thanks for moderating

@dblodgett-usgs
Contributor

@davidhassell -- I left a couple comments on your PR #302 to seed some further discussion.

See #302 (comment)

We need to be very aware that this change will loosen / modify the field-variable-centric nature of CF. I've always seen this as a central tenet of CF that was both annoying and super useful. This addition will make a lot of confusing things possible. Not really an argument for or against, but something to keep in mind.


The other thing is that I'm a little confused about how a domain variable works when it doesn't reference any dimensions. @davidhassell points to scalar coordinate variables as a case where this is valid. This does make sense but calls out the need for inclusion of examples showing how this would work.

A coordinates attribute would be required in the case that any coordinate variables that make up the domain are scalar coordinates.


Those are two general comments for people to consider.

I do want to pose a potential change for consideration.

Considering that this domain concept is new, we have an opportunity to require some things. I think it would be worth requiring all coordinate variables be declared in coordinates whether they use the dimension name or not. If there are other metadata that would otherwise be inferred, I think they should be required in this new domain variable construct. I'm thinking about this as a developer who wants a well-described domain declaration that doesn't require any special knowledge or inference to fully construct the spatio-temporal domain.

Regards, Dave

@davidhassell
Contributor Author

Hi Dave,

Thanks for the comments. I very much appreciate your having taken the time to look over it.

This addition will make a lot of confusing things possible.

This is a good point, and we should be sure that the motives for introducing it are valid. It would be good to hear from some of the people whose use cases I ever so briefly mentioned above, to provide a better picture of why this domain variable will be worthwhile. My personal interest is just in helping CF along - so I'm not really qualified to speak on their behalf. For visibility, I'll flag @ajelenak @AndersMS @erget @oceandatalab @dblodgett-usgs who may be able to help with this (thanks!).

Considering that this domain concept is new, we have an opportunity to require some things. I think it would be worth requiring all coordinate variables be declared in coordinates whether they use the dimension name or not. If there are other metadata that would otherwise be inferred, I think they should be required in this new domain variable construct. I'm thinking about this as a developer who wants a well-described domain declaration that doesn't require any special knowledge or inference to fully construct the spatio-temporal domain.

I wholly appreciate this stance, and indeed originally had that in mind. However, I came round to think that, as far as possible, the mechanics of the new domain variable should be identical to the equivalent mechanics of a data variable, e.g. that coordinate variables may be omitted from, or included in, the coordinates attribute.

This

  1. makes it easier to describe in the conventions and
  2. ensures the maximum consistency between the two ways of describing a domain (implicitly on a data variable and explicitly on a domain variable)

These points remove the need for duplicating parts of the conventions with partial modifications and reduce the possibility of misunderstandings between two almost identical, but not quite the same, encodings.

The second point also makes it much easier for developers who already deal with data variables for extracting domains. This is because they already have the machinery for decoding the domain from a data variable. I can say from experience that this is the case, having just today implemented the reading of the proposed domain variable in a branch of the cfdm library. This only needed ~30 lines of new code to modify the existing read-a-data-variable function to be able to read data variables or domain variables. This worked so easily because the attributes are parsed and processed identically in both cases.
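
A minimal sketch of why the shared attribute semantics keep reader code small. This is not the cfdm implementation, only an illustration with plain dictionaries: the function name `read_domain`, the dictionary layout, and the sample variables are all assumptions, and the `dimensions` attribute is the one proposed in PR #302.

```python
DOMAIN_ATTRIBUTES = ("cell_measures", "coordinates", "geometry", "grid_mapping")


def read_domain(var):
    """Extract a domain description from a data or domain variable.

    `var` is a dict with two keys (an assumed, simplified layout):
      "dims":  tuple of the variable's own netCDF dimension names
      "attrs": dict of attribute name -> string value
    """
    attrs = var["attrs"]
    if "dimensions" in attrs:
        # Domain variable: the domain's dimensions are named in the
        # blank-separated `dimensions` attribute.
        dims = tuple(attrs["dimensions"].split())
    else:
        # Data variable: the domain's dimensions are the variable's own.
        dims = tuple(var["dims"])
    # From here on, both kinds of variable are handled identically.
    described = {k: attrs[k] for k in DOMAIN_ATTRIBUTES if k in attrs}
    return {"dimensions": dims, **described}


# Hypothetical data variable and domain variable describing the same domain.
data_var = {
    "dims": ("time", "lat", "lon"),
    "attrs": {"units": "K", "coordinates": "height", "grid_mapping": "crs"},
}
domain_var = {
    "dims": (),
    "attrs": {
        "dimensions": "time lat lon",
        "coordinates": "height",
        "grid_mapping": "crs",
    },
}

print(read_domain(data_var) == read_domain(domain_var))
```

The only branch specific to domain variables is the one that decides where the dimension names come from; everything after it is shared, which is the property the comment above relies on.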

Remember that compression (DSG ragged arrays, gathering) also has to be considered. By stating that "things are the same as for a data variable" we get compression "for free", in terms of documentation, on the domain variable.

This does make sense but calls out the need for inclusion of examples showing how this would work.

Agreed. There is already an example showing a scalar coordinate variable, but not yet one without any named dimensions. The more examples the better, I think.

@erget erget linked a pull request Sep 24, 2020 that will close this issue
4 tasks
@JonathanGregory
Contributor

Dear David

Thanks for this proposal, which I support in its current form, with some minor points:

  • Maybe Section 5 should be renamed "Coordinate systems and domain" to recognise this new construct.
  • I think the presence of a dimensions attribute can be taken as defining the variable as a domain variable rather than a data variable - is that right? If so, and if it's not stated, I think it should be. Maybe also a variable should not be allowed to have both dimensions and a dimensions attribute, to avoid confusion.
  • A scalar data variable has only one data value. A domain variable also has one data value (you can't have a netCDF variable with no data values). Is there really a need to allow domain variables for scalar domains, with the possibly surprising empty dimensions attribute?
  • You say, "It is of arbitrary type since it contains no data." I think it would be clearer to say e.g., "The variable should be a scalar (i.e. it has no dimensions) of arbitrary type, and the value of its single element is immaterial."
  • The conformance document would be more future-proof if you didn't explicitly list the attributes which aren't recommended, and refer instead to Appendix A.
  • I find the sentence describing these attributes rather hard to understand. It says

It is recommended that a domain variable does not have any other attributes that are also used to directly describe data values, defined in [attribute-appendix] as those attributes that are used for non-coordinate data which also do not have domain variable nor global use.

for which I would suggest

It is recommended that a domain variable does not have any of the attributes marked in Appendix A as applicable to data variables except those which are also marked as applicable to domain variables.

  • Typo in "blank separated list of the dimensions names".

Jonathan

@erget
Member

erget commented Sep 24, 2020

Dear @davidhassell , I also support this proposal. This will be of benefit to remote sensing users and would ensure that we can implement the ongoing work discussed in cf-convention/discuss#37 in a way that is compatible with existing processing systems.

@davidhassell
Contributor Author

davidhassell commented Sep 24, 2020

@JonathanGregory Thanks for your thoughtful comments. Responses inline ...

Maybe Section 5 should be renamed "Coordinate systems and domain" to recognise this new construct.

Good idea

I think the presence of a dimensions attribute can be taken as defining the variable as a domain variable rather than a data variable - is that right?

That's right.

If so, and if it's not stated, I think it should be. Maybe also a variable should not be allowed to have both dimensions and a dimensions attribute, to avoid confusion.

It is already stated that "The presence of a dimensions attribute will identify the variable as a domain variable" (https://github.com/cf-convention/cf-conventions/pull/302/files#diff-0eab4e85fe4c323f70ce4bce0229dbe6R782-R783). It is, however, quite far down that paragraph. It may be better to promote the statement to the first sentence and strengthen it a bit, i.e.

The dimensions of the domain must be stored with the **`dimensions`**
attribute, and the presence of a **`dimensions`** attribute on a scalar variable will identify the
variable as a domain variable.

I like the idea of being clearer that if a scalar variable has the dimensions attribute then it has to be a domain variable. This is slightly different to disallowing the attribute on non-scalar variables.

You say, "It is of arbitrary type since it contains no data." I think it would be clearer to say e.g., "The variable should be a scalar (i.e. it has no dimensions) of arbitrary type, and the value of its single element is immaterial."

That's better. (Aside: We should also update the text for other similar containers. I cut and pasted my text from grid mappings.)

The conformance document would be more future-proof if you didn't explicitly list the attributes which aren't recommended, and refer instead to Appendix A.

OK

I find the sentence describing these attributes rather hard to understand. It says

I see what you mean. I like your text better. We should also change Appendix A from saying "D for variables containing non-coordinate data" to "D for data variables", then.

A scalar data variable has only one data value. A domain variable also has one data value (you can't have a netCDF variable with no data values). Is there really a need to allow domain variables for scalar domains, with the possibly surprising empty dimensions attribute?

I can only argue by counter example, here. Consider the domain of:

dimensions:
variables:
    double x ;
        x:standard_name = "global_average_sea_level_change" ;
        x:coordinates = "time" ;
        x:units = "rod" ;
    double time ;
        time:units = "days since 2020-09-24" ;

It would be:

dimensions:
variables:
    char domain ;
        domain:dimensions = "" ;
        domain:coordinates = "time" ;
    double time ;
        time:units = "days since 2020-09-24" ;
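
One practical consequence of the empty `dimensions` attribute in the example above is that software should test for the presence of the attribute, not for a non-empty value. A small sketch, with hypothetical helper names and with variables modelled as plain attribute dictionaries:

```python
def is_domain_variable(attrs):
    """A variable is a domain variable if it HAS a `dimensions` attribute,
    even when that attribute is the empty string (a scalar domain)."""
    return "dimensions" in attrs


def domain_dimensions(attrs):
    """Split the blank-separated `dimensions` attribute into names."""
    return attrs["dimensions"].split()


# Attributes of the scalar domain variable from the example above.
scalar_domain = {"dimensions": "", "coordinates": "time"}
# Attributes of the scalar data variable from the example above.
scalar_data = {
    "standard_name": "global_average_sea_level_change",
    "coordinates": "time",
    "units": "rod",
}

print(is_domain_variable(scalar_domain))  # True
print(domain_dimensions(scalar_domain))   # []
print(is_domain_variable(scalar_data))    # False

# A truthiness test would misclassify the scalar domain, because the
# empty string is falsy in Python.
print(bool(scalar_domain.get("dimensions")))  # False
```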

@davidhassell
Contributor Author

... the presence of a dimensions attribute on a scalar variable will identify the
variable as a domain variable.

There is some confusion here ... Are we saying that a domain variable must be a scalar? We don't insist on that for grid mapping and geometry variables (although it is recommended for grid mapping variables). If we say that for a domain variable, we should say the same for grid mapping and geometry variables - which I think would be a defect change (on the grounds that these variables were never intended to contain meaningful data arrays).

Either way, this brings me back to @JonathanGregory's suggestion of disallowing dimensions as a data variable attribute, which I now think is OK (for CF versions starting at the one in which the domain variable goes in). So that first sentence could become

The dimensions of the domain must be stored with the **`dimensions`**
attribute, and the presence of a **`dimensions`** attribute identifies the
variable as a domain variable. Therefore the **`dimensions`** attribute must
not be present on any variables that are to be interpreted as data variables.

The phrase "variables that are to be interpreted as data variables" means that variables that are referenced by data variables (such as auxiliary coordinate variables) may indeed have a dimensions attribute, but in that case such a variable cannot also be used as an independent data variable (see, for example, the section "Interpreting CF-netCDF files" in https://doi.org/10.5194/gmd-10-4619-2017).

@AndersMS
Contributor

Dear @davidhassell ,
I also support this proposal. This will address some of the use cases discussed in cf-convention/discuss#37, Standard way to define subsampled coordinates, including the need for tools to be able to pre-process coordinates in a meaningful manner, without needing to access the data variables.

@AndersMS
Contributor

@davidhassell

A couple of comments on the proposed text:

Possibly it could be made clearer in the text that multiple domain variables may exist in a file. The text uses the plural form in a few places, like in the heading 5.8 Domain Variables, but mostly uses the singular form. The current CF convention document uses plural in most places when describing variables, attributes and dimensions.

Also, do any particular restrictions apply when having multiple domain variables? I would assume that for a particular domain, only one domain variable is permitted?

@JonathanGregory
Contributor

Dear @davidhassell
You're right that we don't require container variables to be scalar, and I suppose we don't need to, but I assume that they would normally be scalar, since any data they contain is a waste of space. So "the presence of a dimensions attribute on a variable" indicates it's a domain variable. Thanks. My suggestion for disallowing domain variables for scalar domains is that the main purpose of the domain variable is to declare the domain without having to include any data. Since a scalar variable contains no more data anyway than a domain variable, that argument doesn't apply.
I can't see a need for more than one domain variable to describe any domain, but it seems harmless. A domain is defined by the things which are attached to a domain variable. Hence two domain variables describing the same domain would by definition have equal contents. That is redundant but doesn't cause a problem, just as one might have a copy of a data variable.
Cheers
Jonathan

@davidhassell
Contributor Author

Dear @AndersMS

Thanks (and to @erget) for the reference to the subsampled coordinates issue.

Possibly it could be made clearer in the text that multiple domain variables may exist in a file. The text uses the plural form in a few places, like in the heading 5.8 Domain Variables, but mostly uses the singular form. The current CF convention document uses plural in most places when describing variables, attributes and dimensions.

That would be fine. The last sentence in the new text could be changed to read:

"Multiple domain variables may exist in a file, with or without other data
variables. Note that the data variable attributes describing its
domain can not be replaced by a reference to a domain variable."

Also, do any particular restrictions apply when having multiple domain variables? I would assume that for a particular domain, only one domain variable is permitted?

As @JonathanGregory says, this is no problem. I don't think that this needs special mention, as this has always been true of all types of variables.

@JonathanGregory and all - I'll update the PR next week for the various new bits of text.

@AndersMS
Contributor

Hi @davidhassell and @JonathanGregory,

As @JonathanGregory says, this is no problem. I don't think that this needs special mention, as this has always been true of all types of variables.

From a user point of view and for the ease of discovering the domains, it would appear attractive if:

  • there is only one domain variable per domain
  • if domain variables are used in the file, all domains must have a domain variable

It would also better support the use case stated by @oceandatalab for accessing coordinate variables without accessing data variables, under "Standard way to define subsampled coordinates" #37.

For other data variables, we do need to permit multiple instances for things to work, and cannot exclude that some of these are copies of each other.

If there is no similar need for permitting copies of domain variables, I guess it would be cleaner and more beneficial not to permit copies.

@davidhassell
Contributor Author

Hi @AndersMS,

I don't see the use-case for restricting a dataset to have at most one domain variable, as datasets already can contain multiple implicit domains defined by data variables, so I think it makes sense to mirror that situation.

I am open to saying that a file must contain either data variables or domain variables, but never both. What do others think?

@AndersMS
Contributor

AndersMS commented Sep 28, 2020

Hi @davidhassell,

It was not my intention to propose only one domain variable per file, but to suggest one domain variable per domain (so no copies) :-)

So if

  • there is only one domain variable per domain

and

  • if domain variables are used in the file, then all domains must have a domain variable

then a user can easily search a file for all domain variables and the resulting list of domains will be complete and without copies. It would just appear convenient for discoverability.

@davidhassell
Contributor Author

Hi @AndersMS,

OK, I see. However, I don't see how we can enforce that any two domain variables in a dataset refer to different domains. It may be desired and of note to have multiple domains, some of which happen to be equal. We would also have to define "equal", which is another problem ...

I think that this comes down to a user community choice - for example, a project could insist that, for its outputs, a dataset containing domains must contain only one. This would be similar to, say, the CMIP project which favours only one data variable per dataset - a local restriction that goes beyond CF but is useful to its users.

davidhassell added a commit to davidhassell/cf-conventions that referenced this issue Sep 29, 2020
@davidhassell
Contributor Author

Should we instead identify a domain variable by having a cf_role attribute, with value "domain"? This would remove any ambiguity about what is, or isn't, a domain variable.

@JonathanGregory
Copy link
Contributor

It seems to me that the presence of a dimensions attribute is a sufficient and clear indication that it's a domain variable. Using cf_role as well would introduce redundancy and hence probable inconsistency.

@davidhassell
Contributor Author

Thanks, @JonathanGregory.

OK - that's fine by me. It's a good point about redundancy.

@AndersMS
Contributor

AndersMS commented Oct 7, 2020

Hi @davidhassell

I don't see how we can enforce that any two domain variables in a dataset refer to different domains. It may be desired and of note to have multiple domains, some of which happen to be equal.

Thank you for the reply, I agree that it is better to keep that flexibility and withdraw my proposal regarding a single domain variable per domain.

@davidhassell
Contributor Author

Hello, I wrote before:

I am open to saying that a file must contain either data variables or domain variables, but never both. What do others think?

I have since learned that there is neither a current desire nor use-case for restricting a mixture of domain and data variables, so I withdraw the suggestion.

@oceandatalab
Contributor

oceandatalab commented Oct 14, 2020

We are also in support of @davidhassell 's proposal.

Several field variables may share the same domain (output parameters computed on the same grid for numerical model simulations, measurements and derived geophysical data acquired by an instrument in remote sensing, etc.) but the current conventions define the domain only as an abstract concept which is implemented with attributes on the field variables: in order to identify the domains available in a file (one of our use cases), you have to analyze all the field variables available in the file, parse their attributes to extract domain-related information and then compare the extracted domains to remove duplicates. So listing domains is possible today but it sure is more involved than it should be.

Materializing the domain variables as proposed here would make this process a lot easier and probably result in a clearer description of the data.
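
The difference between the two discovery procedures can be sketched as follows. This is only an illustration: variables are plain dictionaries, the helper and variable names are hypothetical, only data variables and domain variables are included in the sample dataset, and "equal domains" is simplified to equal dimensions plus equal domain-describing attribute strings, which is a much cruder notion of equality than a real client would need.

```python
DOMAIN_ATTRIBUTES = ("cell_measures", "coordinates", "geometry", "grid_mapping")


def implicit_domain(var):
    """Comparable summary of the domain implied by a data variable."""
    attrs = var["attrs"]
    described = tuple((k, attrs[k]) for k in DOMAIN_ATTRIBUTES if k in attrs)
    return (tuple(var["dims"]), described)


def list_implicit_domains(variables):
    """Without domain variables: inspect every data variable, extract its
    domain-related attributes, and drop duplicate domains."""
    seen = []
    for var in variables.values():
        if "dimensions" in var["attrs"]:
            continue  # a domain variable, not a data variable
        key = implicit_domain(var)
        if key not in seen:
            seen.append(key)
    return seen


def list_domain_variables(variables):
    """With domain variables: select those with a `dimensions` attribute."""
    return [name for name, var in variables.items() if "dimensions" in var["attrs"]]


# Hypothetical dataset: two data variables sharing one domain, plus a
# domain variable describing that domain explicitly.
variables = {
    "sst": {
        "dims": ("time", "track", "scan"),
        "attrs": {"units": "K", "coordinates": "lat lon"},
    },
    "wind_speed": {
        "dims": ("time", "track", "scan"),
        "attrs": {"units": "m s-1", "coordinates": "lat lon"},
    },
    "swath": {
        "dims": (),
        "attrs": {"dimensions": "time track scan", "coordinates": "lat lon"},
    },
}

print(len(list_implicit_domains(variables)))  # 1
print(list_domain_variables(variables))       # ['swath']
```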

In the changes proposed in #302, it is stated that:

The constructs contained by the field and domain constructs cannot exist independently, with the exception of the domain construct itself that may be part of a field construct or exist on its own

and

In CF-netCDF, domain information is stored either implicitly via data variable attributes (such as coordinates), or explicitly in a domain variable. In the latter case, the domain exists without reference to a data array.

Does it mean that when a domain construct is part of a field construct it has to be stored exclusively via attributes (as it is done with the current conventions) or is it possible to also have a reference to a domain variable?

Something like the following pseudo-CDL:

float some_field1(time, track, scan) ;
    some_field1:coordinates = "time lat lon" ;
    some_field1:domain = "some_domain" ;  // suggested reference to the domain variable

float some_field2(time, track, scan) ;
    some_field2:coordinates = "time lat lon" ;
    some_field2:domain = "some_domain" ;

int some_domain ;
    some_domain:dimensions = "time track scan" ;
    some_domain:coordinates = "time lat lon" ;

// Attributes for these variables have been omitted for conciseness
float lon(track, scan) ;
float lat(track, scan) ;
double time(time) ;

It would still be compatible with existing software because domain information remains available as attributes on the field variables, but it would also clarify the relation between a domain variable and the field variables that use this domain.

@davidhassell
Contributor Author

Hi @oceandatalab,

Thank you for describing your use case that would benefit from a domain variable.

Does it mean that when a domain construct is part of a field construct it has to be stored exclusively via attributes (as it is done with the current conventions) or is it possible to also have a reference to a domain variable?

I am proposing that domain variable references should not be allowed from a data variable. This is to preserve backwards compatibility and to avoid redundancy (in the senses of design principles 10 and 6).

We must be careful not to confuse CF data model constructs with netCDF variables - the data model has been designed to be independent of the netCDF encoding. In the modified data model proposed here, a field construct may contain a domain construct, but that in no way forces the netCDF representation of the field construct to contain an explicit reference to a domain variable.

Thanks,
David

@oceandatalab
Contributor

Sorry for the confusion between field/data and construct/variable.

My question was about having a reference to the domain variable in addition to the attributes that already describe the domain (implicitly) on the data variable so I am not sure how it would break backwards compatibility.

I agree that it introduces some redundancy, but I would argue that is already the case when several data variables share the same domain and each of these data variables defines this very same domain implicitly with their attributes. Allowing a reference to the domain variable on data variables would add a way to check that the attributes on these data variables (that are meant to describe the same domain) are consistent with each other.

@davidhassell
Contributor Author

Hi @oceandatalab OK - we're in a slightly grey area here! This is where the design principles can really help.

Principle 6 says

"To avoid potential inconsistency within the metadata, the conventions should minimise redundancy."

and principle 10 says

"... there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one)."

So to minimise redundancy, we should not allow both a domain variable reference and the other data variable attributes to exist at the same time; and we shouldn't allow a domain variable reference anyway because we already have adequate (even if improvable) means of conveying the same information.

My original comments about backwards compatibility weren't strictly right, I realise. Allowing a domain variable reference instead of the usual data variable attributes would not be a CF backward compatibility issue (though it would be a little tough on software writers), but it would fall foul of principle 10.

Thanks,
David

@dblodgett-usgs
Contributor

Point of order, I updated the moderator comments in the description above.

@davidhassell
Contributor Author

Thanks for the summary, Dave.

@oceandatalab - are you OK with not allowing a domain variable reference from a data variable?

There hasn't been any comment on the changes to the text of the data model. It would be great if someone could review the suggested changes to appendix I in PR #302.

All the best,
David

@JonathanGregory
Contributor

If I'm reading the right thing, the text of Appendix I contains the statement "It is not a construct of the data model, but is an abstract concept that is useful for understanding it." That should be deleted now (since that's the whole point 😄 )

@davidhassell
Contributor Author

I'm not sure what's going on here, but were you reading the rich diff? That seems to be having intermittent difficulties in showing the modified image caption (where that text was deleted from), and also isn't showing the modified image. The side-by-side diff is OK, though, I think.

@JonathanGregory
Contributor

JonathanGregory commented Oct 20, 2020 via email

@oceandatalab
Contributor

@davidhassell Sorry for the delay, I just came back from vacation.

> My original comments about backwards compatibility weren't strictly right, I realise. Allowing a domain variable reference instead of the usual data variable attributes would not be a CF backward compatibility issue (though it would be a little tough on software writers), but it would fall foul of principle 10.

I think you meant that allowing a domain variable reference in addition to the usual data variable attributes would not be a CF backward compatibility issue, whereas replacing the usual attributes by a reference to a domain variable would break backward compatibility.

Allowing a domain variable reference from a data variable is not strictly necessary for our use case, so I do not consider this to be a blocking point. However, I still think it should be discussed because I am not sure principle 10 applies here:

> there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose

For me the reference to the domain variable does not serve the same purpose as the usual data variable attributes because this reference is meant to identify the domain uniquely, and this information is not provided by the usual attributes, so I would consider the reference as additional information, not a replacement/competitor.

Being able to clearly identify the domain of a data variable, and therefore the data variables that share a domain, is definitely an operation that could be made simpler and this goal could be achieved very easily by a domain variable reference. If the reference is an issue due to its nature, then one could simply replace it by a unique identifier string, but if domain variables are available then it would be a shame not to use them for that purpose too.

@JonathanGregory
Contributor

Dear @oceandatalab

I think that allowing data variables to refer to the domain with a single reference instead of providing the domain information by various references on the data variable would be a drastic change to the convention. Although not backwards incompatible in the sense that it wouldn't invalidate existing conventions or data, it would require all software to be rewritten to support this different method. I think that would be a bad decision. Allowing a domain reference in addition to the other means of describing the domain by the data variable would be redundant, and therefore potentially inconsistent, which also doesn't sound good to me.

I understand your argument that you want to use the domain reference as a way to identify the domain uniquely, but I would argue that you can't really depend on that method. It will only work within a single file (within which one can depend on variable names as references) and netCDF datasets aren't necessarily contained in single files. Hence you still need to be able to decide whether domains are equal by inspecting the metadata and coordinates. You would have to be able to do that also if assembling a dataset from various sources.
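To make the point about deciding equality by inspection concrete, here is a minimal sketch of comparing domains by content rather than by variable name. It assumes each file has already been read into a plain dictionary of variables (the `dimensions`/`attrs`/`data` layout and the names `domain_signature` and `make_file` are hypothetical, chosen only for this illustration), and it looks only at dimensions, coordinate values, and the `grid_mapping` and `cell_measures` attribute strings.

```python
def domain_signature(data_var, variables):
    """Return a hashable summary of a data variable's domain, built from content."""
    attrs = data_var["attrs"]
    # Auxiliary coordinates are named in the 'coordinates' attribute;
    # coordinate variables share their name with one of the dimensions.
    coord_names = set(attrs.get("coordinates", "").split())
    coord_names.update(d for d in data_var["dimensions"] if d in variables)
    coords = tuple(
        (name, tuple(variables[name]["data"])) for name in sorted(coord_names)
    )
    return (
        tuple(data_var["dimensions"]),
        coords,
        attrs.get("grid_mapping"),
        attrs.get("cell_measures"),
    )


def make_file(data_name, lat_values):
    """Build a toy in-memory 'file' with one data variable on a lat/lon grid."""
    return {
        "lat": {"dimensions": ("lat",), "attrs": {}, "data": lat_values},
        "lon": {"dimensions": ("lon",), "attrs": {}, "data": [0.0, 5.0, 10.0]},
        data_name: {"dimensions": ("lat", "lon"), "attrs": {}, "data": []},
    }


file_a = make_file("tas", [10.0, 20.0])
file_b = make_file("pr", [10.0, 20.0])   # same grid, different file
file_c = make_file("pr", [10.0, 30.0])   # latitude values differ

same_ab = domain_signature(file_a["tas"], file_a) == domain_signature(file_b["pr"], file_b)
same_ac = domain_signature(file_a["tas"], file_a) == domain_signature(file_c["pr"], file_c)
print(same_ab, same_ac)
```

A real comparison would also need to inspect bounds, the contents of grid mapping variables, and floating-point tolerances; the sketch only shows that the decision rests on metadata and coordinate values, which works across files.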

Best wishes

Jonathan

@oceandatalab
Contributor

oceandatalab commented Oct 28, 2020

Hi @JonathanGregory

> I think that allowing data variables to refer to the domain with a single reference instead of providing the domain information by various references on the data variable would be a drastic change to the convention. Although not backwards incompatible in the sense that it wouldn't invalidate existing conventions or data, it would require all software to be rewritten to support this different method. I think that would be a bad decision.

I never suggested using a single reference instead of the usual data variable attributes. This is something that is only mentioned in #301 (comment) and I think it was just due to a misunderstanding or a typo. So we agree that breaking backward compatibility would be a bad idea.

> Allowing a domain reference in addition to the other means of describing the domain by the data variable would be redundant, and therefore potentially inconsistent, which also doesn't sound good to me.

Here we disagree:

  1. without this proposal, if there is a single domain shared by several data variables and each data variable describes this domain with the usual attributes, you already have redundancies as the single domain is described on each data variable, and there is no way to check that the domain description is consistent among these data variables so the inconsistency risk is quite high.

  2. with this proposal but without the domain variable reference that I mentioned, we gain the ability to access domain information directly (which is a very good thing) but we create an additional description of the domain, which could conflict with the description provided on data variables (that may already conflict with each other as seen in 1.). So the risk of inconsistency is slightly higher than in 1.

  3. with this proposal and with the domain variable reference, you still have all the redundant descriptions that were already there in 1. and 2. but now you have a tool that allows you to automatically detect consistency issues that arise from the pre-existing redundancy problem. So from my point of view you not only get a clearer description of the data but also a way to validate domain information across redundant definitions (which could be implemented or not in a software for automatic validation).
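The validation described in point 3 could be sketched as follows. This is only an illustration: the `domain` attribute on data variables is the hypothetical reference under discussion (it is not part of CF), variables are represented as plain dictionaries, and `check_domain_references` is a made-up helper name.

```python
# Attributes that contribute to the domain description on both kinds of variable.
DOMAIN_ATTRS = ("coordinates", "grid_mapping", "cell_measures")


def check_domain_references(variables):
    """Report data variables whose own domain description disagrees with the
    domain variable they point to via the hypothetical 'domain' attribute."""
    problems = []
    for name, var in variables.items():
        ref = var["attrs"].get("domain")  # hypothetical attribute, not in CF
        if ref is None:
            continue
        domain_attrs = variables[ref]["attrs"]
        # A domain variable lists its dimensions in a 'dimensions' attribute.
        if list(var["dimensions"]) != domain_attrs.get("dimensions", "").split():
            problems.append(f"{name}: dimensions differ from {ref}")
        for attr in DOMAIN_ATTRS:
            mine = sorted(var["attrs"].get(attr, "").split())
            theirs = sorted(domain_attrs.get(attr, "").split())
            if mine != theirs:
                problems.append(f"{name}: '{attr}' differs from {ref}")
    return problems


variables = {
    "grid_domain": {
        "dimensions": (),
        "attrs": {"dimensions": "time lat lon", "coordinates": "height"},
    },
    "tas": {
        "dimensions": ("time", "lat", "lon"),
        "attrs": {"domain": "grid_domain", "coordinates": "height"},
    },
    "pr": {
        "dimensions": ("time", "lat", "lon"),
        "attrs": {"domain": "grid_domain"},  # 'coordinates' was forgotten
    },
}

problems = check_domain_references(variables)
print(problems)
```

The check can only say that two descriptions disagree; it cannot say which one is right.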

> I understand your argument that you want to use the domain reference as a way to identify the domain uniquely, but I would argue that you can't really depend on that method. It will only work within a single file (within which one can depend on variable names as references) and netCDF datasets aren't necessarily contained in single files. Hence you still need to be able to decide whether domains are equal by inspecting the metadata and coordinates. You would have to be able to do that also if assembling a dataset from various sources.

I admit I have no experience with multi-file netCDF datasets so I may not fully grasp all the implications that adding the domain variable reference would have on this data structure. I quickly browsed the NcML documentation and it seems to allow the creation and modification of attributes on the variables of the multi-file dataset, so someone who wants to aggregate files from several sources could write an NcML file that correctly defines the domains and their references in the view offered by the multi-file dataset. But again, I have never worked with this kind of dataset so I may be completely wrong.

Cheers,

Sylvain

@davidhassell
Contributor Author

Hi Sylvain,

> My original comments about backwards compatibility weren't strictly right, I realise. Allowing a domain variable reference instead of the usual data variable attributes would not be a CF backward compatibility issue (though it would be a little tough on software writers), but it would fall foul of principle 10.

I did indeed mean "instead of" rather than "in addition to". Allowing a domain variable reference instead of the usual attributes would neither disallow the usual attributes, nor change their meaning, so no backwards incompatibility. This is similar to the grid_mapping extension that was introduced at CF-1.7. In this case the old single grid mapping case was still supported in the new version, but a new syntax was created for multiple grid mappings. This new syntax is not understood by software built on CF-1.6.
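As an illustration of the grid_mapping analogy, a reader that wants to accept both forms might look roughly like the sketch below. It assumes the extended form is a sequence of `name:` keys each followed by coordinate variable names (for example `"crsOSGB: x y crsWGS84: lat lon"`); `parse_grid_mapping` is a hypothetical helper, not part of any library.

```python
def parse_grid_mapping(value):
    """Map each grid mapping variable name to the coordinate names it is
    associated with. The single-name form yields an empty list."""
    tokens = value.split()
    if not tokens:
        raise ValueError("empty grid_mapping attribute")
    if not any(token.endswith(":") for token in tokens):
        # Original form: exactly one grid mapping variable name.
        if len(tokens) != 1:
            raise ValueError(f"unexpected grid_mapping value: {value!r}")
        return {tokens[0]: []}
    mappings = {}
    current = None
    for token in tokens:
        if token.endswith(":"):
            # Extended form: a key introduces a new grid mapping variable.
            current = token[:-1]
            mappings[current] = []
        elif current is None:
            raise ValueError(f"coordinate {token!r} appears before any key")
        else:
            mappings[current].append(token)
    return mappings


single = parse_grid_mapping("crs")
extended = parse_grid_mapping("crsOSGB: x y crsWGS84: lat lon")
print(single)
print(extended)
```

Software written only for the single-name form would treat the whole extended string as one variable name and fail to find it, which is the sense in which the new syntax is not understood by older software even though old files remain valid.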

> Being able to clearly identify the domain of a data variable, and therefore the data variables that share a domain, is definitely an operation that could be made simpler and this goal could be achieved very easily by a domain variable reference. If the reference is an issue due to its nature, then one could simply replace it by a unique identifier string, but if domain variables are available then it would be a shame not to use them for that purpose too.

We shouldn't allow a domain variable instead of the usual domain definition because a) there was no use case for it and b) because it would require all software to be rewritten to support this different method. Even though allowing this would make it easier, in limited circumstances, to see "by eye" if two data variables shared a domain, I don't think that is a use case on its own. These limited circumstances only arise when informally comparing multiple data variables with domain references within the same file (as opposed to the same dataset). Library software would not generally benefit from this as it has to store the constituent parts of the domain (cell measure, grid mappings, coordinates, etc) regardless of how it was encoded. If a stronger use case were to present itself in the future I would welcome this being reviewed, but suggest that for now we do not allow this.

With regards the pre-existing redundancy issue, data variables are essentially independent entities. Therefore there is no redundancy if, say, two data variables have the same coordinates attribute value. We have to trust dataset providers to produce the datasets that they intend, and that is made easier by not allowing the same information to be encoded twice for each data variable. If this were allowed, and the two methods were inconsistent, we have no way of knowing which is correct.

Anyway, I think (if I've read everything correctly) we are in agreement that a domain variable reference should be used neither in addition to nor instead of the usual domain definition (data variable coordinates attribute, etc). Which is good for the progress of this issue.

Thanks,
David

@oceandatalab
Contributor

Hi David,

> I did indeed mean "instead of" rather than "in addition to". Allowing a domain variable reference instead of the usual attributes would neither disallow the usual attributes, nor change their meaning, so no backwards incompatibility. This is similar to the grid_mapping extension that was introduced at CF-1.7. In this case the old single grid mapping case was still supported in the new version, but a new syntax was created for multiple grid mappings. This new syntax is not understood by software built on CF-1.6.

OK, it was confusing because no one had talked about using references to domain variables instead of the usual attributes before, so I thought you were replying to my "in addition" question.

> We shouldn't allow a domain variable instead of the usual domain definition because a) there was no use case for it and b) because it would require all software to be rewritten to support this different method.

Agreed.

> Even though allowing this would make it easier, in limited circumstances, to see "by eye" if two data variables shared a domain, I don't think that is a use case on its own. These limited circumstances only arise when informally comparing multiple data variables with domain references within the same file (as opposed to the same dataset). Library software would not generally benefit from this as it has to store the constituent parts of the domain (cell measure, grid mappings, coordinates, etc) regardless of how it was encoded. If a stronger use case were to present itself in the future I would welcome this being reviewed, but suggest that for now we do not allow this.
>
> With regards the pre-existing redundancy issue, data variables are essentially independent entities. Therefore there is no redundancy if, say, two data variables have the same coordinates attribute value.

I get what you mean, but independence achieved by denormalization introduces redundancy as soon as two entities have some elements in common, and therefore makes the data prone to inconsistency issues. Even if each data variable has its own domain instance (i.e. its own set of coordinates, grid_mapping, etc. attributes), if two or more data variables share the same domain (multiple parameters measured by the same instrument, for example) then these instances of the domain are redundant; I don't see how it could be otherwise.

> We have to trust dataset providers to produce the datasets that they intend, and that is made easier by not allowing the same information to be encoded twice for each data variable. If this were allowed, and the two methods were inconsistent, we have no way of knowing which is correct.

The idea is not to identify which definition is correct but to detect when two definitions of the same domain are incompatible or not as complete as they could be. The goal is to offer data producers a way to detect errors (multiple definitions of a single domain that are not compatible with each other) and consistency issues (when two variables share the same domain but one of them only provides a minimal definition while the other has a detailed description), and therefore the means to improve the overall quality of the files they generate before these files are distributed to end users.

But again, it was just a suggestion for a small improvement of the proposal, it is absolutely not a blocking point for us.

Cheers,

Sylvain

@davidhassell
Contributor Author

Thanks for all of the discussion. I understand (from these comments and off-line conversations) that there are no objections to the pull request as it stands. @dblodgett-usgs would you agree?

It would still be good to get some comment here on the data model changes.

Many thanks,
David

@dblodgett-usgs
Contributor

dblodgett-usgs commented Nov 16, 2020

I agree @davidhassell and I don't think the subsequent conversation warrants any further summary above. Thanks for the good conversation all.

@erget
Member

erget commented Nov 17, 2020

I've had a look at the latest draft and still support this proposal. The changes are mostly straightforward as they enshrine as a construct what was until now a concept that has served the community well. Thank you @davidhassell for the painstaking work here, I believe this will be a benefit to the community.

AndersMS added a commit to AndersMS/cf-conventions that referenced this issue Aug 3, 2021
* added example 6.1.2 to the list of examples; fixed cf-convention#284

* updated changes in history.adoc

* removed fourth lines of third table in sect 9.3.1; fixed cf-convention#288

* updated history

* Bring conformance doc in line with clarification to use of region names/area_types to allow use of flag_values and flag_meanings as per discussion in cf-convention#198

* Add support for variables of type string to conformance doc.  See issue cf-conventions#139

* Revert "Bring conformance doc in line with clarification to use of region names/area_types to allow use of flag_values and flag_meanings as per discussion in cf-convention#198"

This reverts commit f754457.

* first draft of section 5.8

* format typo

* rewording

* rewording

* rewording

* New 'Do' Use value, and 'dimensions' entry

* Domain construct

* rewording

* rewording

* rewording

* formatting of computed_standard_name entry

* rewording

* rewording

* rewording

* top-level

* rewording

* move fig 3

* rewording

* span

* rewording

* data

* rewording

* rewording

* rewording

* conformance

* recommended attributes

* typo

* dimensions

* dimensions

* format

* typo

* domain independence

* domain optional

* format

* format

* format

* format

* empty dimensions

* long_name

* UML

* Update ch01.adoc

* Update history.adoc

* Add static assets to HTML check build

* Add static assets to Travis upload job

* Fix order of i/j in lon/lat bnds figure
correct indices of neighbour cells in @d case

* update/correct order of indices i/j in Fig 2 (2D lon/lat bounds)
* update/correct order of indices i/j in caption of Fig 2
* rename "figure 1" to "figure 3" in Appendix i
* correct indices of neighbour cells in @d case
* update history

Figures are generated from:
https://github.com/neumannd/cell_bounds_figures_for_cf_conventions

* updates arising from cf-convention#301 up to 2020-09-28

* correct label for 1.2

* format correction

* reword empty dimensions example

* comma

* example links

* long_name

* formatting

* missing 'construct'

* term units

* term units

* standard names

* typo

* units conformance requirement

* remove requirement for identical units

* Copyedit

* fixed typos

* History

* more text following 2020-11-27 discussions

* bounds

* tidy

* tidy

* tidy

* tidy

* reproducability

* offset

* indices

* indices

* indices

* super

* tie_point_dimension (1)

* tie_point_dimension (2)

* tie_point_dimension (3)

* tie_point_dimension (4)

* tie point

* tie_point_dimension (5)

* corrected interpolation_configuration description

* zone/area rewording

* zone/area rewording

* multiple mappings

* multiple mappings

* multiple mappings

* typos and some minor rewording suggestions

* format

* spell check

* markup style

* example formatting

* example formatting

* example formatting

* example formatting

* minor typesetting

* interpolation_parameters

* interpolation parameters variable dimensions

* interpolation parameters variable dimensions

* non-standard provision

* interpolation parameters variable dimensions

* captions, cdl

* tidy

* minumum size of interpolation zones

* Appendix A attributes

* interpolation -> sampling

* Conformance - first draft

* 2nd draft: better descriptions of allowed dimensions

* typos

* Correct 'is list' to 'is a list'

* history cf-convention#304

* check on interpolation zone dimension size

* Clarification of the handling of leap seconds

This is the suggested initial wording from cf-convention#313 as authored by
@JonathanGregory.

* leap seconds: added the word "count" in some places

The purpose of this change is to slightly highlight the difference
between when seconds are used within the coordinate value for counting
and the seconds which are part of the date-time.

* leap seconds: minor wording extension

* leap seconds: added reference to cf-convention#313 to history.adoc

* add myself to the end of the list of additional authors

* leap seconds: updated conformance text

This change excludes values larger or equal to 60 for seconds in
reference date-times in time unit attributes.

Additionally, the reference time has been changed to reference
date-time to agree with the wording in the proposed conventions text.

* leap seconds: small rewording as discussed with @JonathanGregory

Reasoning: counting may be associated with integral numbers, which is
was not intended. We still like the idea of a little more separation
between seconds as a unit of the value and seconds as in the date-times.

* replace date-time with date/time

* conformance changes for new interpolation variable

* conformance changes for new interpolation variable

* conformance changes for new interpolation variable

* conformance changes for new interpolation variable

* appendix A changes for new interpolation variable

* appendix A changes for new interpolation variable

* lat lon tie point definition

* spelling

* URI -> URL

* lower resolution -> sampled

* Use on domain variable

* typo

* Move 'interpolation dimension' definition to first occurence

* Minor re-wording

* Fix cross-reference

* Re-wording

* typesetting

* tie point index re-wording

* Rotation of interpolation axes for two dimensional methods and mino corrections

* terminology: interpolation variable and tie point variable

* typo

* examples in toc

* Replace expression for gsqr with equivalent, but numerically more accurate version

* Update authors

* Update history

* Rename attribute tie_points to coordinate_interpolation (Change 2)

* Reword section Interpolation and Non-Interpolation Dimensions (Cahnge 10)

* Rename tie_point_dimensions attribute to tie_point_mapping (Change 2)

* Change term 'tie point variable' to 'tie point coordinate variable' (Change 4)

* Reword first paragraph of Section 8 (Change 6)

* Remove sentence "This form of compression may also be..." (Change 7)

* Update sentence: "A single interpolation dimension may be associated..." (Change 9)

* Reword section "Interpolation and non-interpolation dimension" (Change 10)

* Improve sentence "An interpolation zone must span at least two points..."  (Change 11)

* Correct sentence  "....must be a subset of zero or more of the ..." (Change 12)

* Reword text about the dimensions of interpolation parameter (Change 13)

* Improve sentence "The bounds of a tie point must be the same..." (Change 14)

* Reduce number of data variables in Example 8.5 (Change 16)

* Rename "terms to continuous area" and "interpolation subarea" (Change 5)

* Improve wording of "An interpolation subarea must span..." (Change 11)

* Remove paragraph "The same interpolation variable may be multiply mapped ...." no longer relevant

* Rename terms to: subsampled dimension, interpolated dimension and non-interpolated dimension

* Combine the tie_point_dimensions and tie_point_indices attributes (Change 1)

* Update figures to match new terms

* Improve description of non-overlapping interpolation subareas

* Improve description of non-overlapping interpolation subareas

* Update Example 8.6 to correctly specify one dimension interpolation for X and Y

* Improve wording of Tie Point Index Mapping (Change 8)

* Clarify interpolation subarea size

* Clarify dimensions in Figure 2

* Add new section 8.3.9, "Computational Precision"

* Combine the tie_point_dimensions and tie_point_indices attributes (Change 1)

* Remove paragraph "A single interpolated dimension may be associated with multiple  ...." no longer relevant

* Update ch08.adoc

Co-authored-by: David Hassell <[email protected]>

* Update ch08.adoc

Co-authored-by: David Hassell <[email protected]>

* Update ch08.adoc

Co-authored-by: David Hassell <[email protected]>

* Update ch08.adoc

Co-authored-by: David Hassell <[email protected]>

* Change sampl... to subsampl...

* Rewrite section Interpolation of Cell Boundaries (Change 15)

* Constrain interpolation parameters to support bounds interpolation

* Update <<link>> names and figure names to new terms

* Require tie points to be numeric type and have no missing values

* Update Appendix J with new terms and names

* Correct spelling mistake in Appendix J

* Correct numbering mistake in Appendix J

* Change "iz" (interpolation zone) to "is" (interpolation subarea) in App J (Change 3)

* Correct "target dimension" to "interpolated dimension" (Change 17)

* Introduce section numbering and remove table captions in Appendix J

* Include interpolation argument s in figure 1 and 2

* Move Figure 1 and 2 in Appendix J futher down

* State tht for linear interpolation, the coordinates of the interpolated points are evenly spaced.

* Change "equivalently" to "similarly" in explanation of s1 and s2 in App J

* Rename cofficeint "c" to "w" in Appendix J to avoid confusion with point C

* Move "Common Conversions and Formulas" in front of "Interpolation Methods" in Appendix J

* Add "s" to "each of the interpolated dimension" in Appendix J

* Minor wording improvements arising from review

* Conformance for bounds tie points

* computational_precision conformance

Co-authored-by: Daniel Neumann <[email protected]>
Co-authored-by: Rosalyn Hatcher <[email protected]>
Co-authored-by: JonathanGregory <[email protected]>
Co-authored-by: Daniel Lee <[email protected]>
Co-authored-by: Daniel Lee <[email protected]>
Co-authored-by: David Blodgett <[email protected]>
Co-authored-by: AndersMS <[email protected]>
Co-authored-by: Tobias Kölling <[email protected]>
Co-authored-by: Tobias Kölling <[email protected]>
6 participants