Should cf_role be deprecated in favor of standard_name? #430

Closed
dblodgett-usgs opened this issue Feb 14, 2023 · 10 comments · Fixed by #434
Labels
change agreed: Issue accepted for inclusion in the next version and closed
defect: Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@dblodgett-usgs
Contributor

dblodgett-usgs commented Feb 14, 2023

Dear CF community,

Based on recent conversation and a number of experiences where this caused confusion, I wonder if it would be wise to deprecate the cf_role attribute in favor of extended standard_name attributes? Scanning the spec for instances of cf_role I don't see any cases where a standard_name couldn't be used instead.

I ask because I have been confused about the purpose (role) of standard_name and the cf_role attribute. It seems that someone introduced cf_role with a separation of concerns between functional and quantity type in mind. Other people who have contributed did not continue using that separation of concerns for other parts of the specification.

Maybe I'm missing something that forces the inclusion of cf_role but none of the examples show use of both cf_role and standard_name so the reason for the additional functional descriptor is not clear. If we are to keep cf_role and standard_name, it would be useful to document the need for both more clearly in the specification and examples?

Regards -- Dave

@JonathanGregory writes: Following discussion, Dave and I have proposed a change in order to clarify the purpose of cf_role. We think this change would correct a defect in the convention text but would not be a material change to the convention.

@JonathanGregory writes: Pull request 434 implements this change.

@mwengren
Contributor

The only use I'm familiar with is for DSG datasets, Chapter 9.5 and all of the related examples in Appendix H that you're probably referring to.

I just know that cf_role has been adopted by at least one well-used software package I'm familiar with, ERDDAP, in making decisions on how to handle different types of CF DSG datasets.

And googling just now to check xarray usage led me to cf-xarray, which is newer and maybe more easily disentangled if cf_role were deprecated in the future. Either way, downstream impacts on community software should be considered.

Personally, I like the separation/distinction from standard_name because DSG is a fairly major component of the CF spec, but that's mostly a stylistic opinion. Functionally it doesn't seem necessary, you're right.

@ocefpaf

ocefpaf commented Feb 14, 2023

Scanning the spec for instances of cf_role I don't see any cases where a standard_name couldn't be used instead.

I do believe that both the UGRID and SGRID standards, which try to be CF-compliant, use cf_role extensively in their specifications. Deprecating cf_role would impact those two standards as well.

@JonathanGregory
Contributor

Dear Dave @dblodgett-usgs

Thanks for asking the question. I think the motivation for introducing cf_role was to provide a way to indicate which of the coordinate variables of the "instance" dimension (which runs over features i.e. stations, profiles, etc.) should be regarded as a unique identifier of the feature. As far as I know that's its only function in CF.

The variable with the cf_role attribute could also have a standard name, if we defined standard names for station names, etc. There might then be more than one possible standard name for the variable, and the standard name could say something more specific about its contents e.g. that it's either station names or station numbers, but it's not obvious to me this would be useful unless the possible names or numbers were somehow standardised, which I don't think anyone's suggested doing.

Also, it's not obvious that cf_role is really necessary for this purpose, because you could assume that the coordinate variable with the name of the dimension contains the unique identifiers i.e. a coordinate variable in the Unidata sense, not a CF auxiliary coordinate variable. The values of a coordinate variable must be unique. However, if the cf_role variable is string-valued, CF does not currently permit it to be a coordinate variable, only an auxiliary coordinate variable, partly because most of CF was written before string arrays were possible. We could revisit this.
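
As a rough illustration of the two lookup strategies described above, the sketch below models a dataset as plain Python dictionaries. This is a hypothetical stand-in, not the API of any netCDF library, and `find_instance_id` and the variable names are invented for the example. It looks first for a variable carrying cf_role, then falls back to a coordinate variable in the Unidata sense, i.e. one whose name matches the instance dimension.

```python
from typing import Optional

# Hypothetical in-memory model of a dataset: variable name -> metadata.
# Illustrative only; string identifiers are simplified to 1-D lists here.
dataset = {
    "station_name": {
        "dims": ("station",),
        "attrs": {"cf_role": "timeseries_id"},
        "values": ["alpha", "bravo", "charlie"],
    },
    "station": {
        "dims": ("station",),
        "attrs": {},
        "values": [101, 102, 103],
    },
    "temperature": {
        "dims": ("station", "time"),
        "attrs": {"standard_name": "air_temperature"},
        "values": None,
    },
}


def find_instance_id(variables: dict, instance_dim: str) -> Optional[str]:
    """Return the name of the variable that identifies each feature, or None."""
    # Strategy 1: a 1-D variable on the instance dimension that has cf_role.
    for name, var in variables.items():
        if var["dims"] == (instance_dim,) and "cf_role" in var["attrs"]:
            return name
    # Strategy 2: a coordinate variable in the Unidata sense, whose name
    # is the same as the name of its single dimension.
    var = variables.get(instance_dim)
    if var is not None and var["dims"] == (instance_dim,):
        return instance_dim
    return None


def has_unique_values(variables: dict, name: str) -> bool:
    """Check that the identifier values are distinct for every feature."""
    values = variables[name]["values"]
    return len(values) == len(set(values))
```

With the sample data, `find_instance_id(dataset, "station")` selects `station_name` because of its cf_role attribute; if that attribute were absent, the fallback would select `station`.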

When discussing this, we should also keep in mind our principle 10 (in section 1.2), "Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one)."

Best wishes

Jonathan

@davidhassell
Contributor

A few notes on the UGRID, which will be in CF-1.11. As far as I know, the only mandatory use of cf_role in UGRID is for identification of the mesh topology variable:

integer Mesh2 ;
    Mesh2:cf_role = "mesh_topology" ;

But the UGRID examples make extensive optional use of the attribute on connectivity variables, e.g.

integer Mesh2_face_nodes(nMesh2_face, Three) ;
    Mesh2_face_nodes:cf_role = "face_node_connectivity" ;
    Mesh2_face_nodes:long_name = "Maps every triangular face to its three corner nodes." ;

There was a discussion about this over at #153, which ended in an agreement that the optional cf_role instances are to be deprecated, but that getting rid of the mandatory use on the mesh topology variable would break too many existing UGRID applications. See #153 (comment), and D) in #153 (comment).

For a mesh topology variable, if we were to deprecate cf_role, we wouldn't want a standard name to replace it, rather we could easily identify the mesh topology variable as such from other mandatory attributes (such as topology_dimension).
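
A minimal sketch of that idea, assuming variable attributes are available as plain dictionaries (a hypothetical model, not a UGRID or netCDF library API). The helper name `find_mesh_topologies` and the variable `MeshNoRole` are invented for illustration. A variable is treated as a mesh topology if it has cf_role = "mesh_topology" or, failing that, if it carries the topology_dimension attribute.

```python
def find_mesh_topologies(variables: dict) -> list:
    """Return names of variables that look like mesh topology variables."""
    found = []
    for name, attrs in variables.items():
        # Current mandatory UGRID marker.
        by_role = attrs.get("cf_role") == "mesh_topology"
        # Alternative marker: another attribute a mesh topology must have.
        by_attr = "topology_dimension" in attrs
        if by_role or by_attr:
            found.append(name)
    return found


# Hypothetical attribute sets, loosely modelled on the CDL above.
variables = {
    "Mesh2": {"cf_role": "mesh_topology", "topology_dimension": 2},
    "Mesh2_face_nodes": {
        "cf_role": "face_node_connectivity",
        "long_name": "Maps every triangular face to its three corner nodes.",
    },
    # Invented case: a topology variable written without cf_role.
    "MeshNoRole": {"topology_dimension": 2},
}

topologies = find_mesh_topologies(variables)
```

Here the connectivity variable is skipped, while both topology variables are found, including the one that omits cf_role.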

Thanks,
David

@dblodgett-usgs
Contributor Author

Hi All -- I appreciate the pointers to the ugrid spec and its use of cf_role -- I personally like the use of cf_role and would rather we reserve standard_name for non-structural use cases (to give a quantity clear semantic meaning).

I guess the incorporation of ugrid, which makes use of cf_role (disregarding that it is optional for now), makes it even more important to clarify what it is in comparison to standard names such as "forecast_reference_time".

I'm pretty sure we have two schools of design thought at play in the spec and just need to mention that this dichotomy exists? Or is there some logic for having a cf_role for the "timeseries_id" that we would describe as an intentional CF design pattern?

Regards -- Dave

@JonathanGregory
Contributor

JonathanGregory commented Feb 15, 2023

Dear Dave @dblodgett-usgs

I agree that cf_role sounds like it should indicate a structural function, as in UGRID. I suspect it would not have been introduced for DSGs in the first place if string-valued coordinate variables had been possible. Its structural function is to indicate a variable which has a distinct value for every element. That's usually provided in a coordinate variable, which you can easily identify because its name is the same as the name of its dimension. Thus, cf_role is indicating the coordinate variable for the instance dimension, in effect, which could be numeric or string-valued. Unlike coordinate variables usually, the order of features along the instance dimension is probably not significant, whereas numeric coordinate variables have to be in monotonic order.

I think the contents of the cf_role attribute for DSGs are redundant, because the featureType attribute also indicates which sort of feature it is (time series, profile or trajectory), but its presence is extra information.

Best wishes

Jonathan

@dblodgett-usgs
Contributor Author

Current Spec: (section 9.5)

Where feasible a coordinate or auxiliary coordinate variable with the attribute cf_role should be included. The only acceptable values of cf_role for Discrete Geometry CF data sets are timeseries_id, profile_id, and trajectory_id. The variable carrying the cf_role attribute may have any data type. When a variable is assigned this attribute, it must provide a unique identifier for each feature instance. CF files that contain timeSeries, profile or trajectory featureTypes, should include only a single occurrence of a cf_role attribute; CF files that contain timeSeriesProfile or trajectoryProfile may contain two occurrences, corresponding to the two levels of structure in these feature types.

I would suggest the first sentence be modified / expanded to read:

Where applicable, coordinate or auxiliary coordinate variable(s) should include a cf_role attribute that indicates the variable's role in the Discrete Geometry CF dataset. cf_role is included to make the functional role of some variables explicit. In other parts of the specification this explicit definition of role is accomplished with standard names.

That's not very satisfying, but at least it clarifies that there are two ways it's been done?
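
For reference, the rules in the Section 9.5 text quoted above could be checked roughly as follows. This is only a sketch under stated assumptions: the dataset is modelled as plain dictionaries rather than read through a netCDF library, and `check_cf_roles` is a hypothetical helper name, not part of any existing compliance checker.

```python
from collections import Counter

# Values the quoted text permits for discrete sampling geometries.
PERMITTED_ROLES = {"timeseries_id", "profile_id", "trajectory_id"}

# Number of cf_role occurrences the quoted text allows per featureType.
MAX_ROLES = {
    "timeSeries": 1,
    "profile": 1,
    "trajectory": 1,
    "timeSeriesProfile": 2,
    "trajectoryProfile": 2,
}


def check_cf_roles(feature_type: str, variables: dict) -> list:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    role_vars = {n: v for n, v in variables.items() if "cf_role" in v["attrs"]}
    for name, var in role_vars.items():
        role = var["attrs"]["cf_role"]
        if role not in PERMITTED_ROLES:
            problems.append(f"{name}: cf_role '{role}' is not permitted")
        # The variable must provide a unique identifier for each feature.
        repeated = [v for v, n in Counter(var["values"]).items() if n > 1]
        if repeated:
            problems.append(f"{name}: identifiers are not unique: {repeated}")
    limit = MAX_ROLES.get(feature_type)
    if limit is not None and len(role_vars) > limit:
        problems.append(
            f"{len(role_vars)} cf_role attributes found, "
            f"at most {limit} expected for {feature_type}"
        )
    return problems
```

A dataset with a single `timeseries_id` variable holding distinct values passes with an empty list, whereas an unknown role value, repeated identifiers, or too many cf_role variables each add one message.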

@JonathanGregory
Contributor

Dave @dblodgett-usgs and I have exchanged emails about this. As a result, we would like to propose in Section 9.5 to change

Where feasible a coordinate or auxiliary coordinate variable with the attribute cf_role should be included. The only acceptable values of cf_role for Discrete Geometry CF data sets are timeseries_id, profile_id, and trajectory_id.

to

Where feasible, one of the coordinate or auxiliary coordinate variables of a discrete sampling geometry should have an attribute named cf_role. This attribute has no other function in the CF convention (despite its general-sounding name), and its only permitted values are timeseries_id, profile_id, and trajectory_id.

This change to the text would not alter the meaning of the convention. Its purpose is to clarify the purpose of the cf_role attribute, so we think it's a correction of a defect. Therefore this proposal will be accepted in three weeks from now (on 31st March) if no-one raises concerns.

Thanks

Jonathan

@JonathanGregory JonathanGregory added defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors and removed question labels Mar 10, 2023
@dblodgett-usgs
Contributor Author

Thanks for following this up @JonathanGregory -- I think this is a good all around solution.

@JonathanGregory JonathanGregory linked a pull request Mar 31, 2023 that will close this issue
@JonathanGregory JonathanGregory added the change agreed Issue accepted for inclusion in the next version and closed label Mar 31, 2023
@JonathanGregory
Contributor

Three weeks have passed with no further comment, so this change is accepted. Dave @dblodgett-usgs, please could you check the PR and merge if satisfactory? Thanks.


5 participants