Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new integer types to CF #243

Closed
ethanrd opened this issue Feb 10, 2020 · 27 comments · Fixed by #244 or #294
Closed

Add new integer types to CF #243

ethanrd opened this issue Feb 10, 2020 · 27 comments · Fixed by #244 or #294
Assignees
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@ethanrd
Copy link
Member

ethanrd commented Feb 10, 2020

Title

Add new integer types to CF (From Trac ticket #166)

Moderator

@erget

Moderator Status Review [last updated: 10 Feb 2020 - @ethanrd ]

This enhancement was first suggested by @czender on the CF email list in September 2017 (here). That discussion was split and moved to two Trac tickets in Oct 2017:

  • Trac Ticket #166 (“Add new integer types to CF”)
  • Trac Ticket #167 (“Use of valid_range to indicate unsigned integers”)

After some discussion and a few minor changes, both Trac ticket 166 and 167 were, I believe, agreed upon without objection. Due to the move from Trac to GitHub the changes were not incorporated into the CF document. This GitHub issue addresses the remaining changes agreed on in Trac ticket #166.

[I will add a GitHub issue to address Trac ticket #167 shortly.]

Requirement Summary

Technical Proposal Summary

Add support in CF for the five “new” integer types supported by netCDF-4 (and now in netCDF-3 with the CDF5 encoding released in netCDF-C version 4.4.0). The five “new” integer types are: unsigned byte, unsigned short, unsigned int, int64, and unsigned int64.

Benefits

Explicit support for unsigned integer types

Status Quo

All integer types are assumed to be signed unless one of several workarounds/conventions are followed.

Detailed Proposal

Change the first paragraph of Section 2.2 "Data Types" as follows:

  • Add the five “new” integer types in the list of acceptable data types.
  • Add ‘unsigned byte’ to the sentence on one byte numeric data.
  • Remove the sentence stating that all integer types are treated as unsigned.
  • Modify language throughout document referring specifically to integers to allow more generic use of "integer types"

See PR #244

@ethanrd ethanrd added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Feb 10, 2020
ethanrd added a commit to ethanrd/cf-conventions that referenced this issue Feb 10, 2020
@ethanrd
Copy link
Member Author

ethanrd commented Feb 10, 2020

Meant to ping a few others involved in original discussions: @JimBiardCics @Dave-Allured @rsignell-usgs (didn't find user names for all from email discussion).

I wasn't involved in earlier discussions and happy to be moderator.

@Dave-Allured
Copy link
Contributor

Thanks Ethan. I support this proposal as currently written. It is overdue.

@czender
Copy link
Contributor

czender commented Feb 11, 2020

Me two.

@erget
Copy link
Member

erget commented Feb 11, 2020

@ethanrd I'm happy to moderate this if you like. I also support this proposal.

@erget erget self-assigned this Feb 11, 2020
@martinjuckes
Copy link
Contributor

Hi Ethan, I support this .. but we need to check other references to "integer". There are 9 statements in the convention that something "must be type integer" (e.g. "In this representation, the file contains an index variable , which must be of type integer, and must have the sample dimension as its single dimension"). Perhaps all occurrences of this phrase can be replaced with "must have an integer type"?

It might be worth stating the link between flag_values, which are expressed as, e.g. 1b, 2b, 4b, and the new types.

@JimBiardCics
Copy link
Contributor

I wholeheartedly approve of this proposal. @martinjuckes makes an excellent point regarding "must have an integer type".

@JimBiardCics
Copy link
Contributor

@martinjuckes Notice that the examples where flag_values are given as 1b, 2b, etc, have the 'b' indicator because the variables in the examples are of type byte. It would be good to change an example to use a different integer type so that this is clearer. We should also make sure the language in the text is clear about type agreement between the attribute and the variable.

@JimBiardCics
Copy link
Contributor

@ethanrd The CF documentation regarding constants of various types references the NUG, which gives some examples of CDL syntax, but which is by no means a rigorous description of how to represent each data type in an attribute. The ncgen man page has a discussion of this, but that sure seems like a multiply-removed place for a basic definition to reside. Perhaps we need to have the NUG or CF go into greater detail about this. (I think the NUG is probably the more appropriate place.)

@steingod
Copy link

I support this, it will be very welcome.

@erget
Copy link
Member

erget commented Feb 27, 2020

Summary

3 weeks have passed since any contribution and it seems we're pretty much at consensus. Supporting voices, including some from the Conventions Committee:

In my opinion one item should be addressed before this merge, however:
the issue raised by @martinjuckes concerning ensuring internal consistency in document. @JimBiardCics agrees with this concern. @ethanrd could you respond to this please?

@Dave-Allured
Copy link
Contributor

The original, simple proposal as currently drafted by @ethanrd will have no bearing on other defects or ambiguities in other references to "integer", flag values, or type agreement between attributes and variables.

For the sake of efficient process, I request that these other issues be split off as separate Github issues, as needed. I request consensus on the current draft with no further effort. I think this is in good shape to move. Thanks for your consideration.

@erget
Copy link
Member

erget commented Feb 28, 2020

As noted by @Dave-Allured, @ethanrd is attempting to remove restrictions on data types in general. This has broad support. The concern of @martinjuckes is that it would be possible to lift the current restriction on index variables, as is used for ragged arrays. However, this could also be addressed in a separate issue - under the current proposal, ragged arrays would still require integers, but we could change that in the future and in the meantime make use of the new types for other variables.

Is this OK for the group? Barring objections I will merge this after the three week deadline, namely 24 March 2020.

@JonathanGregory
Copy link
Contributor

I think readers of the document would probably want to understand "integer" to mean any integer type (any number of bytes, signed or unsigned). There is one reference to integer in regard to flag_values and flag_meanings, which must be of the same type as the data variable (whatever that is). All the others relate to discrete sampling geometries, in various roles. I think any integer type would serve those purposes as well.

I agree with @martinjuckes that it would be sensible to replace "type integer" with "any integer type" (or something like that) everywhere, and it would be helpful in sect 2.2 to list explicitly all the datatypes which we mean by "any integer type" in the document.

I too support the change being made.

@erget
Copy link
Member

erget commented Feb 28, 2020

@JonathanGregory @martinjuckes I agree that these are small changes and support both their implementation and the adoption of this (small) change as a whole. Unfortunately I'm about to be offline for several weeks so I won't be able to pursue implementing these changes myself. @Dave-Allured perhaps you'd be interested in contributing those to @ethanrd 's PR? If they've been implemented before my return and there are no further objections I am happy to merge the changes upon my return at the end of March.

@Dave-Allured
Copy link
Contributor

perhaps you'd be interested in contributing those to @ethanrd 's PR?

No thank you. I believe that the original proposal is sufficient and elegant. I would prefer to leave further tuning to those contributors who are unsatisfied and have changes to offer.

@ethanrd
Copy link
Member Author

ethanrd commented Feb 28, 2020

While I agree with @Dave-Allured on keeping PRs focused, I can also see @martinjuckes point that the "must be type integer" statements are linked to this PR. It is also a pretty minimal change, so I will update the "must be type integer" statements and push that change to this PR.

The flag_values section might, I think, need some discussion so I will break that into a separate issue/PR.

@ethanrd
Copy link
Member Author

ethanrd commented Feb 28, 2020

I am adding the following sentences to the end of the first paragraph in Section 2.2 "Data Types":

In many situations, any integer type may be used.
When the phrase "any integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.

The wording seems a bit awkward, so please make suggestions and I will update the PR.

@ethanrd
Copy link
Member Author

ethanrd commented Feb 28, 2020

Expanding to

... When the phrases "an integer type" or "any integer type" are used ...

Anything else? Or again, better wording all around?

@graybeal
Copy link

graybeal commented Mar 1, 2020

(Just an idea, take it or leave it.) I would add the following text after the next-to-last sentence in the paragraph (which first uses the phrase "integer types"): "In this document, the term "integer type" includes all of the following types: …" Could be in parentheses, or not.

I think that means the same as what your text expresses.

@martinjuckes
Copy link
Contributor

The definition of "integer types" could perhaps be worked into the first sentence, so as to avoid repeating the list of 8 types:

The netCDF external data types are supported: 8 integer types (byte, unsigned byte, short, unsigned short, int, unsigned int, int64, or unsigned int64), 2 floating point types (float, double), and two character types (string, char).

Is "supported" (taken from the existing convention text) the right word here? Perhaps it would be better to say "Data variables must be assigned one of the following NetCDF external data types: ..". NetCDF also allows user defined data types -- but I don't think we want to allow these in CF at this point (just because no use case has been proposed so far).

The last sentence of the first paragraph ("It is possible to treat the byte type as unsigned by using the NUG convention of indicating the unsigned range using the valid_min, valid_max, or valid_range attributes.") looks mis-leading to me. As we now have two explicit byte data types, I suggest we remove this.

@JimBiardCics
Copy link
Contributor

@ethanrd I like your verbiage. I think you could simplify it a bit by saying

In many situations, any integer type may be used.
When the phrase "integer type" is used in this document, it should be understood to mean byte, unsigned byte, short, unsigned short, int, unsigned int, int64, or unsigned int64.

This covers both "an integer type" and "any integer type".

@ethanrd
Copy link
Member Author

ethanrd commented Mar 10, 2020

@graybeal @martinjuckes @JimBiardCics - Thanks for your input.

@martinjuckes : I agree the first sentence in the Data Type section could be stronger. I would suggest simplifying your alternate sentence by dropping "assigned" and being more specific that these are all netCDF external data types supported by netCDF-4:

"Data variables must be one of the following data types: ... list of data types ... (which are all the netCDF external data types supported by netCDF-4)."

@martinjuckes
Copy link
Contributor

@ethanrd : yes, dropping "assigned" is a good idea and I like your latest reformulation.

@ethanrd ethanrd linked a pull request Mar 23, 2020 that will close this issue
@erget
Copy link
Member

erget commented Mar 26, 2020

It looks like consensus has been reached on the text as it stands - I'm starting the countdown and will merge this PR, barring objections or further changes, on 2020-04-16.

@erget
Copy link
Member

erget commented Apr 16, 2020

As noted in my previous comment, consensus has been reached on this proposal and the corresponding PR.

Support has been expressed by

All concerns expressed have been duly addressed and they have not been further pursued.

Therefore the prerequisites for acceptance are met and the 3 week think period is up; I am merging this PR. Thanks all for your contributions.

erget pushed a commit that referenced this issue Apr 16, 2020
* Add new integer types in CF 2.2 Data Types (#243)

* Update use of phrase "must be type integer" to "must have an integer type". Add explicit list of integer types in section 2.2.
@ethanrd
Copy link
Member Author

ethanrd commented Aug 28, 2020

Some final rewording agreed above (#243 (comment)) didn't get pushed to PR #244 before it was merged.
Reopening this issue and linking new PR (#294).

@czender
Copy link
Contributor

czender commented Sep 25, 2020

Thank you everyone for your efforts with this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format
Projects
None yet
9 participants