Stop using floats in time coordinate examples? #383
-
Regarding: I recognize that single precision floats are rarely a good choice for a time coordinate when storing climate data. I am not opposed to eliminating instances of the "float time" examples, but in most cases I would replace them with "double precision time" examples, rather than integers.

In climate research I find a double precision float representation of time best because I can express a time in "days since" with a precision of at least seconds, and yet for data spanning months to centuries I can easily convert that, approximately, to months (or even years) by dividing in my head. If I were to represent the time as an integer with units of seconds, I would need to divide by ~30*24*60*60 (whatever that comes to) to get months. I would have to resort to a calculator simply to get an idea of what time period is spanned by the time coordinates. If a library can't retain the precision of a double precision float in whatever conversions it makes, I'd say that library should not be used.

I would note that much climate data is stored at a precision much higher than is warranted by its accuracy. That, of course, means it occupies more storage than is necessary, but no one interprets the precision of data stored in a netCDF file as somehow implying that all its digits are significant.
-
Hmm -- good point. Units of days (or hours, depending) can be more human-readable, for sure. Though I, at least, don't very often look at the raw values anyway -- ncdump will convert to ISO timestrings for you, and any programmatic reading tools are converting one way or another anyway. Frankly, the "time_unit since" representation isn't very human-friendly at all. Anyway, I'm not suggesting that we disallow, or even recommend against, using floating point for time. But I do think it's not ideal to have it for ALL the examples. Ironically, the only ones that use an integer time are hours or days. Also -- do you really need second precision for climate work? (But I digress.)

Sure -- but the problem is that you don't know how many digits are significant -- most importantly, automated tools don't know.
-
Not for most climate work, but even the 1-hour accuracy needed to get the hour of a day right (i.e., what part of the diurnal cycle are you in?) will only allow you to store about 150 years (and quite a few climate simulations are much longer than that). If you want minute accuracy, you could at most store about 2.5 years, and second accuracy limits you to about 15 days. Note that 3-hourly data from models is often stored for at least a portion of 1000-year control runs (to analyze the diurnal cycle). To get an estimate of a time derivative over a 3-hour interval you need to subtract the two adjacent time coordinates. If the time coordinate data were stored with minute accuracy, the interval could be off by as much as 0.5%. "Time" stored at 1-hour accuracy would be ruled out, and something close to second accuracy would be desirable (but then you could only store 15 days of the simulation, as noted above).

I think for regular integer data, the above maximums are increased by about a factor of 1000, so if you only need minute accuracy, you could handle data spanning a couple thousand years. Still, in that case, the units would have to be "minutes since ..." and again I'd have to divide by 30*24*60 to get time in units of months, but as you say, who looks at the raw coordinate values. Accurate time-interval calculation based on closely spaced time samples would also still be problematic.

We could convert any "float" examples to "integer" if that would make sense for the particular example, but for others I think we should use "double precision".
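(For anyone who wants to check limits like these, a rough sketch, assuming float32 with "days since" units. These are upper bounds that tolerate a full unit-in-the-last-place of rounding error; requiring a few guard bits shortens the usable spans, toward the figures quoted above:)

```python
# The gap between adjacent float32 values near t is about t * 2**-23, so a
# float32 "days since" coordinate keeps a given resolution only out to
# roughly resolution * 2**23 days.
for label, res_days in [("1 hour", 1 / 24), ("1 minute", 1 / 1440), ("1 second", 1 / 86400)]:
    span = res_days * 2**23
    print(f"{label:>8}: usable span ~{span:10.0f} days (~{span / 365.25:7.1f} years)")
```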
-
In my group we had mixed-"precision" times: sometimes we knew the time to the minute, sometimes only to the day. At first, we would let xarray pick the best representation for us, which kept the precision encoded in the units and used an integer dtype. Xarray would also start the epoch at the oldest date and count forward from 0 in the data (if sorted). We found, however, that having a common epoch and duration for all our data was much more useful for our MATLAB users, so we ended up with "days since" with a double dtype. Each time variable got a "resolution" attribute to store what the actual input precision was.

I don't think you can ever infer from the units and computer data type what the significant digits are in a scientific context; that is, floats and doubles are all basically "fixed significance", and for scientific significance you need to be explicitly told. I would not support changing all examples to use integer dtypes, and agree with @taylor13 that the float ones could be changed to doubles.

Can you open an issue with the models that you use, like FVCOM? I only work with one Fortran programmer, and we (them and me) were getting different calculated outputs; it turns out they were just not used to using doubles and so didn't by default, whereas my language defaulted to doubles.
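(A minimal netCDF4-python sketch of the scheme described above -- the "resolution" attribute name and its values are this group's in-house convention, not anything defined by CF, and the file contents are made up:)

```python
import netCDF4
import numpy as np

with netCDF4.Dataset("cruise_times.nc", "w") as nc:  # hypothetical file
    nc.createDimension("time", 3)
    time = nc.createVariable("time", "f8", ("time",))  # double, common epoch
    time.units = "days since 1950-01-01 00:00:00"
    time.calendar = "gregorian"
    time.resolution = "1 minute"  # in-house attribute: the actual input precision
    # one value known only to the day, two known to the minute
    time[:] = np.array([25567.0, 25567.0 + 90 / 1440, 25567.0 + 95 / 1440])
```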
-
@DocOtak wrote:
Now THAT is a good idea! I would very much like to see a convention for that -- would you all support adding that to CF?
Should we make a proposal for that? Is there a precedent for defining precision in CF? All I see is: 8.3.8. Interpolation Parameters
And that has a different meaning, anyway -- e.g. "32" means a 32-bit float was used -- which we already know. It's also talked about in 8.4. Lossy Compression via Quantization -- which maybe is what folks should do with time in double, but no one's going to :-) Back to the topic at hand: I propose:
Optional: add a brief discussion of what units one might choose to use, and why. If folks think this is a good idea, then I'll start a PR, and we can hash out the details. Side note: I can't figure out what MATLAB is currently doing under the hood, but it does say its datetime type supports nanoseconds -- I'm guessing maybe int64?
-
As my motivation for specifying a resolution for time is specifically to clarify what a float time actually means, I think a generic approach is exactly wrong. That said, I consider this resolution and the idea of significant figures to be aspects of uncertainty, and if we wanted to do things right, we would focus on getting a more developed uncertainty section added. Saying that the precision of the time is 0.0000115740740740740734993 days isn't all that helpful. In fact, it would take me some thought as to how to use that number. Hmm -- maybe it is:
so maybe?
Well, yes, but this isn't uncertainty per se -- it is really about precision, which is sometimes related to uncertainty. In this example, let's say people have data that is exact to one second. They write that out into a double "days since". Now the values carry a lot of "garbage" data -- anything that's less than a second is meaningless, but the end reader has no idea what fraction is meaningless -- accurate to the second? To the microsecond? Who knows? If you examine all the values, it may become clear what the precision originally was (which is what xarray does now, when it can), but without doing that there is no way to know. And this probably does apply to virtually any data stored in floats, so maybe a:
Would that really need a fuller discussion of uncertainty? Alternatively, make something specific for time.
Yes! Or maybe guidance for creating / reading CF files -- I do think the reading part, in particular, is under-documented.
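(The number quoted above is just one second expressed in "days since" units as a double; a short sketch makes the "garbage digits" point concrete:)

```python
# one second in "days since" units, as a double
print(repr(1.0 / 86400.0))  # ~1.1574074074074073e-05
# the trailing digits are binary rounding residue, not information, and
# nothing in a data file tells the reader how many of them are meaningful
```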
-
PR here: cf-convention/cf-conventions#557 Feedback more than welcome!
-
Hi, I'm not so keen on seeing coordinates that represent a continuous property (e.g. time) being represented as integers. Of course it's not wrong, but I don't think that it's good practice to implicitly endorse in the conventions. It opens the possibility of non-integer values being rounded when assigned to the variable, and can potentially complicate the post-processing of the data into non-integers (such as creating new coordinates for interpolated data). Cheers,
-
I am also uncomfortable with integers. I see no advantages to them, compared with "double" (float), and David has described some disadvantages. I think the advantages claimed above for integers are questionable. So, I would not favor changing any of the float (double or single precision) time variables to integers. I do think, for the reasons discussed above, that all the (single precision) float time declarations should be changed to "double".
-
Sorry for jumping the gun. From the few who posted, I didn't think this was as controversial as it is. I tend to try to keep things moving while I'm thinking about them, 'cause if I don't, it'll never get finished. I'll retract the PR for now, and (maybe) start an issue, depending on how this discussion goes. @jonathan: I'm still a bit confused on discussion vs issue vs PR -- in my mind, when it's time to work on the text itself (the basic idea already having been agreed upon), a PR is the way to go. But I'll try to stick with the prescribed process :-) Anyway, this one is still in the discussion stage.
-
I'm going to respectfully disagree here -- as much as we would like them to be, floating point numbers are not actually continuous anyway. Rather, they are discontinuous in a hard-to-understand way, with precision varying according to the magnitude of the value. This is well suited to much scientific computation -- you preserve a constant amount of information (similar to significant figures) -- but not well suited to a time coordinate. For instance, if you use float days, you can get millisecond precision for one day, but well less than second precision after a few years. Double is big enough to get second precision for thousands of years, so we get away with it, but that doesn't make it optimal. I think there is a lot to learn from time libraries -- virtually all of them use integers -- it used to be seconds, now often milliseconds, and numpy, for instance, lets you set it down to the picosecond -- but none use a floating point type. Even MATLAB, which used to use doubles (because it used doubles for everything), now appears to have an integer datetime type, down to the millisecond. If the data has a fixed precision, which it always will (certainly model data), then it's best to capture that precision consistently. And the added bonus is that that precision is conveyed to the end user as well.
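(A minimal illustration of that varying precision, using numpy's spacing(), which returns the gap between a value and the next representable one; values are in "days since" units:)

```python
import numpy as np

# gap between adjacent representable "days since" values, in seconds
for t in (1.0, 365.0, 3650.0):
    gap32 = float(np.spacing(np.float32(t))) * 86400
    gap64 = float(np.spacing(np.float64(t))) * 86400
    print(f"day {t:6.0f}: float32 resolves ~{gap32:.2g} s, float64 ~{gap64:.2g} s")
```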
It does, but that's a good thing :-) -- people SHOULD think about the precision of their data. And as above, it's actually rare that the original data starts out in a floating point data type.
Not really -- integer to float is not hard -- unless you have values that can't be exactly represented as a floating point type -- but that's lost data, so again, better that it's explicit. Funny you should bring up those points -- I started this discussion because I've been involved with xarray's attempt to improve its handling of time encodings -- internally, it uses int64 time at nanosecond precision (they are working on maybe allowing other precisions, but that's not the point here). But most folks don't want nanosecond precision in the output (cftime won't even allow it) -- so what to do? Anyway, using floating point types in data files complicates all this -- rather than simplifying it -- in fact, it makes it somewhat intractable. It's amazing how much we can get away with treating floating point numbers as real numbers (Fortran even called them that!) but they aren't -- and they should always be used with caution, and only where appropriate. Frankly, I would argue that one should rarely, if ever, use a floating point type for a time axis; I'll concede that people have their reasons, but we certainly shouldn't imply in the CF docs that it's almost always the best practice. NOTE: in the current version, the only integer times are "days" and "hours", which is interesting to me, but someone presumably did it thoughtfully, and I think it IS good to clearly say "this is daily (hourly) data".
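(For what it's worth, xarray already lets you opt out of the float default when writing: set the time coordinate's encoding to integer storage at whatever resolution the units imply. A sketch -- the dataset and file names are made up:)

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"sst": ("time", np.random.rand(8))},  # made-up variable
    coords={"time": pd.date_range("2000-01-01", periods=8, freq="3h")},
)
# ask for integer seconds on disk instead of the float default
ds["time"].encoding.update(
    {"units": "seconds since 2000-01-01 00:00:00", "dtype": "int64"}
)
ds.to_netcdf("sst_3hourly.nc")
```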
-
So, I'm coming down on the side of "both". I think there are good reasons to record time as an int, and good reasons to record it as a float, and I think there is no universal best answer; it's going to depend on context. So I would favor using floats in some examples and ints in others.
-
Topic for discussion
I've noticed that a lot of the output from oceanographic models (what I work with) uses a floating point dtype in the time coordinate, e.g.:
This is, of course, perfectly CF-compliant, but floating point is not the best choice for a time variable:
If single precision, you lose second precision after about 3 years (which has caused me problems).
If double precision, you have millions of years with second precision, but then you have too much precision -- downstream tools have no idea how many of those digits are important. (This is relevant to, say, xarray, which currently uses nanosecond precision by default internally -- see the sketch after this list -- which is then problematic when you write it out again, as many tools (e.g. cftime) don't handle nanoseconds.)
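(A two-line sketch of that default, on xarray versions that are still nanosecond-only; newer releases are starting to allow other resolutions:)

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset(coords={"time": pd.date_range("2000-01-01", periods=3)})
print(ds["time"].dtype)  # datetime64[ns] -- xarray's internal representation
```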
Anyway, I think most would agree that floating point is not the best data type for time. As far as I know, every time library uses integers. (Except MS Excel, but I won't go there...)
This confused me, as most (all?) models use integers (usually seconds) internally -- so why write it out as floating point days or hours?
I think I may know why:
Almost all the examples in the CF doc use floating point types for time, e.g.:
Example 4.4. Time axis
And that is the initial example.
I just searched through the entire document, and found only two examples that are integer time types:
Example 5.7. Lambert conformal projection
Example 7.15. Timeseries with geometry.
and those are units of hours and days, so not suitable if you want sub-hour precision.
Some even use single precision (I think these are all in section 5: Coordinate Systems and Domain):
Example 5.21. A two-dimensional UGRID mesh topology variable
Ouch! I should have noticed that a long time ago! [1]
Anyway -- a lot of people (myself included) learn by example -- rather than reading the spec carefully, they look for an example that seems to be what they need, and copy that. Because of that, I think we really should have the examples follow best practices -- and I don't think floating point for time is generally a best practice.
My proposal: we replace (many of) the examples of time coordinates in the CF doc with integer type examples.
This is not a change of the convention at all -- I don't think CF says anything about what data type to use for, well, anything, not even as a recommendation. (should it make recommendations?) -- but what is the barrier to entry for changing examples in this way?
[1] I had never really thought about this until recently -- I used various num2date() functions, and they just worked. But I really became aware when I noticed that when working with output from FVCOM (an unstructured-mesh oceanographic model), I was getting timestamps like 12:00:18 when I should have gotten a nice round 12:00:00, and this made a mess because I wanted to use the results at 12:00:00 and got an out-of-time-bounds error. Anyway, turns out they were using float "days since" for the time axis. And now I think I may know why -- because that's what the UGRID example uses!
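(That FVCOM-style drift is easy to reproduce with cftime; a sketch in which the epoch and date are made up, and the size of the error grows with distance from the epoch:)

```python
import numpy as np
import cftime

units = "days since 2000-01-01 00:00:00"
# 13:00 is not a binary fraction of a day, so it can't be stored exactly
exact = cftime.date2num(cftime.datetime(2020, 7, 14, 13, 0, 0), units)
stored = np.float32(exact)  # what a float32 "days since" variable keeps
print(cftime.num2date(float(stored), units))  # typically off by ~10-20 seconds
print(float(np.spacing(stored)) * 86400, "s between adjacent float32 values here")
```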