Stop using floats in time coordinate examples? #383
-
Regarding: I recognize that single precision floats are rarely a good choice for a time coordinate when storing climate data. I am not opposed to eliminating instances of the "float time" examples, but in most cases I would replace them with "double precision time" examples, rather than integers.

In climate research I find a double precision float representation of time best because I can express a time in "days since" with a precision of at least seconds, and yet for data spanning months to centuries I can easily convert that, approximately, to months (or even years) by dividing in my head. If I were to represent the time as an integer with units of seconds, I would need to divide by ~30*24*60*60 (whatever that comes to) to get months. I would have to resort to a calculator simply to get an idea of what time period is spanned by the time coordinates. If a library can't retain the precision of a double precision float in whatever conversions it makes, I'd say that library should not be used.

I would note that much climate data is stored at a precision much higher than is warranted by its accuracy. That, of course, means it occupies more storage than is necessary, but no one interprets the precision of data stored in a netCDF file as somehow implying that all its digits are significant.
-
Hmm -- good point. Units of days (or hours, depending) can be more human-readable, for sure. Though I, at least, don't very often look at the raw values anyway -- ncdump will convert to ISO timestrings for you, and any programmatic reading tools are converting one way or another anyway. Frankly, the "time_unit since" representation isn't very human-friendly at all. Anyway, I'm not suggesting that we disallow, or even recommend against, using floating point for time. But I do think it's not ideal to have it for ALL the examples. Ironically, the only ones that use an integer time are hours or days. Also -- do you really need second precision for climate work? (But I digress.)

Sure -- but the problem is that you don't know how many digits are significant -- most importantly, automated tools don't know.
-
Not for most climate work, but even the 1-hour accuracy needed to get the hour of a day right (i.e., what part of the diurnal cycle are you in?) will only allow you to store about 150 years (and quite a few climate simulations are much longer than that). If you want minute accuracy, you could at most store about 2.5 years, and second accuracy limits you to about 15 days. Note that 3-hourly data from models is often stored for at least a portion of 1000-year control runs (to analyze the diurnal cycle). To get an estimate of a time derivative over a 3-hour interval you need to subtract the two adjacent time coordinates. If the time coordinate data were stored with minute accuracy, the interval could be off by as much as 0.5%. "Time" stored at 1-hour accuracy would be ruled out, and something close to second accuracy would be desirable (but then you could only store 15 days of the simulation, as noted above).

I think for regular integer data, the above maximums are increased by about a factor of 1000, so if you only need minute accuracy, you could handle data spanning a couple thousand years. Still, in that case, the units would have to be "minutes since ..." and again I'd have to divide by 30*24*60 to get time in units of months, but as you say, who looks at the raw coordinate values. Accurate time-interval calculation based on closely spaced time samples would also still be problematic.

We could convert any "float" examples to "integer" if that would make sense for the particular example, but for others I think we should use "double precision".
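(For anyone who wants to check limits like these, a rough sketch, assuming float32 with "days since" units. These are upper bounds that tolerate a full unit-in-the-last-place of rounding error; requiring a few guard bits shortens the usable spans, toward the figures quoted above:)

```python
# The gap between adjacent float32 values near t is about t * 2**-23, so a
# float32 "days since" coordinate keeps a given resolution only out to
# roughly resolution * 2**23 days.
for label, res_days in [("1 hour", 1 / 24), ("1 minute", 1 / 1440), ("1 second", 1 / 86400)]:
    span = res_days * 2**23
    print(f"{label:>8}: usable span ~{span:10.0f} days (~{span / 365.25:7.1f} years)")
```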
-
In my group we had mixed-"precision" times: sometimes we knew the time to the minute, sometimes only to the day. At first, we would let xarray pick the best representation for us, which kept the precision encoded in the units and used an integer dtype. Xarray would also start the epoch at the oldest date and count forward from 0 in the data (if sorted). We found, however, that having a common epoch and duration for all our data was much more useful for our MATLAB users, so we ended up with "days since" with a double dtype. Each time variable got a "resolution" attribute to store what the actual input precision was.

I don't think you can ever infer from the units and computer data type what the significant digits are in a scientific context; that is, floats and doubles are all basically "fixed significance", and for scientific significance you need to be explicitly told. I would not support changing all examples to use integer dtypes, and agree with @taylor13 that the float ones could be changed to doubles.

Can you open an issue with the models that you use, like FVCOM? I only work with one Fortran programmer, and we (them and me) were getting different calculated outputs; it turns out they were just not used to using doubles and so didn't by default, whereas my language defaulted to doubles.
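(A minimal netCDF4-python sketch of the scheme described above -- the "resolution" attribute name and its values are this group's in-house convention, not anything defined by CF, and the file contents are made up:)

```python
import netCDF4
import numpy as np

with netCDF4.Dataset("cruise_times.nc", "w") as nc:  # hypothetical file
    nc.createDimension("time", 3)
    time = nc.createVariable("time", "f8", ("time",))  # double, common epoch
    time.units = "days since 1950-01-01 00:00:00"
    time.calendar = "gregorian"
    time.resolution = "1 minute"  # in-house attribute: the actual input precision
    # one value known only to the day, two known to the minute
    time[:] = np.array([25567.0, 25567.0 + 90 / 1440, 25567.0 + 95 / 1440])
```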
-
@DocOtak wrote:
Now THAT is a good idea! I would very much like to see a convention for that -- would you all support adding that to CF?
Should we make a proposal for that? Is there a precedent for defining precision in CF? All I see is: 8.3.8. Interpolation Parameters
And that has a different meaning, anyway -- e.g. "32" means a 32-bit float was used -- which we already know. It's also talked about in 8.4. Lossy Compression via Quantization -- which maybe is what folks should do with time in double, but no one's going to :-) Back to the topic at hand: I propose:
Optional: add a brief discussion of what units one might choose to use, and why. If folks think this is a good idea, then I'll start a PR, and we can hash out the details. Side note: I can't figure out what MATLAB is currently doing under the hood, but it does say its datetime type supports nanoseconds -- I'm guessing maybe int64?
-
As my motivation for specifying a resolution for time is specifically to clarify what a float time actually means, I think a generic approach is exactly wrong. That said, I consider this resolution and the idea of significant figures to be aspects of uncertainty, and if we wanted to do things right, we would focus on getting a more developed uncertainty section added. Saying that the precision of the time is 0.0000115740740740740734993 days isn't all that helpful. In fact, it would take me some thought as to how to use that number. Hmm -- maybe it is:
so maybe?
Well, yes, but this isn't uncertainty per se -- it is really about precision, which is sometimes related to uncertainty. In this example, let's say people have data that is exact to one second. They write that out into a double "days since". Now the values carry a lot of "garbage" data -- anything that's less than a second is meaningless, but the end reader has no idea what fraction is meaningless -- accurate to the second? To the microsecond? Who knows? If you examine all the values, it may become clear what the precision originally was (which is what xarray does now, when it can), but without doing that there is no way to know. And this probably does apply to virtually any data stored in floats, so maybe a:
Would that really need a fuller discussion of uncertainty? Alternatively, make something specific for time.
Yes! Or maybe guidance for creating / reading CF files -- I do think the reading part, in particular, is under-documented.
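(The number quoted above is just one second expressed in "days since" units as a double; a short sketch makes the "garbage digits" point concrete:)

```python
# one second in "days since" units, as a double
print(repr(1.0 / 86400.0))  # ~1.1574074074074073e-05
# the trailing digits are binary rounding residue, not information, and
# nothing in a data file tells the reader how many of them are meaningful
```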
-
PR here: cf-convention/cf-conventions#557 Feedback more than welcome!
-
Hi, I'm not so keen on seeing coordinates that represent a continuous property (e.g. time) being represented as integers. Of course it's not wrong, but I don't think that it's good practice to implicitly endorse in the conventions. It opens the possibility of non-integer values being rounded when assigned to the variable, and can potentially complicate the post-processing of the data into non-integers (such as creating new coordinates for interpolated data). Cheers,
-
I am also uncomfortable with integers. I see no advantages to them, compared with "double" (float), and David has described some disadvantages. I think the advantages claimed above for integers are questionable. So, I would not favor changing any of the float (double or single precision) time variables to integers. I do think, for the reasons discussed above, that all the (single precision) float time declarations should be changed to "double".
-
Sorry for jumping the gun. From the few who posted, I didn't think this was as controversial as it is. I tend to try to keep things moving while I'm thinking about them, 'cause if I don't, it'll never get finished. I'll retract the PR for now, and (maybe) start an issue, depending on how this discussion goes. @jonathan: I'm still a bit confused on discussion vs issue vs PR -- in my mind, when it's time to work on the text itself (the basic idea already having been agreed upon), a PR is the way to go. But I'll try to stick with the prescribed process :-) Anyway, this one is still in the discussion stage.
-
I'm going to respectfully disagree here -- as much as we would like them to be, floating point numbers are not actually continuous anyway. Rather, they are discontinuous in a hard-to-understand way, with precision varying according to the magnitude of the value. This is well suited to much scientific computation -- you preserve a constant amount of information (similar to significant figures) -- but not well suited to a time coordinate. For instance, if you use float days, you can get millisecond precision for one day, but well less than second precision after a few years. Double is big enough to get second precision for thousands of years, so we get away with it, but that doesn't make it optimal. I think there is a lot to learn from time libraries -- virtually all of them use integers -- it used to be seconds, now often milliseconds, and numpy, for instance, lets you set it down to the picosecond -- but none use a floating point type. Even MATLAB, which used to use doubles (because it used doubles for everything), now appears to have an integer datetime type, down to the millisecond. If the data has a fixed precision, which it always will (certainly model data), then it's best to capture that precision consistently. And the added bonus is that that precision is conveyed to the end user as well.
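(A minimal illustration of that varying precision, using numpy's spacing(), which returns the gap between a value and the next representable one; values are in "days since" units:)

```python
import numpy as np

# gap between adjacent representable "days since" values, in seconds
for t in (1.0, 365.0, 3650.0):
    gap32 = float(np.spacing(np.float32(t))) * 86400
    gap64 = float(np.spacing(np.float64(t))) * 86400
    print(f"day {t:6.0f}: float32 resolves ~{gap32:.2g} s, float64 ~{gap64:.2g} s")
```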
It does, but that's a good thing :-) -- people SHOULD think about the precision of their data. And as above, it's actually rare that the original data starts out in a floating point data type.
Not really -- integer to float is not hard -- unless you have values that can't be exactly represented as a floating point type -- but that's lost data, so again, better that it's explicit. Funny you should bring up those points -- I started this discussion because I've been involved with xarray's attempt to improve its handling of time encodings -- internally, it uses int64 time at nanosecond precision (they are working on maybe allowing other precisions, but that's not the point here). But most folks don't want nanosecond precision in the output (cftime won't even allow it) -- so what to do? Anyway, using floating point types in data files complicates all this -- rather than simplifying it -- in fact, it makes it somewhat intractable. It's amazing how much we can get away with treating floating point numbers as real numbers (Fortran even called them that!) but they aren't -- and they should always be used with caution, and only where appropriate. Frankly, I would argue that one should rarely, if ever, use a floating point type for a time axis; I'll concede that people have their reasons, but we certainly shouldn't imply in the CF docs that it's almost always the best practice. NOTE: in the current version, the only integer times are "days" and "hours", which is interesting to me, but someone presumably did it thoughtfully, and I think it IS good to clearly say "this is daily (hourly) data".
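(For what it's worth, xarray already lets you opt out of the float default when writing: set the time coordinate's encoding to integer storage at whatever resolution the units imply. A sketch -- the dataset and file names are made up:)

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"sst": ("time", np.random.rand(8))},  # made-up variable
    coords={"time": pd.date_range("2000-01-01", periods=8, freq="3h")},
)
# ask for integer seconds on disk instead of the float default
ds["time"].encoding.update(
    {"units": "seconds since 2000-01-01 00:00:00", "dtype": "int64"}
)
ds.to_netcdf("sst_3hourly.nc")
```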
-
So, I'm coming down on the side of "both". I think there are good reasons to record time as an int, and good reasons to record it as a float, and I think there is no universal best answer; it's going to depend on context. So I would favor using floats in some examples and ints in others.
-
Topic for discussion
I've noticed that a lot of the output from oceanographic models (what I work with) uses a floating point dtype in the time coordinate, e.g.:
This is, of course, perfectly CF-compliant, but floating point is not the best choice for a time variable:
If single precision, you lose second precision after about 3 years (which has caused me problems).
If double precision, you have millions of years with second precision, but then you have too much precision -- downstream tools have no idea how many of those digits are important. (This is relevant to, say, xarray, which currently uses nanosecond precision by default internally -- see the sketch after this list -- which is then problematic when you write it out again, as many tools (e.g. cftime) don't handle nanoseconds.)
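(A two-line sketch of that default, on xarray versions that are still nanosecond-only; newer releases are starting to allow other resolutions:)

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset(coords={"time": pd.date_range("2000-01-01", periods=3)})
print(ds["time"].dtype)  # datetime64[ns] -- xarray's internal representation
```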
Anyway, I think most would agree that floating point is not the best data type for time. As far as I know, every time library uses integers. (Except MS Excel, but I won't go there...)
This confused me, as most (all?) models use integers (usually seconds) internally -- so why write it out as floating point days or hours?
I think I may know why:
Almost all the examples in the CF doc use floating point types for time, e.g.:
Example 4.4. Time axis
And that is the initial example.
I just searched through the entire document, and found only two examples that are integer time types:
Example 5.7. Lambert conformal projection
Example 7.15. Timeseries with geometry.
and those are units of hours and days, so not suitable if you want sub-hour precision.
Some even use single precision (I think these are all in section 5: Coordinate Systems and Domain):
Example 5.21. A two-dimensional UGRID mesh topology variable
Ouch! I should have noticed that a long time ago! [1]
Anyway -- a lot of people (myself included) learn by example -- rather than reading the spec carefully, they look for an example that seems to be what they need, and copy that. Because of that, I think we really should have the examples follow best practices -- and I don't think floating point for time is generally a best practice.
My proposal: we replace (many of) the examples of time coordinates in the CF doc with integer type examples.
This is not a change of the convention at all -- I don't think CF says anything about what data type to use for, well, anything, not even as a recommendation. (should it make recommendations?) -- but what is the barrier to entry for changing examples in this way?
[1] I had never really thought about this until recently -- I used various num2date() functions, and they just worked. But I really became aware when I noticed that when working with output from FVCOM (an unstructured-mesh oceanographic model), I was getting timestamps like 12:00:18 when I should have gotten a nice round 12:00:00, and this made a mess because I wanted to use the results at 12:00:00 and got an out-of-time-bounds error. Anyway, turns out they were using float "days since" for the time axis. And now I think I may know why -- because that's what the UGRID example uses!
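(That FVCOM-style drift is easy to reproduce with cftime; a sketch in which the epoch and date are made up, and the size of the error grows with distance from the epoch:)

```python
import numpy as np
import cftime

units = "days since 2000-01-01 00:00:00"
# 13:00 is not a binary fraction of a day, so it can't be stored exactly
exact = cftime.date2num(cftime.datetime(2020, 7, 14, 13, 0, 0), units)
stored = np.float32(exact)  # what a float32 "days since" variable keeps
print(cftime.num2date(float(stored), units))  # typically off by ~10-20 seconds
print(float(np.spacing(stored)) * 86400, "s between adjacent float32 values here")
```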