Format of the location AggregationInstruction term #38

Open
nmassey001 opened this issue May 24, 2022 · 3 comments

@nmassey001
Collaborator

I struggled a bit with implementing the location AggregationInstruction as a variable in a netCDF file.
Can we remember why we decided on the current format? Here's an example (example 3):

    dimensions:
      // Extra dimensions
      i = 4 ;
      j = 2 ;
    variables:
      // Aggregation definition variables
      int location(i, j) ;
    data:
      location = 6, 6,
                 1, _,
                 73, _,
                 144, _ ;

Here, the sizes of the i and j dimensions depend on the aggregated variable: i is its number of dimensions, and j is the maximum number of fragments along any one of those dimensions.
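
To make the current encoding concrete, here's a minimal sketch (mine, not part of the conventions) of decoding example 3's location variable into per-dimension fragment sizes, with FILL standing in for the netCDF fill value written as _ above:

    FILL = None  # stand-in for the netCDF fill value ("_" in the CDL above)
    location = [[6, 6],
                [1, FILL],
                [73, FILL],
                [144, FILL]]

    # Row i holds the fragment sizes (spans) along parent dimension i,
    # padded out to the maximum fragment count (j) with fill values.
    sizes = [tuple(v for v in row if v is not FILL) for row in location]
    print(sizes)  # [(6, 6), (1,), (73,), (144,)]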

From an implementation point of view it makes more sense for location to follow the dimensions of the other AggregationInstructions, with an extra dimension (k) that is always 2. The location variable then holds actual indices (rather than spans), so the file contains everything needed to index the parent array directly; the current method requires the parser to reconstruct that information first.
I'm also a bit concerned that the current method does not really work for sparse arrays. Here is how I'd change the specification:

    dimensions:
      // Fragment dimensions
      f_time = 2 ;
      f_level = 1 ;
      f_latitude = 1 ;
      f_longitude = 1 ;
      // Extra dimensions
      k = 2;
    variables:
      // Aggregation definition variables
      int location(f_time, f_level, f_latitude, f_longitude, k) ;
    data:
      location = 0, 6,
                 0, 1,
                 0, 73,
                 0, 144,

                 6, 12,
                 0, 1,
                 0, 73,
                 0, 144 ;
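
With this encoding, a fragment can be placed into the parent array with no reconstruction step. A minimal sketch (assuming one (start, end) pair per parent-array dimension per fragment, and using numpy arrays as stand-ins for real data):

    import numpy as np

    # (start, end) pairs for the second fragment above, one pair per
    # parent-array dimension (time, level, latitude, longitude)
    location = [(6, 12), (0, 1), (0, 73), (0, 144)]

    parent = np.empty((12, 1, 73, 144))
    fragment = np.zeros((6, 1, 73, 144))  # stand-in for data read from a fragment file

    # the stored indices index the parent array as-is
    parent[tuple(slice(start, end) for start, end in location)] = fragment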

The main downside is that the representation is not as compact, but the CFA Aggregation Files are going to be very small compared to the Aggregated data anyway.

Let me know what you think @davidhassell , @bnlawrence , @JonathanGregory and @sadielbartholomew.

@bnlawrence
Collaborator

Well, to be honest, I think this is more immediately comprehensible than the other version!

@davidhassell
Collaborator

Hi Neil,

The discussion on the current encoding starts at #20 (review). I'm firmly in the "keep what we've got" camp. I'll try to explain ...

Storing the fragment sizes is logically equivalent to storing the start/end indices, has much less redundancy (we don't need to encode 0, 144 twice), and doesn't introduce yet another non-physical dimension (k). It also uses half as much disk space, and therefore requires less data to be read. As you say, this is small relative to the full dataset, but for a case with 1.5 million fragments (a real use case), the saving will be O(6 MB) assuming 32-bit ints, or O(12 MB) if 64-bit - is that significant?

Storing the sizes also removes the need to define whether our indices start from one or zero - whichever we chose would be arbitrary and would be sure to upset someone. In your example, you also use the Python (and other languages'?) practice of encoding the end index as one larger than the actual last index (i.e. 0, 6 as opposed to 0, 5), which can be convenient but is not ideal for language-agnostic conventions.
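
For example, the first time fragment (parent elements 0 to 5) is unambiguously size 6 under the size encoding, but its start/end pair depends entirely on the chosen convention:

    # the same fragment under different index conventions
    size = 6                       # unambiguous
    zero_based_half_open = (0, 6)  # Python-style
    zero_based_inclusive = (0, 5)
    one_based_inclusive = (1, 6)   # Fortran-style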

As I can think of no logical benefit to storing the start/end indices, as opposed to the sizes, and there are abstract reasons to prefer the latter, I think that we should keep the original formulation of the conventions rather than change it to make an implementation easier.

Wait, I hear you say that the new suggestion is more readable - but not to me! Coming from the dask world, sizes seem very natural to me, as that is exactly how dask stores its chunksizes:

>>> import numpy as np
>>> import dask.array as da
>>> a = np.arange(12*73*144).reshape(12, 1, 73, 144)
>>> x = da.from_array(a, chunks=(6, 1, 73, 144))       # dask array similar to the above example
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> x.numblocks
(2, 1, 1, 1)

I'm not using the dask implementation to support my reasons above for preferring sizes, just to show how I'm used to using them, and so am comfortable with them. If dask used start/end indices, I'm sure those would click better with me (as they used to). Even in that case, I'd still prefer sizes in the CFA conventions, for the arguments put forward in #20 and here.

In Python, the code to convert from sizes ([6, 1, 73, 144]) to indices ([(6, 12), (0, 1), (0, 73), (0, 144)]) is essentially just a walk-through of the cumulative sums of the sizes along each dimension:

>>> from dask.utils import cached_cumsum
>>> from itertools import product
>>> cached_cumsum([1, 2, 3, 4, 5], initial_zero=True)  # test cumsum ... all good
(0, 1, 3, 6, 10, 15)
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> cumdims = [cached_cumsum(bds, initial_zero=True) for bds in x.chunks]
>>> cumdims
[(0, 6, 12), (0, 1), (0, 73), (0, 144)]
>>> indices = [[(s, s + dim) for s, dim in zip(starts, shapes)] for starts, shapes in zip(cumdims, x.chunks)]
>>> indices
[[(0, 6), (6, 12)], [(0, 1)], [(0, 73)], [(0, 144)]]
>>> for index in product(*indices):
...     print(index)
...
((0, 6), (0, 1), (0, 73), (0, 144))
((6, 12), (0, 1), (0, 73), (0, 144))

I know that it's probably a lot easier to write this in Python than in C :) but it seems like a succinct and usable procedure.
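
For what it's worth, the same walk-through needs nothing beyond the standard library - here's a sketch with itertools.accumulate playing the role of cached_cumsum:

    from itertools import accumulate, product

    chunks = [(6, 6), (1,), (73,), (144,)]

    # cumulative sums of the sizes along each dimension, starting from zero
    cumdims = [list(accumulate(bds, initial=0)) for bds in chunks]
    indices = [[(s, s + dim) for s, dim in zip(starts, shapes)]
               for starts, shapes in zip(cumdims, chunks)]

    for index in product(*indices):
        print(index)
    # ((0, 6), (0, 1), (0, 73), (0, 144))
    # ((6, 12), (0, 1), (0, 73), (0, 144))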

Cheers,
David

@nmassey001
Collaborator Author

Thanks for the reply @davidhassell. In the review I did write:

I'm all for minimal and compact representations, as you probably know, so I'd be happy with this change

I'm happy to stick with the current way of doing it, as I've already implemented it and have now had confirmation from you that:

  1. Reconstructing the indices is an iterative procedure
  2. It is expected that the indices will need to be reconstructed before doing anything useful

The question remains, though: how do we specify very sparse arrays?
