Format of the location AggregationInstruction term #38
Comments
Well, to be honest, I think this is more immediately comprehensible than the other version!
Hi Neil,

The discussion on the current encoding can be found in the thread starting at #20 (review). I'm firmly in the "keep what we've got" camp. I'll try to explain ...

Storing the fragment sizes is logically equivalent to storing the start/end indices, but has much less redundancy (we don't need to encode each internal boundary twice, since the end of one fragment defines the start of the next). Storing the sizes also removes the need to define whether our indices start from one or zero - whichever we would choose would be arbitrary and would be sure to upset someone. In your example, you also use the Python (and other languages?) practice of the encoded end index being one larger than the actual end index (i.e. (0, 6) covers indices 0 to 5), which is yet another convention that would have to be written into the specification.

As I can think of no logical benefit to storing the start/end indices, as opposed to the sizes, and there are abstract reasons to prefer the latter, I think that we should keep the original formulation of the conventions rather than change it to make an implementation easier.

Wait, I hear you say that the new suggestion is more readable - but not to me! Coming from the dask side of things, the sizes are exactly how dask describes its chunks:

```python
>>> import numpy as np
>>> import dask.array as da
>>> a = np.arange(12*73*144).reshape(12, 1, 73, 144)
>>> x = da.from_array(a, chunks=(6, 1, 73, 144)) # dask array similar to the above example
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> x.numblocks
(2, 1, 1, 1)
```

(I'm not using the dask array for anything here - it's just to illustrate the similarity.) In Python, the code to convert from sizes (as stored in `x.chunks`) to start/end indices is short:

```python
>>> from dask.utils import cached_cumsum
>>> from itertools import product
>>> cached_cumsum([1, 2, 3, 4, 5], initial_zero=True) # test cumsum ... all good
(0, 1, 3, 6, 10, 15)
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> cumdims = [cached_cumsum(bds, initial_zero=True) for bds in x.chunks]
>>> cumdims
[(0, 6, 12), (0, 1), (0, 73), (0, 144)]
>>> indices = [[(s, s + dim) for s, dim in zip(starts, shapes)] for starts, shapes in zip(cumdims, x.chunks)]
>>> indices
[[(0, 6), (6, 12)], [(0, 1)], [(0, 73)], [(0, 144)]]
>>> for index in product(*indices):
...     print(index)
...
((0, 6), (0, 1), (0, 73), (0, 144))
((6, 12), (0, 1), (0, 73), (0, 144))
```

I know that it's probably a lot easier to write this in Python rather than C :) but it seems like a succinct and usable procedure.

Cheers,
David
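For reference, the same sizes-to-indices conversion can also be written without any dask utilities, using only numpy and itertools. This is a minimal sketch assuming the same fragment sizes as in the example above (the name `fragment_sizes` is purely illustrative):

```python
>>> # Minimal sketch: convert per-dimension fragment sizes to (start, stop) index pairs.
>>> import numpy as np
>>> from itertools import product
>>> fragment_sizes = [(6, 6), (1,), (73,), (144,)]  # same values as x.chunks above
>>> cumdims = [np.concatenate(([0], np.cumsum(sizes))) for sizes in fragment_sizes]
>>> indices = [
...     [(int(start), int(start + size)) for start, size in zip(starts, sizes)]
...     for starts, sizes in zip(cumdims, fragment_sizes)
... ]
>>> indices
[[(0, 6), (6, 12)], [(0, 1)], [(0, 73)], [(0, 144)]]
>>> for index in product(*indices):
...     print(index)
...
((0, 6), (0, 1), (0, 73), (0, 144))
((6, 12), (0, 1), (0, 73), (0, 144))
```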
Thanks for the reply @davidhassell. In the review I did write:
I'm happy to stick with the current way of doing it, as I've already implemented it and now had confirmation from you that:
The question remains as to how we specify very sparse arrays.
I struggled a bit with implementing the location AggregationInstruction as a variable in a netCDF file.
Can we remember why we decided on the current format? Here's an example (example 3)
Here, the `i` and `j` dimensions depend on how many dimensions the variable has and the maximum number of fragments across each dimension.

From an implementation point of view it makes more sense to have `location` follow the dimensions of the other AggregationInstructions, with an extra dimension (`k`) that is always 2. Then `location` consists of actual locations (rather than spans) and the file contains the information you need to index the parent array; the current method requires the parser to build that information itself.

I'm also a bit concerned that the current method does not really work for sparse arrays. Here is how I'd change the specification:
The main downside is that the representation is not as compact, but the CFA Aggregation Files are going to be very small compared to the Aggregated data anyway.
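To make the contrast concrete, here is a minimal sketch (an illustration only, not the omitted specification text; the name `fragment_location` and the shapes used are assumed): once explicit (start, stop) pairs are stored for each fragment, indexing the parent array is a direct tuple of slices, whereas the size-based encoding first needs the cumulative-sum step shown in the dask example above.

```python
# Illustrative sketch only: names and shapes are assumed, not taken from the conventions.
import numpy as np

parent = np.zeros((12, 1, 73, 144))  # the aggregated (parent) array from the earlier example

# Hypothetical per-fragment location holding explicit (start, stop) pairs,
# one pair per parent dimension (i.e. a trailing dimension of size 2):
fragment_location = np.array([[6, 12], [0, 1], [0, 73], [0, 144]])

# With the pairs in hand, the parent array can be indexed directly,
# with no accumulation step:
index = tuple(slice(int(start), int(stop)) for start, stop in fragment_location)
print(parent[index].shape)  # (6, 1, 73, 144)

# With the size-based encoding, the same (start, stop) pairs have to be
# reconstructed by accumulating the fragment sizes first (the cumsum procedure above).
```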
Let me know what you think @davidhassell, @bnlawrence, @JonathanGregory and @sadielbartholomew.