-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable ncCF format requests to TableDAP #799
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I wish we could have a better solution at the ERDDAP server level though but re-reading the Google Group thread I don't think that is possible.
PS: I made some minor comments that are not too important.
io.BytesIO buffer object | ||
""" | ||
|
||
vstr = opendap.create_DAP_variable_str(url) # variable str from DDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment: you called vstr
varstr
in the create_DAP_variable_str
function, it could helps others reading your code to use the same name here. Also, your variable names are too short 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guilty as charged... I might be getting influenced by too many scientists 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait! I'm a scientists!! Am I doing my job wrong ;-p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're cut from a different cloth felipe!
|
||
# encode as proper URL characters | ||
varstr = urllib.parse.quote(",".join(lst)) | ||
return varstr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is an ERDDAP trick to request the JSON with out any data, they this could be simply a request to get columnNames
.
@@ -723,7 +723,13 @@ def load_remote_dataset(self, ds_str): | |||
:param str ds_str: URL to the remote resource | |||
''' | |||
|
|||
if opendap.is_opendap(ds_str): | |||
if erddap.is_tabledap(ds_str): | |||
return Dataset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is slower than downloading the file and running the checks locally as you say, could something like a temporary file be an alternative faster solution (I did not run any tests)?
Downloading the file locally and removing it when done could be something like:
@contextmanager
def tempnc(data: BinaryIO) -> Generator[str, None, None]:
from tempfile import NamedTemporaryFile
tmp = None
try:
tmp = NamedTemporaryFile(suffix=".nc", prefix="compliance-checker_")
tmp.write(data)
tmp.flush()
yield tmp.name
finally:
if tmp is not None:
tmp.close()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need some way to limit the amount of data requested? The goal of the check is really to check the dataset structure, not so much the contents of the whole dataset. I'm not really sure how this part of CC works, but is there a row limit or something that can be included in the ERDDAP request, if we're not doing that already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few tests that requires the data/coordinates but a .cdl
only test (metadata only) for online uses is something I've been wanting for a long time. Even though that would be a partial compliance test I believe that this is what we need 90%. However, do we have an ERDDAP response that returns only the metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, need to check the data for some tests. Not sure about this one though, mostly attribute-only, with some dimension checks like this one. Possibly forgetting some things tho.
There's the .das
response, for example: http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.das.
This is what Bob pointed me to as most useful for checking dataset metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the OPeNDAP protocol, dimension information is typically be found in the DDS
, at the .dds
format, and would look something like this:
...
Float64 mask_v[eta_v = 290][xi_v = 332];
Float64 mask_psi[eta_psi = 290][xi_psi = 331];
Float32 zeta[time = 762][eta_rho = 291][xi_rho = 332];
Float32 u[time = 762][s_rho = 20][eta_u = 291][xi_u = 331];
Float32 v[time = 762][s_rho = 20][eta_v = 290][xi_v = 332];
...
However, a request to an ERDDAP source doesn't bring back the same structure:
dalton@ubuntu:~$ curl -L "http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.dds"
Dataset {
Sequence {
Float64 time;
Float64 latitude;
Float64 longitude;
Float64 z;
Float64 sea_water_velocity_to_direction;
Int32 sea_water_velocity_to_direction_qc_agg;
String sea_water_velocity_to_direction_qc_tests;
Float64 sea_water_speed;
Int32 sea_water_speed_qc_agg;
String sea_water_speed_qc_tests;
Float64 sea_water_temperature;
Int32 sea_water_temperature_qc_agg;
String sea_water_temperature_qc_tests;
Float64 peak_wave_period;
Int32 peak_wave_period_qc_agg;
String peak_wave_period_qc_tests;
Float64 sea_surface_wave_significant_height;
Int32 sea_surface_wave_significant_height_qc_agg;
String sea_surface_wave_significant_height_qc_tests;
Float64 sea_surface_wave_from_direction;
Int32 sea_surface_wave_from_direction_qc_agg;
String sea_surface_wave_from_direction_qc_tests;
String station;
} s;
} s;
I'll dig around in the documentation to see if that's a possibility.
For the time being, the logic in this commit produces the desired results of requesting the .ncCF
format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah. Duh, I've been staring at those forever, should have thought of that.
For the .das
response, the reason that may not show dimensions like other DAP servers is that the different ERDDAP output formats have different dimensionality, so it can't account for each of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ocefpaf: there's also the option to read a netCDF file into memory. It should probably be preferred over writing to a tempfile when possible and when the netCDF-C library has been compiled with in-memory support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. But if we are aiming for a broader audience we should not rely on an option that may not be available, right?
PS: I believe that the conda-forge pkg is compiled with in-memory support but I'm not 100% sure. I'll check and fix it if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daltonkell where do we stand with merging this PR and resolving this dimension check issue? Interested to include this in the next RC, if possible.
Can we implement the dimension check, and maybe most or all of the IOOS checks, by requesting the .ncCFHeader
response instead of the full .ncCF
output? For some ERDDAP datasets, if there aren't limits placed on the request, any of the .nc
, .ncCF
, .ncCFMA
, or even .csv
output types could end up requesting a lot of data that could contribute to poor performance.
That may be too big a change though. If so, let's go with .ncCF
and see what results are for next RC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think including this in the next RC is appropriate, but we'll have to leave the .ncCFHeader
request out for another edition. It's a great idea and perhaps we can get some good contributions for it, but it requires finding a way to "switch off" the checks which examine data (which would obviously fail).
I'm currently working on merging concepts from this PR and Ben's latest, #800, because his implements a useful abstraction for handling any remote netCDF resource.
@ocefpaf The JSON and tempfile comments are very intriguing, I'll try to write some stuff up for that to see if it's feasible. Thanks for the feedback! |
UpdateBefore I went and tested the
That's reasonable. This PacIOOS dataset seems to take longer though:
In fact, in earlier tries this morning, this dataset took over 4 minutes to load up, obviously an unacceptable timeframe. 23 seconds is certainly not ideal, but... I have a hunch this is a mixture of IO and computation. When ERDDAP receives a request for a particular format, it must generate that format at request-time. If that data were not cached, ERDDAP must then open all the necessary files and then generate the requested format, then send the bytes over -- much like My first try at a SECOORA dataset:
|
bool | ||
""" | ||
|
||
if "tabledap" in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using url.lower() here just to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use return "tabledap" in url.lower()
to simplify the return since it's returning boolean type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also can't necessarily guarantee "tabledap" will be in the URL, e.g. Apache/Nginx setups. Usually it will though.
If there is no speed gain I prefer the approach you have here. Dealing with tempfiles is always troublesome, specially on Windows. |
bool | ||
""" | ||
|
||
if "tabledap" in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use return "tabledap" in url.lower()
to simplify the return since it's returning boolean type.
bool | ||
""" | ||
|
||
if "tabledap" in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also can't necessarily guarantee "tabledap" will be in the URL, e.g. Apache/Nginx setups. Usually it will though.
with urllib.request.urlopen(f"{url}.dds") as resp: | ||
strb = io.StringIO(resp.read().decode()) | ||
|
||
strb.seek(8) # remove "Dataset " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a context manager/with
here.
""" | ||
|
||
vstr = opendap.create_DAP_variable_str(url) # variable str from DDS | ||
_url = f'{".".join([url, ftype])}?{vstr}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to maintain backwards compatibility with Python 3 versions less than 3.6. For this reason, str.format()
should be used instead of f-strings.
strb.close() | ||
|
||
# remove beginning and ending braces, split on newlines | ||
lst = list(filter(lambda x: "{" not in x and "}" not in x, x.split("\n"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to express this as a nested loop rather than reassigning to the same variable. If there are issues with the parsing, it makes it much easier to track down the issues because the variable reference isn't changed.
ERDDAP TableDAP Requests
When making a request to
TableDAP
, ensure the.ncCF
(Contiguous ragged array) format is returned. In order for the data to be returned, the full URl must be generated listing all the variables as theTableDAP
query. The binary data is fetched from the server, and then instantiated as anetCDF4.Dataset
representation, allowing proper checking.Users are only required to supply the base URL to the dataset, e.g.
This commit addresses #798.
It should be noted that this approach takes a significant amount of time to fetch the data, far longer than downloading the file and running the checker on a local file.
Unit tests are forthcoming after deliberating on this approach.