Enable ncCF format requests to TableDAP #799

Closed (wants to merge 1 commit)
42 changes: 42 additions & 0 deletions compliance_checker/protocols/erddap.py
@@ -0,0 +1,42 @@
import io
import urllib.request
import compliance_checker.protocols.opendap as opendap

def is_tabledap(url):
    """
    Identify a dataset as an ERDDAP TableDAP dataset.

    Parameters
    ----------
    url (str) : URL to dataset

    Returns
    -------
    bool
    """
    if "tabledap" in url:
Contributor: I'd suggest using url.lower() here just to be safe.

Contributor: Just use return "tabledap" in url.lower() to simplify the return, since it's returning a boolean.

Contributor: You also can't necessarily guarantee "tabledap" will be in the URL, e.g. with Apache/Nginx reverse-proxy setups. Usually it will, though.

        return True
    return False
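Combining the two reviewer suggestions above (case-insensitive matching and returning the expression directly), the check could collapse to a one-liner; a sketch, not the merged code:

```python
def is_tabledap(url):
    # Lowercase the URL and return the membership test directly,
    # as both reviewers suggested. Note the caveat above: a proxied
    # ERDDAP may not have "tabledap" in the URL at all.
    return "tabledap" in url.lower()
```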

def get_tabledap_bytes(url, ftype):
    """
    ERDDAP TableDAP returns an OPeNDAP "sequence" response by default
    when no file extension is provided. If a user wishes to get a dataset
    from an ERDDAP TableDAP URL, append the desired file extension and return
    a byte buffer object containing the data.

    Parameters
    ----------
    url (str) : URL to TableDAP dataset
    ftype (str) : file format extension

    Returns
    -------
    io.BytesIO buffer object
    """
    vstr = opendap.create_DAP_variable_str(url)  # variable str from DDS
Member: Minor comment: you called vstr varstr in the create_DAP_variable_str function; using the same name here could help others reading your code. Also, your variable names are too short 😄

Contributor Author: Guilty as charged... I might be getting influenced by too many scientists 😆

Member: Wait! I'm a scientist!! Am I doing my job wrong? ;-p

Contributor: You're cut from a different cloth, Felipe!

    _url = f'{".".join([url, ftype])}?{vstr}'
Contributor: We want to maintain backwards compatibility with Python 3 versions less than 3.6. For this reason, str.format() should be used instead of f-strings.

    with urllib.request.urlopen(_url) as resp:
        return io.BytesIO(resp.read())
37 changes: 37 additions & 0 deletions compliance_checker/protocols/opendap.py
@@ -4,8 +4,45 @@

Functions to assist in determining if the URL is an OPeNDAP endpoint
'''
import io
import requests
import urllib.parse
import urllib.request

def create_DAP_variable_str(url):
    """
    Create a URL-encoded string of variables for a given DAP dataset.
    Works on OPeNDAP datasets.

    Parameters
    ----------
    url (str): endpoint to *DAP dataset

    Returns
    -------
    str
    """
    # get the DDS (dataset descriptor structure)
    with urllib.request.urlopen(f"{url}.dds") as resp:
        strb = io.StringIO(resp.read().decode())

    strb.seek(8)  # skip the leading "Dataset " token
Contributor: Consider using a context manager (with) here.

    x = strb.read()
    strb.close()

    # split on newlines and drop the lines containing braces
    lst = list(filter(lambda x: "{" not in x and "}" not in x, x.split("\n")))
Contributor: Try to express this as a nested loop rather than reassigning to the same variable. If there are issues with the parsing, it makes them much easier to track down because the variable reference isn't changed.


    # remove all the extra space used in the DDS string
    lst = list(filter(None, map(lambda x: x.strip(" "), lst)))

    # split off the type, grab only the variable name, and remove the ";"
    lst = list(map(lambda x: x.split(" ")[-1].strip(";"), lst))

    # encode as proper URL characters
    varstr = urllib.parse.quote(",".join(lst))
    return varstr
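The loop-based refactor suggested in the review (distinct names instead of reassigning lst) might look like the sketch below; parse_dds_variables is a hypothetical helper, not code from the PR:

```python
import urllib.parse

def parse_dds_variables(dds_text):
    # One pass over the DDS text with distinct variable names,
    # instead of filter/map chains reassigned to the same list
    names = []
    for raw_line in dds_text.split("\n"):
        if "{" in raw_line or "}" in raw_line:
            continue  # skip Dataset/Sequence brace lines
        stripped = raw_line.strip(" ")
        if not stripped:
            continue  # skip blank lines
        # the last whitespace-separated token is the variable name
        names.append(stripped.split(" ")[-1].strip(";"))
    return urllib.parse.quote(",".join(names))
```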
Member: I wonder if there is an ERDDAP trick to request the JSON without any data; then this could simply be a request to get columnNames.


def is_opendap(url):
'''
10 changes: 8 additions & 2 deletions compliance_checker/suite.py
@@ -15,7 +15,7 @@
from compliance_checker.cf.cf import CFBaseCheck
from owslib.sos import SensorObservationService
from owslib.swe.sensor.sml import SensorML
from compliance_checker.protocols import opendap, netcdf, cdl
from compliance_checker.protocols import opendap, netcdf, cdl, erddap
from compliance_checker.base import BaseCheck
from compliance_checker import MemoizedDataset
from collections import defaultdict
@@ -723,7 +723,13 @@ def load_remote_dataset(self, ds_str):
:param str ds_str: URL to the remote resource
'''

        if erddap.is_tabledap(ds_str):
            return Dataset(
Member: If this is slower than downloading the file and running the checks locally, as you say, could something like a temporary file be a faster alternative? (I did not run any tests.)

Downloading the file locally and removing it when done could be something like:

from contextlib import contextmanager
from tempfile import NamedTemporaryFile
from typing import Generator

@contextmanager
def tempnc(data: bytes) -> Generator[str, None, None]:
    tmp = None
    try:
        tmp = NamedTemporaryFile(suffix=".nc", prefix="compliance-checker_")
        tmp.write(data)
        tmp.flush()
        yield tmp.name
    finally:
        if tmp is not None:
            tmp.close()

Member: Do we need some way to limit the amount of data requested? The goal of the check is really to verify the dataset structure, not so much the contents of the whole dataset. I'm not really sure how this part of CC works, but is there a row limit or something that can be included in the ERDDAP request, if we're not doing that already?

Member (@ocefpaf, Apr 20, 2020): There are a few tests that require the data/coordinates, but a .cdl-only (metadata-only) test for online use is something I've been wanting for a long time. Even though that would be a partial compliance test, I believe it is what we need 90% of the time. However, do we have an ERDDAP response that returns only the metadata?

Member: That's true, some tests need the data. Not sure about this one, though; it's mostly attribute-only, with some dimension checks like this one. Possibly forgetting some things, though.

There's the .das response, for example: http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.das

This is what Bob pointed me to as most useful for checking dataset metadata.

Contributor Author: With the OPeNDAP protocol, dimension information is typically found in the DDS (the .dds response), and would look something like this:

...
    Float64 mask_v[eta_v = 290][xi_v = 332];
    Float64 mask_psi[eta_psi = 290][xi_psi = 331];
    Float32 zeta[time = 762][eta_rho = 291][xi_rho = 332];
    Float32 u[time = 762][s_rho = 20][eta_u = 291][xi_u = 331];
    Float32 v[time = 762][s_rho = 20][eta_v = 290][xi_v = 332];
...

However, a request to an ERDDAP source doesn't bring back the same structure:

dalton@ubuntu:~$ curl -L "http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.dds"                                               
Dataset {
  Sequence {
    Float64 time;
    Float64 latitude;
    Float64 longitude;
    Float64 z;
    Float64 sea_water_velocity_to_direction;
    Int32 sea_water_velocity_to_direction_qc_agg;
    String sea_water_velocity_to_direction_qc_tests;
    Float64 sea_water_speed;
    Int32 sea_water_speed_qc_agg;
    String sea_water_speed_qc_tests;
    Float64 sea_water_temperature;
    Int32 sea_water_temperature_qc_agg;
    String sea_water_temperature_qc_tests;
    Float64 peak_wave_period;
    Int32 peak_wave_period_qc_agg;
    String peak_wave_period_qc_tests;
    Float64 sea_surface_wave_significant_height;
    Int32 sea_surface_wave_significant_height_qc_agg;
    String sea_surface_wave_significant_height_qc_tests;
    Float64 sea_surface_wave_from_direction;
    Int32 sea_surface_wave_from_direction_qc_agg;
    String sea_surface_wave_from_direction_qc_tests;
    String station;
  } s;
} s;

I'll dig around in the documentation to see if that's a possibility.

For the time being, the logic in this commit produces the desired results of requesting the .ncCF format.

Member: Oh, yeah. Duh, I've been staring at those forever; I should have thought of that.

For the .das response, the reason it may not show dimensions like other DAP servers do is that the different ERDDAP output formats have different dimensionality, so it can't account for each of them.

Contributor: @ocefpaf: there's also the option to read a netCDF file into memory. It should probably be preferred over writing to a tempfile when possible, and when the netCDF-C library has been compiled with in-memory support.

Member: Indeed. But if we are aiming for a broader audience, we should not rely on an option that may not be available, right?

PS: I believe the conda-forge package is compiled with in-memory support, but I'm not 100% sure. I'll check and fix it if not.

Member: @daltonkell where do we stand with merging this PR and resolving this dimension-check issue? Interested to include this in the next RC, if possible.

Can we implement the dimension check, and maybe most or all of the IOOS checks, by requesting the .ncCFHeader response instead of the full .ncCF output? For some ERDDAP datasets, if there aren't limits placed on the request, any of the .nc, .ncCF, .ncCFMA, or even .csv output types could end up requesting a lot of data, which could contribute to poor performance.

That may be too big a change, though. If so, let's go with .ncCF and see what the results are for the next RC.
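A metadata-only request along those lines would just swap the file type suffix; a sketch of the URL construction only, where nccf_header_url is a hypothetical helper (the actual check changes would still be needed):

```python
def nccf_header_url(url, varstr=""):
    # .ncCFHeader asks ERDDAP for only the ncdump-style header text,
    # avoiding the full-data .ncCF download
    query = "?{0}".format(varstr) if varstr else ""
    return "{0}.ncCFHeader{1}".format(url, query)
```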

Contributor Author: I think including this in the next RC is appropriate, but we'll have to leave the .ncCFHeader request out for another edition. It's a great idea, and perhaps we can get some good contributions for it, but it requires finding a way to "switch off" the checks which examine data (which would obviously fail).

I'm currently working on merging concepts from this PR and Ben's latest, #800, because his implements a useful abstraction for handling any remote netCDF resource.

                ds_str,
                mode="r",
                memory=erddap.get_tabledap_bytes(ds_str, "ncCF").getbuffer()
            )
        elif opendap.is_opendap(ds_str):
            return Dataset(ds_str)
        else:
            # Check if the HTTP response is XML; if it is, it's likely SOS, so