Enable ncCF format requests to TableDAP #799

Closed
wants to merge 1 commit

Conversation

daltonkell (Contributor)

ERDDAP TableDAP Requests

When making a request to TableDAP, this change ensures the .ncCF (contiguous ragged array) format is returned. For the data to be returned, the full URL must be generated, listing all of the variables as the TableDAP query. The binary data is fetched from the server and then instantiated as a netCDF4.Dataset representation, allowing proper checking.
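A rough sketch of that flow (illustrative only: the helper name is hypothetical, and the in-memory open assumes a netCDF-C build with in-memory support):

import urllib.parse
import urllib.request

import netCDF4


def open_tabledap_as_nccf(url, variables):
    """Fetch a TableDAP dataset as .ncCF and open it for checking.

    `variables` is the full variable list parsed from the DDS.
    """
    # build the full query URL: <base>.ncCF?var1,var2,...
    query = urllib.parse.quote(",".join(variables))
    full_url = "{}.ncCF?{}".format(url, query)
    # fetch the raw netCDF bytes and open them without touching disk
    with urllib.request.urlopen(full_url) as resp:
        return netCDF4.Dataset("tabledap.nc", mode="r", memory=resp.read())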

Users are only required to supply the base URL to the dataset, e.g.

$ compliance-checker -t ioos "http://data.glos.us/erddap/tabledap/glerlwe2"

This commit addresses #798.

It should be noted that this approach takes a significant amount of time to fetch the data, far longer than downloading the file and running the checker on a local file.

Unit tests are forthcoming after we deliberate on this approach.

ocefpaf (Member) left a comment:

LGTM! I wish we had a better solution at the ERDDAP server level, but re-reading the Google Group thread I don't think that is possible.

PS: I made some minor comments that are not too important.

io.BytesIO buffer object
"""

vstr = opendap.create_DAP_variable_str(url) # variable str from DDS
Member:

Minor comment: you called vstr varstr in the create_DAP_variable_str function; it could help others reading your code to use the same name here. Also, your variable names are too short 😄

Contributor (author):

Guilty as charged... I might be getting influenced by too many scientists 😆

Member:

Wait! I'm a scientist!! Am I doing my job wrong? ;-p

Contributor:

You're cut from a different cloth, Felipe!


# encode as proper URL characters
varstr = urllib.parse.quote(",".join(lst))
return varstr
Member:

I wonder if there is an ERDDAP trick to request the JSON without any data; then this could simply be a request to get the columnNames.
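One data-free route might be ERDDAP's info service, which serves dataset metadata as JSON. A sketch, assuming the standard /erddap/info/<datasetID>/index.json layout:

import json
import urllib.request


def erddap_variable_names(tabledap_url):
    """Sketch: pull variable names from ERDDAP's metadata-only info service."""
    base, dataset_id = tabledap_url.rsplit("/tabledap/", 1)
    info_url = "{}/info/{}/index.json".format(base, dataset_id)
    with urllib.request.urlopen(info_url) as resp:
        table = json.loads(resp.read().decode())["table"]
    # each row is [rowType, variableName, attributeName, dataType, value]
    return [row[1] for row in table["rows"] if row[0] == "variable"]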

@@ -723,7 +723,13 @@ def load_remote_dataset(self, ds_str):
:param str ds_str: URL to the remote resource
'''

if opendap.is_opendap(ds_str):
if erddap.is_tabledap(ds_str):
return Dataset(
Member:

If this is slower than downloading the file and running the checks locally, as you say, could something like a temporary file be a faster alternative (I did not run any tests)?

Downloading the file locally and removing it when done could be something like:

from contextlib import contextmanager
from tempfile import NamedTemporaryFile
from typing import Generator


@contextmanager
def tempnc(data: bytes) -> Generator[str, None, None]:
    tmp = None
    try:
        tmp = NamedTemporaryFile(suffix=".nc", prefix="compliance-checker_")
        tmp.write(data)  # data is the raw bytes of the fetched file
        tmp.flush()
        yield tmp.name
    finally:
        if tmp is not None:
            tmp.close()

Member:

Do we need some way to limit the amount of data requested? The goal of the check is really to check the dataset structure, not so much the contents of the whole dataset. I'm not really sure how this part of CC works, but is there a row limit or something that can be included in the ERDDAP request, if we're not doing that already?

ocefpaf (Member), Apr 20, 2020:

There are a few tests that require the data/coordinates, but a .cdl-only test (metadata only) for online use is something I've been wanting for a long time. Even though that would be a partial compliance test, I believe it covers 90% of what we need. However, do we have an ERDDAP response that returns only the metadata?

Member:

That's true, some tests do need to check the data. Not sure about this one, though; it's mostly attribute-only, with some dimension checks like the one here. I may be forgetting some things.

There's the .das response, for example: http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.das.

This is what Bob pointed me to as most useful for checking dataset metadata.

Contributor (author):

With the OPeNDAP protocol, dimension information is typically found in the DDS (the .dds response), and it looks something like this:

...
    Float64 mask_v[eta_v = 290][xi_v = 332];
    Float64 mask_psi[eta_psi = 290][xi_psi = 331];
    Float32 zeta[time = 762][eta_rho = 291][xi_rho = 332];
    Float32 u[time = 762][s_rho = 20][eta_u = 291][xi_u = 331];
    Float32 v[time = 762][s_rho = 20][eta_v = 290][xi_v = 332];
...

However, a request to an ERDDAP source doesn't bring back the same structure:

dalton@ubuntu:~$ curl -L "http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.dds"                                               
Dataset {
  Sequence {
    Float64 time;
    Float64 latitude;
    Float64 longitude;
    Float64 z;
    Float64 sea_water_velocity_to_direction;
    Int32 sea_water_velocity_to_direction_qc_agg;
    String sea_water_velocity_to_direction_qc_tests;
    Float64 sea_water_speed;
    Int32 sea_water_speed_qc_agg;
    String sea_water_speed_qc_tests;
    Float64 sea_water_temperature;
    Int32 sea_water_temperature_qc_agg;
    String sea_water_temperature_qc_tests;
    Float64 peak_wave_period;
    Int32 peak_wave_period_qc_agg;
    String peak_wave_period_qc_tests;
    Float64 sea_surface_wave_significant_height;
    Int32 sea_surface_wave_significant_height_qc_agg;
    String sea_surface_wave_significant_height_qc_tests;
    Float64 sea_surface_wave_from_direction;
    Int32 sea_surface_wave_from_direction_qc_agg;
    String sea_surface_wave_from_direction_qc_tests;
    String station;
  } s;
} s;

I'll dig around in the documentation to see if that's a possibility.

For the time being, the logic in this commit produces the desired result of requesting the .ncCF format.

Member:

Oh, yeah. Duh, I've been staring at those forever, should have thought of that.

As for the .das response, the reason it may not show dimensions like other DAP servers do is that the different ERDDAP output formats have different dimensionality, so it can't account for each of them.

Contributor:

@ocefpaf: there's also the option to read a netCDF file into memory. It should probably be preferred over writing to a tempfile when possible and when the netCDF-C library has been compiled with in-memory support.
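In netCDF4-python, that path reads roughly as follows; the memory keyword is real library API, but it only works when the underlying netCDF-C was built with in-memory support (URL is illustrative):

import urllib.request

import netCDF4

# fetch the raw bytes of the netCDF file
with urllib.request.urlopen("http://data.glos.us/erddap/tabledap/glerlwe2.ncCF") as resp:
    nc_bytes = resp.read()

# the filename is only a label here; nothing is written to disk
ds = netCDF4.Dataset("in-memory.nc", mode="r", memory=nc_bytes)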

Member:

Indeed. But if we are aiming for a broader audience, we should not rely on an option that may not be available, right?

PS: I believe that the conda-forge pkg is compiled with in-memory support but I'm not 100% sure. I'll check and fix it if not.
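A runtime probe could settle that without relying on build flags. A sketch: the 32 probe bytes form a minimal, empty classic-format netCDF file, and the exception types are a guess at what an unsupported build would raise:

import netCDF4


def has_inmemory_support():
    """Probe whether the linked netCDF-C can open bytes in memory."""
    # magic "CDF\x01" followed by zeroed numrecs/dim/gatt/var lists
    probe = b"CDF\x01" + b"\x00" * 28
    try:
        ds = netCDF4.Dataset("probe.nc", mode="r", memory=probe)
        ds.close()
        return True
    except (OSError, RuntimeError):
        return False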

Member:

@daltonkell where do we stand with merging this PR and resolving this dimension check issue? Interested to include this in the next RC, if possible.

Can we implement the dimension check, and maybe most or all of the IOOS checks, by requesting the .ncCFHeader response instead of the full .ncCF output? For some ERDDAP datasets, if there aren't limits placed on the request, any of the .nc, .ncCF, .ncCFMA, or even .csv output types could end up requesting a lot of data that could contribute to poor performance.

That may be too big a change though. If so, let's go with .ncCF and see what the results are for the next RC.
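For reference, the header-only response is just another file-type suffix on the same dataset URL, so a metadata-only fetch might look like (sketch; URL is illustrative):

import urllib.request

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
# .ncCFHeader returns the file header (CDL-style text) with no data rows
with urllib.request.urlopen("{}.ncCFHeader".format(url)) as resp:
    header_text = resp.read().decode()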

Contributor (author):

I think including this in the next RC is appropriate, but we'll have to leave the .ncCFHeader request out for another edition. It's a great idea and perhaps we can get some good contributions for it, but it requires finding a way to "switch off" the checks which examine data (which would obviously fail).

I'm currently working on merging concepts from this PR and Ben's latest, #800, because his PR implements a useful abstraction for handling any remote netCDF resource.

daltonkell (Contributor, author):

@ocefpaf The JSON and tempfile comments are very intriguing, I'll try to write some stuff up for that to see if it's feasible. Thanks for the feedback!

daltonkell (Contributor, author) commented Apr 20, 2020:

Update

Before I went and tested the tempfile route, I thought I'd test against a few more remote datasets. It's possible that the length of time is a direct result of where the data is coming from. For instance,

$ time python cchecker.py -t ioos "http://testing.erddap.axds.co/erddap/tabledap/sun2wave_timeseries"

...

real    0m2.121s
user    0m0.828s
sys     0m0.537s

That's reasonable.

This PacIOOS dataset seems to take longer though:

$ time python cchecker.py -t ioos "https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04"

 ...

real    0m23.115s
user    0m1.810s
sys     0m4.256s

In fact, in earlier tries this morning, this dataset took over 4 minutes to load, which is obviously an unacceptable timeframe. 23 seconds is certainly not ideal, but...

I have a hunch this is a mixture of IO and computation. When ERDDAP receives a request for a particular format, it must generate that format at request time. If the data were not cached, ERDDAP has to open all the necessary files, generate the requested format, and then send the bytes over -- much like TDS.

My first try at a SECOORA dataset:

$ time python cchecker.py -t ioos http://erddap.secoora.org/erddap/tabledap/edu_usf_marine_comps_1407d550

...

real    2m15.368s
user    0m1.204s
sys     0m6.769s

bool
"""

if "tabledap" in url:
Contributor:

I'd suggest using url.lower() here just to be safe.

Contributor:

Just use return "tabledap" in url.lower() to simplify the return, since it's returning a boolean.
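Combining both suggestions, the whole body collapses to a one-liner (sketch):

def is_tabledap(url):
    """Heuristic: identify an ERDDAP TableDAP URL (see the caveat below)."""
    return "tabledap" in url.lower()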

Contributor:

You also can't necessarily guarantee "tabledap" will be in the URL, e.g. Apache/Nginx setups. Usually it will though.

ocefpaf (Member) commented Apr 20, 2020:

> Before I went and tested the tempfile route, I thought I'd test against a few more remote datasets. It's possible that the length of time is a direct result of where the data is coming from.

If there is no speed gain, I prefer the approach you have here. Dealing with tempfiles is always troublesome, especially on Windows.


with urllib.request.urlopen(f"{url}.dds") as resp:
strb = io.StringIO(resp.read().decode())

strb.seek(8) # remove "Dataset "
Contributor:

Consider using a context manager/with here.
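For instance, managing both the response and the buffer with context managers would also drop the strb.close() call further down (sketch; URL is illustrative):

import io
import urllib.request

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
with urllib.request.urlopen("{}.dds".format(url)) as resp:
    with io.StringIO(resp.read().decode()) as strb:
        strb.seek(8)  # skip the leading "Dataset " token
        dds_body = strb.read()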

compliance_checker/suite.py
"""

vstr = opendap.create_DAP_variable_str(url) # variable str from DDS
_url = f'{".".join([url, ftype])}?{vstr}'
Contributor:

We want to maintain backwards compatibility with Python 3 versions less than 3.6. For this reason, str.format() should be used instead of f-strings.
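The str.format() equivalent of the quoted line would be (illustrative values added so the snippet stands alone):

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
ftype = "ncCF"
vstr = "time%2Clatitude%2Clongitude"  # URL-encoded variable list
_url = "{}?{}".format(".".join([url, ftype]), vstr)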

strb.close()

# remove beginning and ending braces, split on newlines
lst = list(filter(lambda x: "{" not in x and "}" not in x, x.split("\n")))
Contributor:

Try to express this as a nested loop rather than reassigning to the same variable. If there are issues with the parsing, it makes it much easier to track down the issues because the variable reference isn't changed.
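A loop-based version of the quoted line might read (sketch; the DDS body shown is illustrative):

dds_text = "Dataset {\n  Sequence {\n    Float64 time;\n  } s;\n} s;"

lst = []
for line in dds_text.split("\n"):
    # keep only lines without braces, i.e. the variable declarations
    if "{" not in line and "}" not in line:
        lst.append(line)  # -> ["    Float64 time;"]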
