Enable ncCF format requests to TableDAP #799

Closed
wants to merge 1 commit

Conversation

daltonkell (Contributor)

ERDDAP TableDAP Requests

When making a request to TableDAP, this change ensures the .ncCF (contiguous ragged array) format is returned. For the data to be returned, the full URL must be generated, listing all of the variables as the TableDAP query. The binary data is fetched from the server and then instantiated as a netCDF4.Dataset representation, allowing proper checking.
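A rough sketch of that flow (illustrative only: the helper name is hypothetical, and the in-memory open assumes a netCDF-C build with in-memory support):

import urllib.parse
import urllib.request

import netCDF4


def open_tabledap_as_nccf(url, variables):
    """Fetch a TableDAP dataset as .ncCF and open it for checking.

    `variables` is the full variable list parsed from the DDS.
    """
    # build the full query URL: <base>.ncCF?var1,var2,...
    query = urllib.parse.quote(",".join(variables))
    full_url = "{}.ncCF?{}".format(url, query)
    # fetch the raw netCDF bytes and open them without touching disk
    with urllib.request.urlopen(full_url) as resp:
        return netCDF4.Dataset("tabledap.nc", mode="r", memory=resp.read())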

Users are only required to supply the base URL to the dataset, e.g.

$ compliance-checker -t ioos "http://data.glos.us/erddap/tabledap/glerlwe2"

This commit addresses #798.

It should be noted that this approach takes a significant amount of time to fetch the data, far longer than downloading the file and running the checker on a local file.

Unit tests are forthcoming after we deliberate on this approach.

ocefpaf (Member) left a comment:

LGTM! I wish we had a better solution at the ERDDAP server level, but re-reading the Google Group thread I don't think that is possible.

PS: I made some minor comments that are not too important.

io.BytesIO buffer object
"""

vstr = opendap.create_DAP_variable_str(url) # variable str from DDS
Member:

Minor comment: you called vstr varstr in the create_DAP_variable_str function; it could help others reading your code to use the same name here. Also, your variable names are too short 😄

Contributor (author):

Guilty as charged... I might be getting influenced by too many scientists 😆

Member:

Wait! I'm a scientist!! Am I doing my job wrong? ;-p

Contributor:

You're cut from a different cloth, Felipe!


# encode as proper URL characters
varstr = urllib.parse.quote(",".join(lst))
return varstr
Member:

I wonder if there is an ERDDAP trick to request the JSON without any data; then this could simply be a request to get the columnNames.
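One data-free route might be ERDDAP's info service, which serves dataset metadata as JSON. A sketch, assuming the standard /erddap/info/<datasetID>/index.json layout:

import json
import urllib.request


def erddap_variable_names(tabledap_url):
    """Sketch: pull variable names from ERDDAP's metadata-only info service."""
    base, dataset_id = tabledap_url.rsplit("/tabledap/", 1)
    info_url = "{}/info/{}/index.json".format(base, dataset_id)
    with urllib.request.urlopen(info_url) as resp:
        table = json.loads(resp.read().decode())["table"]
    # each row is [rowType, variableName, attributeName, dataType, value]
    return [row[1] for row in table["rows"] if row[0] == "variable"]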

@@ -723,7 +723,13 @@ def load_remote_dataset(self, ds_str):
:param str ds_str: URL to the remote resource
'''

if opendap.is_opendap(ds_str):
if erddap.is_tabledap(ds_str):
return Dataset(
Member:

If this is slower than downloading the file and running the checks locally, as you say, could something like a temporary file be a faster alternative (I did not run any tests)?

Downloading the file locally and removing it when done could be something like:

from contextlib import contextmanager
from tempfile import NamedTemporaryFile
from typing import Generator


@contextmanager
def tempnc(data: bytes) -> Generator[str, None, None]:
    tmp = None
    try:
        tmp = NamedTemporaryFile(suffix=".nc", prefix="compliance-checker_")
        tmp.write(data)  # data is the raw bytes of the fetched file
        tmp.flush()
        yield tmp.name
    finally:
        if tmp is not None:
            tmp.close()

Member:

Do we need some way to limit the amount of data requested? The goal of the check is really to check the dataset structure, not so much the contents of the whole dataset. I'm not really sure how this part of CC works, but is there a row limit or something that can be included in the ERDDAP request, if we're not doing that already?

ocefpaf (Member), Apr 20, 2020:

There are a few tests that require the data/coordinates, but a .cdl-only test (metadata only) for online use is something I've been wanting for a long time. Even though that would be a partial compliance test, I believe it covers 90% of what we need. However, do we have an ERDDAP response that returns only the metadata?

Member:

That's true, some tests do need to check the data. Not sure about this one, though; it's mostly attribute-only, with some dimension checks like the one here. I may be forgetting some things.

There's the .das response, for example: http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.das.

This is what Bob pointed me to as most useful for checking dataset metadata.

Contributor (author):

With the OPeNDAP protocol, dimension information is typically found in the DDS (the .dds response), and it looks something like this:

...
    Float64 mask_v[eta_v = 290][xi_v = 332];
    Float64 mask_psi[eta_psi = 290][xi_psi = 331];
    Float32 zeta[time = 762][eta_rho = 291][xi_rho = 332];
    Float32 u[time = 762][s_rho = 20][eta_u = 291][xi_u = 331];
    Float32 v[time = 762][s_rho = 20][eta_v = 290][xi_v = 332];
...

However, a request to an ERDDAP source doesn't bring back the same structure:

dalton@ubuntu:~$ curl -L "http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.dds"                                               
Dataset {
  Sequence {
    Float64 time;
    Float64 latitude;
    Float64 longitude;
    Float64 z;
    Float64 sea_water_velocity_to_direction;
    Int32 sea_water_velocity_to_direction_qc_agg;
    String sea_water_velocity_to_direction_qc_tests;
    Float64 sea_water_speed;
    Int32 sea_water_speed_qc_agg;
    String sea_water_speed_qc_tests;
    Float64 sea_water_temperature;
    Int32 sea_water_temperature_qc_agg;
    String sea_water_temperature_qc_tests;
    Float64 peak_wave_period;
    Int32 peak_wave_period_qc_agg;
    String peak_wave_period_qc_tests;
    Float64 sea_surface_wave_significant_height;
    Int32 sea_surface_wave_significant_height_qc_agg;
    String sea_surface_wave_significant_height_qc_tests;
    Float64 sea_surface_wave_from_direction;
    Int32 sea_surface_wave_from_direction_qc_agg;
    String sea_surface_wave_from_direction_qc_tests;
    String station;
  } s;
} s;

I'll dig around in the documentation to see if that's a possibility.

For the time being, the logic in this commit produces the desired result of requesting the .ncCF format.

Member:

Oh, yeah. Duh, I've been staring at those forever, should have thought of that.

As for the .das response, the reason it may not show dimensions like other DAP servers do is that the different ERDDAP output formats have different dimensionality, so it can't account for each of them.

Contributor:

@ocefpaf: there's also the option to read a netCDF file into memory. It should probably be preferred over writing to a tempfile when possible and when the netCDF-C library has been compiled with in-memory support.
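In netCDF4-python, that path reads roughly as follows; the memory keyword is real library API, but it only works when the underlying netCDF-C was built with in-memory support (URL is illustrative):

import urllib.request

import netCDF4

# fetch the raw bytes of the netCDF file
with urllib.request.urlopen("http://data.glos.us/erddap/tabledap/glerlwe2.ncCF") as resp:
    nc_bytes = resp.read()

# the filename is only a label here; nothing is written to disk
ds = netCDF4.Dataset("in-memory.nc", mode="r", memory=nc_bytes)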

Member:

Indeed. But if we are aiming for a broader audience, we should not rely on an option that may not be available, right?

PS: I believe that the conda-forge pkg is compiled with in-memory support but I'm not 100% sure. I'll check and fix it if not.
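A runtime probe could settle that without relying on build flags. A sketch: the 32 probe bytes form a minimal, empty classic-format netCDF file, and the exception types are a guess at what an unsupported build would raise:

import netCDF4


def has_inmemory_support():
    """Probe whether the linked netCDF-C can open bytes in memory."""
    # magic "CDF\x01" followed by zeroed numrecs/dim/gatt/var lists
    probe = b"CDF\x01" + b"\x00" * 28
    try:
        ds = netCDF4.Dataset("probe.nc", mode="r", memory=probe)
        ds.close()
        return True
    except (OSError, RuntimeError):
        return False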

Member:

@daltonkell where do we stand with merging this PR and resolving this dimension check issue? Interested to include this in the next RC, if possible.

Can we implement the dimension check, and maybe most or all of the IOOS checks, by requesting the .ncCFHeader response instead of the full .ncCF output? For some ERDDAP datasets, if there aren't limits placed on the request, any of the .nc, .ncCF, .ncCFMA, or even .csv output types could end up requesting a lot of data that could contribute to poor performance.

That may be too big a change though. If so, let's go with .ncCF and see what the results are for the next RC.
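For reference, the header-only response is just another file-type suffix on the same dataset URL, so a metadata-only fetch might look like (sketch; URL is illustrative):

import urllib.request

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
# .ncCFHeader returns the file header (CDL-style text) with no data rows
with urllib.request.urlopen("{}.ncCFHeader".format(url)) as resp:
    header_text = resp.read().decode()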

Contributor (author):

I think including this in the next RC is appropriate, but we'll have to leave the .ncCFHeader request out for another edition. It's a great idea and perhaps we can get some good contributions for it, but it requires finding a way to "switch off" the checks which examine data (which would obviously fail).

I'm currently working on merging concepts from this PR and Ben's latest, #800, because his PR implements a useful abstraction for handling any remote netCDF resource.

daltonkell (Contributor, author):

@ocefpaf The JSON and tempfile comments are very intriguing, I'll try to write some stuff up for that to see if it's feasible. Thanks for the feedback!

daltonkell (Contributor, author) commented Apr 20, 2020:

Update

Before I went and tested the tempfile route, I thought I'd test against a few more remote datasets. It's possible that the length of time is a direct result of where the data is coming from. For instance,

$ time python cchecker.py -t ioos "http://testing.erddap.axds.co/erddap/tabledap/sun2wave_timeseries"

...

real    0m2.121s
user    0m0.828s
sys     0m0.537s

That's reasonable.

This PacIOOS dataset seems to take longer though:

$ time python cchecker.py -t ioos "https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04"

 ...

real    0m23.115s
user    0m1.810s
sys     0m4.256s

In fact, in earlier tries this morning, this dataset took over 4 minutes to load, which is obviously an unacceptable timeframe. 23 seconds is certainly not ideal, but...

I have a hunch this is a mixture of IO and computation. When ERDDAP receives a request for a particular format, it must generate that format at request time. If the data were not cached, ERDDAP has to open all the necessary files, generate the requested format, and then send the bytes over -- much like TDS.

My first try at a SECOORA dataset:

$ time python cchecker.py -t ioos http://erddap.secoora.org/erddap/tabledap/edu_usf_marine_comps_1407d550

...

real    2m15.368s
user    0m1.204s
sys     0m6.769s

bool
"""

if "tabledap" in url:
Contributor:

I'd suggest using url.lower() here just to be safe.

Contributor:

Just use return "tabledap" in url.lower() to simplify the return, since it's returning a boolean.
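Combining both suggestions, the whole body collapses to a one-liner (sketch):

def is_tabledap(url):
    """Heuristic: identify an ERDDAP TableDAP URL (see the caveat below)."""
    return "tabledap" in url.lower()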

Contributor:

You also can't necessarily guarantee "tabledap" will be in the URL, e.g. Apache/Nginx setups. Usually it will though.

ocefpaf (Member) commented Apr 20, 2020:

> Before I went and tested the tempfile route, I thought I'd test against a few more remote datasets. It's possible that the length of time is a direct result of where the data is coming from.

If there is no speed gain, I prefer the approach you have here. Dealing with tempfiles is always troublesome, especially on Windows.


with urllib.request.urlopen(f"{url}.dds") as resp:
strb = io.StringIO(resp.read().decode())

strb.seek(8) # remove "Dataset "
Contributor:

Consider using a context manager/with here.
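For instance, managing both the response and the buffer with context managers would also drop the strb.close() call further down (sketch; URL is illustrative):

import io
import urllib.request

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
with urllib.request.urlopen("{}.dds".format(url)) as resp:
    with io.StringIO(resp.read().decode()) as strb:
        strb.seek(8)  # skip the leading "Dataset " token
        dds_body = strb.read()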

compliance_checker/suite.py
"""

vstr = opendap.create_DAP_variable_str(url) # variable str from DDS
_url = f'{".".join([url, ftype])}?{vstr}'
Contributor:

We want to maintain backwards compatibility with Python 3 versions less than 3.6. For this reason, str.format() should be used instead of f-strings.
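The str.format() equivalent of the quoted line would be (illustrative values added so the snippet stands alone):

url = "http://data.glos.us/erddap/tabledap/glerlwe2"
ftype = "ncCF"
vstr = "time%2Clatitude%2Clongitude"  # URL-encoded variable list
_url = "{}?{}".format(".".join([url, ftype]), vstr)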

strb.close()

# remove beginning and ending braces, split on newlines
lst = list(filter(lambda x: "{" not in x and "}" not in x, x.split("\n")))
Contributor:

Try to express this as a nested loop rather than reassigning to the same variable. If there are issues with the parsing, it makes it much easier to track down the issues because the variable reference isn't changed.
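A loop-based version of the quoted line might read (sketch; the DDS body shown is illustrative):

dds_text = "Dataset {\n  Sequence {\n    Float64 time;\n  } s;\n} s;"

lst = []
for line in dds_text.split("\n"):
    # keep only lines without braces, i.e. the variable declarations
    if "{" not in line and "}" not in line:
        lst.append(line)  # -> ["    Float64 time;"]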
