Implementation of the schema file in all versions of the XML standard name file #470

Open
larsbarring opened this issue Mar 20, 2024 · 15 comments
Labels
enhancement Enhancements to the website's presentation or contents

Comments

@larsbarring
Contributor

larsbarring commented Mar 20, 2024

This is one in a string of issues that aims to improve the format of the XML version of the standard name table files, see #457 for background and overview.

This particular issue implements the changes introduced by the following issues (and associated PRs):
#500 Standard names: Add "Conventions" string to the standard name xml table header
#509 In exceptional cases allow a standard name to be aliased into two alternatives
#511 Appendix B: New element in XML file header to record the "first published date"
#516 Update the XML format specification in Appendix B to provide a robust link to the XML schema file

By implementing a proper connection between the XML file and its corresponding original XSD file, it was easy to pinpoint a few formal XML errors. They are easy to correct, and would otherwise remain errors under the updated schema file as well. As these errors in no way influence the material content related to the standard names, their definitions, etc., I suggest that they are corrected. These are:

  • Version 1: <last_modified> DateTime is missing, and is not defined in schema file version 1.0. Add this information.

  • Version 71: <last_modified> DateTime string is malformed: the time component of the string is missing. Add this information.

  • Version 12: Exact duplicate of standard name entry sea_surface_height_above_reference_ellipsoid. Remove the duplicate entry.

  • Versions 17 -- 22: Several standard name entries lack the required tag <description>. Add empty tags.

  • Versions 20 -- 26: One or several standard name entries lack the required tag <canonical_units>. Add empty tags.
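
To illustrate the last three fixes in the list, here is a minimal Python sketch using only the standard library. It is not the script actually used for the tables. The sample XML, the `REQUIRED_TAGS` tuple and the function name `fix_entries` are assumptions made for the example, and the sketch does not enforce any element order that the schema may require.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature table; the real files are far larger.
SAMPLE = """<standard_name_table>
  <entry id="air_pressure">
    <canonical_units>Pa</canonical_units>
    <description>Example text.</description>
  </entry>
  <entry id="region">
  </entry>
  <entry id="air_pressure">
    <canonical_units>Pa</canonical_units>
    <description>Example text.</description>
  </entry>
</standard_name_table>"""

# Tags assumed to be required for every entry (illustrative only).
REQUIRED_TAGS = ("canonical_units", "description")


def fix_entries(root):
    """Drop exact duplicate entries and add empty required tags."""
    seen = set()
    for entry in list(root.findall("entry")):
        # Serialise and strip trailing whitespace so that only the
        # element itself is compared, not the text that follows it.
        key = ET.tostring(entry).strip()
        if key in seen:
            root.remove(entry)
            continue
        seen.add(key)
        for tag in REQUIRED_TAGS:
            if entry.find(tag) is None:
                ET.SubElement(entry, tag)  # empty element, no text
    return root


root = fix_entries(ET.fromstring(SAMPLE))
ids = [e.get("id") for e in root.findall("entry")]
```

Only entries that are identical in every respect are treated as duplicates here; two entries sharing an id but differing in content would both be kept for manual inspection.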

@larsbarring
Contributor Author

The changes outlined above can (will) be implemented in all published versions of the standard name XML file by a simple python program.

In a comment @DocOtak suggested that the alias elements should be sorted in alphabetical order according to the aliased standard name. I think this is a good idea that should be easy to implement in the python code.
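
As a rough idea of how small that change could be, here is a hedged Python sketch. It assumes that alias elements carry the aliased name in an `id` attribute and that it is acceptable for all aliases to follow the entries; the sample data and the function name `sort_aliases` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element and attribute names are assumptions.
SAMPLE = """<standard_name_table>
  <entry id="b_name"><canonical_units>1</canonical_units></entry>
  <alias id="old_z"><entry_id>b_name</entry_id></alias>
  <alias id="old_a"><entry_id>b_name</entry_id></alias>
</standard_name_table>"""


def sort_aliases(root):
    """Re-attach all alias elements in alphabetical order of their id."""
    aliases = root.findall("alias")
    for alias in aliases:
        root.remove(alias)
    # Appending places the sorted aliases after every remaining element.
    for alias in sorted(aliases, key=lambda a: a.get("id", "")):
        root.append(alias)
    return root


root = sort_aliases(ET.fromstring(SAMPLE))
alias_ids = [a.get("id") for a in root.findall("alias")]
```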

@JonathanGregory
Contributor

Thanks for finding these mistakes. Actually I think you could regard all these as defects, which means correcting them could be treated as a defect issue, though sorting the entries alphabetically would be an enhancement.

Are the standard names with no canonical units stated all string-valued quantities, I wonder? Empty string is fine to give in the xml of the standard name table, but perhaps we should clarify somewhere in Sect 3 of the CF standard that a string-valued quantity isn't required to have a units attribute at all, and the default is null for string-valued quantities, not 1 as for dimensionless numerical quantities. I don't think we say that at present, do we?

@JonathanGregory
Contributor

Do any versions of the xml need to have their reference to the schema changed? Probably that's in one of the other issues. Sorry I have forgotten.

@larsbarring
Contributor Author

larsbarring commented Mar 24, 2024

To answer your last comment first: yes, all XML files should get the new schema link. In fact this will happen in this issue, or in the associated PR.
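
For anyone wanting to check which schema a given file points to, a small sketch follows. It assumes the link is expressed with the standard `xsi:noNamespaceSchemaLocation` attribute on the root element, which this thread does not confirm; the URL in the sample is a placeholder, not the real schema location.

```python
import xml.etree.ElementTree as ET

# The XML Schema instance namespace defined by the W3C.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical root element; the URL is a placeholder only.
SAMPLE = (
    '<standard_name_table xmlns:xsi="' + XSI + '" '
    'xsi:noNamespaceSchemaLocation="https://example.org/hypothetical-schema.xsd">'
    "</standard_name_table>"
)


def schema_location(root):
    """Return the schema link of the root element, or None if absent."""
    return root.get("{%s}noNamespaceSchemaLocation" % XSI)


location = schema_location(ET.fromstring(SAMPLE))
missing = schema_location(ET.fromstring("<standard_name_table/>"))
```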

Irrespective of whether there actually is a unit specified or not, the tag <canonical_units>XYZ</canonical_units> (where "XYZ" might be the empty string) has to be present according to all versions of the schema (the old ones, as well as the new one). There are many occasions where this is the case, and there are in early versions a few examples where XYZ is string.

@JonathanGregory
Contributor

I think it's correct to put the null string in the canonical units in the XML file for string-valued quantities. For dimensionless numerical quantities, we should put 1 for the canonical unit (sect 3.3.1).

Do you think I'm right that we need to put some text in sect 3 about units for string-valued coordinates? Obviously that isn't something for this issue to deal with, if so - it's a separate matter.

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 25, 2024
Regarding cf-convention#469:
Just to test the workflow the current XSD link in XML files
points to my repo.
larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 25, 2024
@larsbarring
Contributor Author

larsbarring commented Mar 26, 2024

This time I have not thought too much (at all) about the actual units as such because that is not something the XML syntax or XSD schema have influence over, which is what this string of issues/PRs deals with. But I do agree that once these fundamental aspects are sorted then we could/should have a closer look at the units and other aspects that are related to the CF compliance as such.

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024
Regarding cf-convention#469:
Just to test the workflow the current XSD link in XML files
points to my repo.
larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024
larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024
Regarding cf-convention#469:
Just to test the workflow the current XSD link in XML files
points to my repo.
larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024
@larsbarring
Contributor Author

I am not sure how to do this:

In my fork there is a branch/subdirectory that contains the python code for actually injecting into all versions of the XML files the changes detailed in this issue, and the preceding ones. When running the code the original xml files are kept (as *_SAVED.xml) and the new version gets the usual name. For "historic reasons" there are three pieces, doing different things.
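
The keep-the-original step could look roughly like the sketch below. This is an illustration, not the code in the fork: the function name `save_and_replace`, the file name and the file contents are all hypothetical, and only the `_SAVED.xml` suffix is taken from the description above.

```python
import shutil
import tempfile
from pathlib import Path


def save_and_replace(xml_path, new_text):
    """Keep the original as <name>_SAVED.xml, then write the new content."""
    xml_path = Path(xml_path)
    saved = xml_path.with_name(xml_path.stem + "_SAVED.xml")
    if not saved.exists():  # never overwrite an existing backup
        shutil.copy2(xml_path, saved)
    xml_path.write_text(new_text, encoding="utf-8")
    return saved


# Demonstration in a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cf-standard-name-table.xml"
    target.write_text("<old/>", encoding="utf-8")
    saved = save_and_replace(target, "<new/>")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
    new_text = target.read_text(encoding="utf-8")
```

Refusing to overwrite an existing backup means that running the processing twice still leaves the untouched original available for comparison.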

Moreover, there are log files detailing the changes made by each step. The XML files themselves are not in the branch, because of size considerations. I have spent some time trying to establish that the changes are as intended and do not corrupt any element, but this is not yet conclusive.

Just to get things working, the branch includes the changes suggested in previous issues/PRs. But I am not sure how to proceed from here. Should a PR include the codes and other details in the subdirectory linked above? Both the processed files and the original ("*_SAVED") versions are useful for verification, but that doubles the size.

I should also say that I have done the final step by creating new html files, see "next issue" in this string of issues.

Finally, as Andrew @DocOtak suggested, there is the option to sort both the standard name entries and the alias entries, but this hampers the possibility to compare the old and the new files. It would nevertheless be useful as a final step, because the aliases in particular are in a more or less random order now.

@JonathanGregory
Contributor

Dear @larsbarring

I think the PR should replace all the xml and html files in the repo with the new versions. The size of the repo is not a problem; the 1 Gbyte limit refers to total space that the files take up on the website. If I understand correctly, you would be replacing all the xml and html files that appear on the website, but not increasing the number of them. It would also be useful to put the scripts into the repo, for the record.

I agree that sorting the entries, as @DocOtak suggested, is a good idea, but that could be done as a subsequent enhancement. There's no need to sort all the past versions, is there? Maybe there could be a future release of the table which did not change the entries, just put them in order, as a separate step.

@DocOtak also demonstrated how to tag all the versions so that they did not have to be kept on the website as static files. I think this works well for the xml, but GitHub doesn't render the html upon retrieving it. Hence I think we could adopt this approach for xml, which will save a bit less than half the space per release, but we will need to keep the html files on the website. Again, changing the way it's stored should be a subsequent enhancement, I think. It could be done at the same time as moving the standard name table to its own vocabulary repo, if we agree to do that.

Best wishes

Jonathan

@larsbarring
Contributor Author

Yes, the PR will replace the existing xml and html files. When all the issues/PRs leading up to this one have been completed, I will do a more careful check that something odd is not happening. I have a fair idea how these checks can be done, but the details are for later. It is here where the eating of that proverbial pudding will happen, and the suitability and correctness of the previous string of issues/PRs will prove their worth. While I do not think so, or have any reason to think so, there is always the possibility that something surfaces that requires changes to the previous steps.

I agree with putting the script in the repo (with the caveat that it is not a nice "self-installing" python environment...).

Regarding sorting, I believe Andrew's (@DocOtak's) argument is that when the files are sorted it is easier to create diffs between versions that are readable/easy to follow, which means that all versions should be sorted. From my perspective this is not difficult, it is just a small change in the code. But if we decide to do it, it should be the final step before publishing.
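
A toy illustration of the diff argument, using made-up names rather than real standard names: the same addition of one name produces a noisy diff when the order differs between versions, and a one-line diff when both versions are sorted. The helper name `diff_lines` is hypothetical.

```python
import difflib

# Two hypothetical versions of a name list; v2 adds "d_name".
v1 = ["b_name", "a_name", "c_name"]
v2 = ["c_name", "d_name", "a_name", "b_name"]


def diff_lines(old, new):
    """Return only the added/removed lines of a unified diff."""
    return [
        line
        for line in difflib.unified_diff(old, new, lineterm="", n=0)
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the file headers
    ]


unsorted_diff = diff_lines(v1, v2)
sorted_diff = diff_lines(sorted(v1), sorted(v2))
```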

I agree that the approach Andrew demonstrated in the discussion thread is promising. I will come back to this when I have made a bit more progress on this issue here.

Kind regards,
Lars

@larsbarring
Contributor Author

Now that the string of preparatory issues (see table in #457) has been completed, much thanks to @sadielbartholomew's quick input and skilful reviewing of PRs, it is time to activate this issue. I will in the coming few days post overview tables of various "technical/formal" issues in the different versions of the standard name table XML file.

@larsbarring
Contributor Author

larsbarring commented Apr 23, 2024

I have now made some progress on this. All versions of the Standard Name Table xml files have now been processed (locally), and the results have been through some first checks. The results look good in that all versions now follow the same overall format, as specified in Appendix B and in the xsd schema files. Equally important is that the variety of formatting and other issues has been greatly reduced.

All details are available in a branch in my fork, where the README.md gives further details about the workflow.

The .pdf file below gives an overview table of the remaining xml syntax issues, which basically are of two types:

  • In some standard names there was a spurious <space> character. This is neither correct xml syntax nor in line with CF requirements. In later versions this was corrected by aliasing the standard name, but a spurious <space> is not allowed in an alias either, which means that the xml syntax error remains.

There is now updated information available in a comment below

  • In version 26 several standard names were both defined and aliased. This is not accepted xml syntax (for the particular data type used). The details remain to be investigated, as there seem to be some other issues related to this particular version.
  • In addition, the duplicate entries identified in issue cf-convention/vocabularies#56 remain. Once the definitive details for how to fix this have been established it can easily be handled through these tools.
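
The first two kinds of issue in this list can be detected mechanically. The sketch below is a hedged illustration, not the tooling in the fork: it assumes that both entry and alias elements carry the name in an `id` attribute, and the sample data and the function name `find_issues` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one id containing a space, one alias id
# containing a space, and one name that is both defined and aliased.
SAMPLE = """<standard_name_table>
  <entry id="good_name"/>
  <entry id="bad name"/>
  <entry id="both_ways"/>
  <alias id="both_ways"><entry_id>good_name</entry_id></alias>
  <alias id="old bad"><entry_id>good_name</entry_id></alias>
</standard_name_table>"""


def find_issues(root):
    """Return (ids containing whitespace, ids both defined and aliased)."""
    entry_ids = [e.get("id", "") for e in root.findall("entry")]
    alias_ids = [a.get("id", "") for a in root.findall("alias")]
    with_space = sorted(
        i for i in entry_ids + alias_ids if any(c.isspace() for c in i)
    )
    both = sorted(set(entry_ids) & set(alias_ids))
    return with_space, both


with_space, both = find_issues(ET.fromstring(SAMPLE))
```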

Remaining xml syntax issues (.pdf)

@larsbarring
Contributor Author

larsbarring commented Apr 24, 2024

There is now updated information available in a comment below

In addition to what I wrote in the previous comment there is now also a summary of deviations from CF Conventions requirements:
Deviations from the CF Conventions (.pdf)

In summary:

  • Typos in the canonical units, usually a space at the wrong place or similar
  • Standard names that are discontinued (one or two reappear later)
  • Version 26 definition and aliasing

In particular I think that the discontinued standard names, as well as what is going on in version 26, need to be clarified.

ping @efisher008, @japamment, @davidhassell, @JonathanGregory

@larsbarring
Contributor Author

All the format changes leading up to this issue have been successfully implemented in the newly published version 85 of the standard name table. Excellent work @japamment, @efisher008, @feggleton! :-))

@JonathanGregory
Contributor

That's great, and thank you @larsbarring as well. Jonathan

@larsbarring
Contributor Author

larsbarring commented May 29, 2024

Given the recent resolution of how to deal with standard names having a spurious space, I have updated my branch that includes some python code to implement the changes to the old versions of the table. Consequently, I have updated a couple of earlier comments in this thread.

While the changes to the published tables should be made by @japamment and @efisher008 to keep the CEDA Vocabulary Editor in sync, this repo includes log files and error summaries that could help in this work. Here are two pointers to help navigating the repo:

After the changes have been implemented only very few formal XML errors remain (see here), and these need to be investigated in more detail.
