Implementation of the schema file in all versions of the XML standard name file #470

larsbarring · 2024-03-20T22:32:37Z

This is one in a string of issues that aims to improve the format of the XML version of the standard name table files, see #457 for background and overview.

This particular issue implements the changes introduced by the following issues (and associated PRs):
#500 Standard names: Add "Conventions" string to the standard name xml table header
#509 In exceptional cases allow a standard name to be aliased into two alternatives
#511 Appendix B: New element in XML file header to record the "first published date"
#516 Update the XML format specification in Appendix B to provide a robust link to the XML schema file

By implementing a proper connection between the XML file and its corresponding original XSD file it was easy to pinpoint a few formal XML errors that are easy to correct, and that would remain errors under the updated schema file as well. As these errors in no way influence the material content related to the standard names and their definitions etc., I suggest that they are corrected. These are:

Version 1: <last_modified> DateTime is missing, and is not defined in schema file version 1.0. Add this information.
Version 71: <last_modified> DateTime string is malformed: the time component of the string is missing. Add this information.
Version 12: Exact duplicate of standard name entry sea_surface_height_above_reference_ellipsoid. Remove the duplicate entry.
Versions 17 -- 22: Several standard name entries lack the required tag <description>. Add empty tags.
Versions 20 -- 26: One or several standard name entries lack the required tag <canonical_units>. Add empty tags.
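
As an illustration of the last three corrections, here is a minimal sketch using only the Python standard library. It is not the program used for the actual files; the function name `repair_entries`, the `REQUIRED_TAGS` tuple, and the sample data (including the entry id `hypothetical_string_valued_name`) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: these child tags are the required ones discussed above.
REQUIRED_TAGS = ("canonical_units", "description")


def repair_entries(root):
    """Remove exact duplicate <entry> elements and add empty required tags.

    Returns a list of log messages describing each change.
    """
    seen = set()
    log = []
    for entry in list(root.findall("entry")):
        # Compare serialized forms so only *exact* duplicates are dropped.
        key = ET.tostring(entry).strip()
        if key in seen:
            root.remove(entry)
            log.append(f"removed duplicate entry {entry.get('id')}")
            continue
        seen.add(key)
        for tag in REQUIRED_TAGS:
            if entry.find(tag) is None:
                # Appended last; a real schema may prescribe child order.
                ET.SubElement(entry, tag)
                log.append(f"added empty <{tag}> to {entry.get('id')}")
    return log


# Made-up sample table: one exact duplicate, one entry without units.
SAMPLE = """<standard_name_table>
  <entry id="sea_surface_height_above_reference_ellipsoid">
    <canonical_units>m</canonical_units>
    <description>Example text.</description>
  </entry>
  <entry id="sea_surface_height_above_reference_ellipsoid">
    <canonical_units>m</canonical_units>
    <description>Example text.</description>
  </entry>
  <entry id="hypothetical_string_valued_name">
    <description>Example text.</description>
  </entry>
</standard_name_table>"""

root = ET.fromstring(SAMPLE)
log = repair_entries(root)
for line in log:
    print(line)
```

Returning a log of changes mirrors the idea of keeping a record of every modification, so the result can be checked against the original file afterwards.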


larsbarring · 2024-03-21T12:37:09Z

The changes outlined above can (will) be implemented in all published versions of the standard name XML file by a simple python program.

In a comment @DocOtak suggested that the alias elements should be sorted in alphabetical order according to the aliased standard name. I think this is a good idea that should be easy to implement in the python code.
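
A sketch of what such a sorting step might look like with the standard library is given below. It assumes that "the aliased standard name" is the `id` attribute of each `<alias>` element and that aliases follow the entries in the file; the function name `sort_aliases` and the sample ids are made up for illustration.

```python
import xml.etree.ElementTree as ET


def sort_aliases(root):
    """Reorder <alias> elements alphabetically by their id attribute.

    Assumption: aliases belong after all other children of the root,
    so they are removed and re-appended at the end in sorted order.
    """
    aliases = root.findall("alias")
    for alias in aliases:
        root.remove(alias)
    root.extend(sorted(aliases, key=lambda a: a.get("id", "")))


# Made-up sample with two aliases in non-alphabetical order.
SAMPLE = """<standard_name_table>
  <entry id="new_b"/>
  <alias id="old_c"><entry_id>new_b</entry_id></alias>
  <alias id="old_a"><entry_id>new_b</entry_id></alias>
</standard_name_table>"""

root = ET.fromstring(SAMPLE)
sort_aliases(root)
sorted_ids = [a.get("id") for a in root.findall("alias")]
print(sorted_ids)
```

Note that the whitespace following each element moves with it, so indentation may need to be normalized after sorting.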

JonathanGregory · 2024-03-24T18:55:59Z

Thanks for finding these mistakes. Actually I think you could regard all these as defects, which means correcting them could be treated as a defect issue, though sorting the entries alphabetically would be an enhancement.

Are the standard names with no canonical units stated all string-valued quantities, I wonder? Empty string is fine to give in the xml of the standard name table, but perhaps we should clarify somewhere in Sect 3 of the CF standard that a string-valued quantity isn't required to have a units attribute at all, and the default is null for string-valued quantities, not 1 as for dimensionless numerical quantities. I don't think we say that at present, do we?

JonathanGregory · 2024-03-24T18:58:53Z

Do any versions of the xml need to have their reference to the schema changed? Probably that's in one of the other issues. Sorry I have forgotten.

larsbarring · 2024-03-24T21:31:21Z

To answer your last comment first: yes, all XML files should get the new schema link. In fact this will happen in this issue, or in the associated PR.
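
For illustration only, one way to set a schema link on the root element is sketched below. It assumes the link is expressed with the standard `xsi:noNamespaceSchemaLocation` attribute; the function name `set_schema_link` and the URL are hypothetical and do not reflect the location actually used.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Make the serializer use the conventional "xsi" prefix.
ET.register_namespace("xsi", XSI)


def set_schema_link(root, xsd_location):
    """Set (or overwrite) the schema location hint on the root element."""
    root.set(f"{{{XSI}}}noNamespaceSchemaLocation", xsd_location)


root = ET.fromstring("<standard_name_table/>")
# Hypothetical URL, used here only as a placeholder.
set_schema_link(root, "https://example.org/hypothetical-schema.xsd")
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```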

Irrespective of whether a unit is actually specified or not, the tag <canonical_units>XYZ</canonical_units> (where "XYZ" might be the empty string) has to be present according to all versions of the schema (the old ones as well as the new one). There are many occasions where this is the case, and in early versions there are a few examples where XYZ is string.

JonathanGregory · 2024-03-25T14:11:09Z

I think it's correct to put the null string in the canonical units in the XML file for string-valued quantities. For dimensionless numerical quantities, we should put 1 for the canonical unit (sect 3.3.1).

Do you think I'm right that we need to put some text in sect 3 about units for string-valued coordinates? Obviously that isn't something for this issue to deal with, if so - it's a separate matter.

Regarding cf-convention#469: Just to test the workflow the current XSD link in XML files points to my repo.

larsbarring · 2024-03-26T16:50:59Z

This time I have not thought too much (at all) about the actual units as such because that is not something the XML syntax or XSD schema have influence over, which is what string of issues/PRs deals with. But I do agree that once these fundamental aspects are sorted then we could/should have a closer look at the units and other aspects that are related to the CF compliance as such.


larsbarring · 2024-03-26T18:47:21Z

I am not sure how to do this:

In my fork there is a branch/subdirectory that contains the python code for actually injecting into all versions of the XML files the changes detailed in this issue, and the preceding ones. When running the code the original xml files are kept (as *_SAVED.xml) and the new versions get the usual names. For "historic reasons" there are three pieces, doing different things.
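
The backup convention described here can be sketched as follows. This is not the code in the fork; the function name `process_with_backup`, the file name, and the trivial `transform` used in the demonstration are assumptions for this example.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def process_with_backup(xml_path, transform):
    """Keep the original as *_SAVED.xml and write the result to xml_path."""
    xml_path = Path(xml_path)
    saved = xml_path.with_name(xml_path.stem + "_SAVED.xml")
    # Never overwrite an existing backup, so reruns start from the original.
    if not saved.exists():
        shutil.copy2(xml_path, saved)
    tree = ET.parse(str(saved))
    transform(tree.getroot())
    tree.write(str(xml_path), encoding="UTF-8", xml_declaration=True)
    return saved


# Demonstration in a temporary directory with a made-up, minimal file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cf-standard-name-table.xml"
    target.write_text("<standard_name_table/>", encoding="utf-8")
    # Hypothetical transform: mark the root so the change is visible.
    saved = process_with_backup(target, lambda root: root.set("checked", "yes"))
    original_text = saved.read_text(encoding="utf-8")
    new_root = ET.parse(str(target)).getroot()
```

Always reading from the saved copy means the processing can be repeated without compounding earlier changes.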

Moreover, there are log files detailing the changes made by each step. The XML files themselves are not in the branch because of size considerations. I have spent some time trying to establish that the changes are as intended and do not corrupt any element, but this is not yet conclusive.

Just to get things working, the branch includes the changes suggested in previous issues/PRs. But I am not sure how to proceed from here. Should a PR include the code and other details in the subdirectory linked above? Both the processed files and the original ("*_SAVED") versions are useful for verification, but that doubles the size.

I should also say that I have done the final step by creating new html files, see "next issue" in this string of issues.

Finally, as Andrew @DocOtak suggested, there is the option to sort both the standard name entries and the alias entries. Doing so hampers the possibility of comparing the old and the new files, but it would be useful as a final step, because the aliases in particular are in a more or less random order now.

JonathanGregory · 2024-03-27T12:27:33Z

Dear @larsbarring

I think the PR should replace all the xml and html files in the repo with the new versions. The size of the repo is not a problem; the 1 Gbyte limit refers to total space that the files take up on the website. If I understand correctly, you would be replacing all the xml and html files that appear on the website, but not increasing the number of them. It would also be useful to put the scripts into the repo, for the record.

I agree that sorting the entries, as @DocOtak suggested, is a good idea, but that could be done as a subsequent enhancement. There's no need to sort all the past versions, is there? Maybe there could be a future release of the table which did not change the entries, just put them in order, as a separate step.

@DocOtak also demonstrated how to tag all the versions so that they did not have to be kept on the website as static files. I think this works well for the xml, but GitHub doesn't render the html upon retrieving it. Hence I think we could adopt this approach for xml, which will save a bit less than half the space per release, but we will need to keep the html files on the website. Again, changing the way it's stored should be a subsequent enhancement, I think. It could be done at the same time as moving the standard name table to its own vocabulary repo, if we agree to do that.

Best wishes

Jonathan

larsbarring · 2024-03-27T15:09:06Z

Yes, the PR will replace the existing xml and html files. When all the issues/PRs leading up to this one are completed, I will do a more careful check that nothing odd is happening. I have a fair idea how these checks can be done, but the details are for later. It is here that the eating of the proverbial pudding will happen, and the previous string of issues/PRs will prove its suitability and correctness. While I do not think so, or have any reason to think so, there is always the possibility that something surfaces that requires changes to the previous steps.

I agree with putting the script in the repo (with the caveat that it is not a nice "self-installing" python environment...).

Regarding sorting, I believe Andrew's (@DocOtak's) argument is that when the entries are sorted it is easier to create diffs between versions that are readable and easy to follow, which means that all versions should be sorted. From my perspective this is not difficult, it is just a small change in the code. But if we decide to do it, it should be the final step before publishing.

I agree that the approach Andrew demonstrated in the discussion thread is promising. I will come back to this when I have made a bit more progress on this issue here.

Kind regards,
Lars

larsbarring · 2024-04-19T14:28:41Z

Now that the string of preparatory issues (see table in #457) has been completed, much thanks to @sadielbartholomew's quick input and skilful reviewing of PRs, it is time to activate this issue. In the coming few days I will post overview tables of various "technical/formal" issues in the different versions of the standard name table XML file.

larsbarring · 2024-04-23T16:12:52Z

I have now made some progress on this. All versions of the Standard Name Table xml files have now been processed (locally), and the results have been through some first checks. The results look good in that all versions now follow the same overall format, as specified in Appendix B and in the xsd schema files. Equally important is that the variety of formatting and other issues has been greatly reduced.

~~All details are available in a branch in my fork, where the README.md gives further details about the workflow.~~

~~The .pdf file below gives an overview table of the remaining xml syntax issues, which basically are of two types:~~

In some standard names there was a spurious <space> character. This is neither correct xml syntax, nor does it follow CF requirements. In later versions this was corrected by aliasing the standard name, but a spurious <space> is not allowed in an alias either, which means that the xml syntax error remains.

There is now updated information available in a comment below

In version 26 several standard names were both defined and aliased. This is not accepted xml syntax (for the particular data type used). The details remain to be investigated, as there seem to be some other issues related to this particular version.
In addition, the duplicate entries identified in issue cf-convention/vocabularies#56 remain. Once the definitive details for how to fix this have been established it can easily be handled through these tools.
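
A minimal sketch of how the first two kinds of problem could be detected is shown below. It assumes that both `<entry>` and `<alias>` elements carry the name in an `id` attribute; the function name `find_id_problems` and the sample ids are invented for this example and are not taken from the tools in the branch.

```python
import xml.etree.ElementTree as ET


def find_id_problems(root):
    """Return (ids containing whitespace, ids both defined and aliased)."""
    entry_ids = [e.get("id", "") for e in root.findall("entry")]
    alias_ids = [a.get("id", "") for a in root.findall("alias")]
    # Any whitespace character inside a name counts as spurious.
    with_space = sorted(
        {i for i in entry_ids + alias_ids if any(c.isspace() for c in i)}
    )
    # A name appearing as both an entry and an alias is flagged.
    both = sorted(set(entry_ids) & set(alias_ids))
    return with_space, both


# Made-up sample: one id with a space, one id both defined and aliased.
SAMPLE = """<standard_name_table>
  <entry id="name_one"/>
  <entry id="name two"/>
  <alias id="name_one"><entry_id>name_three</entry_id></alias>
</standard_name_table>"""

root = ET.fromstring(SAMPLE)
with_space, both = find_id_problems(root)
print(with_space, both)
```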

~~Remaining xml syntax issues (.pdf)~~

larsbarring · 2024-04-24T12:00:08Z

There is now updated information available in a comment below

In addition to what I wrote in the previous comment there is now also a summary of deviations from CF Conventions requirements:
Deviations from the CF Conventions (.pdf)

~~In summary:~~

~~Typos in the canonical units, usually a space at the wrong place or similar~~
~~Standard names that are discontinued (one or two reappears later)~~
~~Version 26 definition and aliasing~~

~~In particular I think that the discontinued standard names, as well as what is going on in version 26, needs to be clarified.~~

~~ping @efisher008, @japamment, @davidhassell, @JonathanGregory~~

larsbarring · 2024-05-23T15:25:24Z

All the format changes leading up to this issue have been successfully implemented in the newly published version 85 of the standard name table. Excellent work @japamment, @efisher008, @feggleton! :-))

JonathanGregory · 2024-05-23T15:57:06Z

That's great, and thank you @larsbarring as well. Jonathan

larsbarring · 2024-05-29T09:19:30Z

Given the recent resolution of how to deal with standard names having a spurious space, I have updated my branch that includes some python code to implement the changes to the old versions of the table. Consequently, I have updated a couple of earlier comments in this thread.

While the changes to the published tables should be made by @japamment and @efisher008 to keep the CEDA Vocabulary Editor in sync, this repo includes log files and error summaries that could help in this work. Here are two pointers to help navigating the repo:

All tools and documentation are located in the ISSUE-470-TOOLS subdirectory.
There is a README file.

After the changes have been implemented only very few formal XML errors remain (see here), and these need to be investigated in more detail.

larsbarring added the enhancement Enhancements to the website's presentation or contents label Mar 20, 2024

larsbarring mentioned this issue Mar 20, 2024

Harmonize and improve XSD Schema files and their link to the XML standard name table files #457

Open

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 25, 2024

Added tools (python, bash) for processing XML files (cf-convention#470)

5d5dd8a

Regarding cf-convention#469: Just to test the workflow the current XSD link in XML files points to my repo.

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 25, 2024

Logs from running the tools (cf-convention#470)

6798300

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024

Added tools (python, bash) for processing XML files (cf-convention#470)

a169b8a

Regarding cf-convention#469: Just to test the workflow the current XSD link in XML files points to my repo.

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024

Logs from running the tools (cf-convention#470)

bc73ad9

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024

Added tools (python, bash) for processing XML files (cf-convention#470)

4a0877b

Regarding cf-convention#469: Just to test the workflow the current XSD link in XML files points to my repo.

larsbarring added a commit to larsbarring/cf-convention.github.io that referenced this issue Mar 26, 2024

Logs from running the tools (cf-convention#470)

77aafe0

larsbarring mentioned this issue Mar 26, 2024

Update the XML format specification in Appendix B to provide a robust link to the XML schema file cf-convention/cf-conventions#516

Closed

larsbarring mentioned this issue Apr 6, 2024

Harmonise content of the schema definition files #459

Closed

This was referenced Apr 18, 2024

Publication of the standard name table XML schema file on the website #469

Closed

Minor update to the newly added XML schema file #481

Closed

larsbarring mentioned this issue Apr 23, 2024

Standard names: drainage_amount_through_base_of_soil_model duplicated in final XML file at website cf-convention/vocabularies#56

Open

larsbarring mentioned this issue May 31, 2024

How to deal with standard names having a <space> character cf-convention/vocabularies#7

Closed
