FEAT Parse model cards with pandoc #257

BenjaminBossan · 2022-12-16T16:09:30Z

As discussed here:

For maintainers: No need to review anything, I just updated this branch after merging the alternative model card implementation and pointed the PR to the main repo instead of my fork.

Description

This feature adds a new function, skops.card.parse_modelcard. When passing it the path to a dumped model card, it parses it using pandoc and returns a Card object, which can be further modified by the user.

Example:

numpy as np
sklearn.linear_model import LinearRegression
skops.card import Card

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
regr = LinearRegression().fit(X, y)
card = Card(regr)
card.save("README.md")

# later, load the card again
skops.card import parse_card

parsed_card = parse_modelcard("README.md")

# continue editing the card
parsed_card.add(**{"My new section": "My new content"})

# overwrite old card with new one
parsed_card.save("README.md")

There is a test that shows that cards with tables and plots are already possible to be parsed, reproducing the original content 1:1, with the exception of line breaks. The latter should generally be conserved, but there are a few cases where there is one too many, but IMO it doesn't matter.

Comment

In the end, this turned out easier than I initially thought it would. The main difficulty are the data structures returned by the pandoc parser, for which I couldn't find any documentation. I guess Haskell code is just so self-documenting it's not needed.

For this reason (and out of laziness), there are probably quite a few edge cases that I haven't covered yet. Just as an example, when parsing tables, pandoc tells us how the columns are aligned. This information is currently completely discarded (we let tabulate choose the alignment). If we want to preserve the table alignment, we would need to make some changes.

I think, however, that it's fine not to support everything out of the box. We can incrementally add support for the rest as we collect more experience.

Implementation

This feature requires the alternative card implementation from #203

pandoc is used for the following reasons:

widely used and thus battle tested
can read many other formats, not just markdown, so in theory, we should be able to read, e.g., rst model cards without modifying any code

The disadvantage is that pandoc is not a Python package, so users need to install it separately. But it is available on all common platforms.

For calling pandoc, I chose to shell out using subprocess. I think this should be fine but LMK if there is a better way.

There is a Python package that binds pandoc (https://github.com/boisgera/pandoc) but I don't think it's worth it for us to add it, just to avoid shelling out. The package seems to have low adoption and contains a bunch of stuff we don't need.

I chose to implement this feature such that the parser that generates the Card object should not have to know anything about markdown. Everything related to markdown is moved to a separate class in _markup.py which could be extended in the future to support other markup languages.

In an ideal world, we would not have to know anything about markdown at all. Instead the Card object should have methods (similar to what we already have for add_table etc.) that handles all special cases in a markup-language agnostic way. But in practice, this is far from being true. E.g. if a user wants to add bold text, there is no special method for it, so they would need to add raw markdown. The Card class is thus a leaky abstraction, which requires the parsing to directly deal with markdown.

TODOs

This PR is not finished. Remaining TODOs that come to mind:

~~Documentation has to be updated in several places~~
Possibly add more unit tests -- right now, it is only checked that parsed and original card markdown are (almost) identical
~~CI needs to install pandoc so that the tests are actually run~~ pandoc is installed in CI for the ubuntu built, not windows or mac -- IMO that's enough.
~~There are some specifics here that won't work with all Python versions, like the use of TypedDict~~
~~Clean up a couple of TODO comments, docstrings, and asserts~~
Do we want a card.load method that uses pandoc parsing?

This is a suggestion to address the issues discussed on skops-dev#72 Description The proposed model card implementation would allow to dynamically add sections or overwrite them. This is not a complete implementation but already covers most of the features we already have and then some. On top of these features, it would be possible to add more features like creating a default Card with placeholders, just like the exisint template, or the possibility to delete existing sections or to retrieve the result of a certain section. Implementation The underlying data structure consists of a dict and a Section dataclass. All data is stored in a _data attribute with the type dict[str, Section]. The dataclass hold the section contents, i.e. the section title, the section content, and subsections, which again have the same type. It's thus recursive data structure. Section title and dict key are identical, which is mostly for convenience. With this refactor, there are no separate data containers anymore for eval results, template sections, extra sections, etc. They are all treated the same. IMHO, this greatly simplifies the code overall. The only complex function that's left is the one needed to traverse the tree holding the data, and even that is just 14 LOC. Demo To see how the new class can be used, take a look at the main function. The resulting Card can be seen here: https://huggingface.co/skops-ci/hf_hub_example-fcc0d6fe-d072-4f94-8fdb-6bf3bb917bca

Added a test that shows that the new card produces the same output as the old card (except for a few non-deterministic parts). This includes most of the idiosyncrasies of the old card we might want to change in the future (e.g. inconsistent capitalization, use of empty lines). Some of the more problematic behaviors of the old card class were, however, fixed (e.g. creating an empty metrics table when there are no metrics). The other tests have been reworked to use the new card features to make them more precise. Often, that means that instead of having a very weak test like "assert 'foo' in card.render()", it is now possible to select the exact section and check that it equals the expected output. This work is still unfinished, specifically it still lacks tests for the card repr and for the newly added features.

Some refactoring to clean up things, rework repr, make repr tests pass.

Also some better type annotations.

- Remove noise from docstring example - Add the comma after model repr - Add docstrings to private methods

Users can now choose to use no template, skops template, hub template, or their own template. Using their own template disables a lot of prefilling (say, putting the model plot in the card) because we wouldn't know where to put it. Users will need to call card.add for the otherwise prefilled sections.

This can be useful, because otherwise it takes a bit of effort to retrieve the latest section.

It's ugly, but there is no technical reason from prohibiting the addition of tables without rows. (Note, columns are still required). This allows us to use TableSection for formatting the metrics, instead of calling tabulate there directly. This is better, since we don't have 2 separate ways of creating metrics.

As discussed here: skops-dev#72 (comment) Description This feature adds a new function, skops.card.parse_modelcard. When passing it the path to a dumped model card, it parses it using pandoc and returns a Card object, which can be further modified by the user. In the end, this turned out easier than I initially thought it would. The main difficulty are the data structures returned by the pandoc parser, for which I couldn't find any documentation. I guess Haskell code is just self-documenting. For this reason, there are probably quite a few edge cases that I haven't covered yet. Just as an example, when parsing tables, pandoc tells us how the columns are aligned. This information is currently completely discarded (we let tabulate choose the alignment). If we want to preserve the table alignment, we would need to make some changes Implementation This feature requires the alternative card implementation from skops-dev#203 pandoc is used for the following reasons: - widely used and thus battle tested - can read many other formats, not just markdown, so in theory, we should be able to read, e.g., rst model cards without modifying any code The disadvantage is that pandoc is not a Python package, so users need to install it separately. But it is available on all common platforms. For calling pandoc, I chose to shell out using subprocess. I think this should be fine but LMK if there is a better way. There is a Python package that binds pandoc (https://github.com/boisgera/pandoc) but I don't think it's worth it for us to add it, just to avoid shelling out. The package seems to have low adoption and contains a bunch of stuff we don't need. I chose to implement this such that the parser that generates the Card object should not have to know anything about Markdown. Everything related to Markdown is moved to a separate class in _markup.py. In an ideal world, we would not have to know anything about markdown either. Instead the Card object shoud have methods (similar to what we already have for add_plot etc.) that handles all of that. But in practice, this is far from being true. E.g. if a user wants to add bold text, there is no special method for it, so they would need to add raw Markdown. The Card class is thus a leaky abstraction. TODOs This PR is not finished. Remaining TODOs that come to mind: 1. We need to merge the alternative card implementation 2. Documentation has to be updated in several places 3. Tests need to be more complex, right now only one Card is tested 4. CI needs to install pandoc so that the tests are actually run 5. There are some specifics here that won't work with all Python versions, like the use of TypedDict.

…-cards-pandoc

Most model cards have a yaml section at the top. It is now detached before parsing with pandoc, then re-added afterwards. Add a test with the model card from bert-base-uncased. It still fails with some minor issues at the moment (most notably table column alignment).

Now supports: - Space - Strong - Emph - Strikeout - Subscript - Superscript - Plain - Str - RawInline - RawBlock - SoftBreak - LineBreak - Para - Header - Image - CodeBlock - Code - Table - Div - Link - BulletList - Quoted

I added 5 model cards from the top 10 most used models from the Hub (I excluded cards that were too similar to one another). The tests were rewritten so that they should now pass. There are some limitations to the parser that results in the generated cards not being 100% identical. Those limiations are now documented. However, I don't believe that those limitations matter, as they make no semantic difference but rather are stylistic or even invisible. The most notable difference is the alignment of columns in tables. The tests pass despite those differences because I rewrote them to include a diff file for each model card. When the generated card is compared to the original one, a diff is created and compared to the checked in diff. This way, we have control over what diff we permit. I had to exclude the folder containing the cards and diffs from the pre-commit task "trailing-whitespace", as we need the trailing whitespace in there.

There was a change in pandoc, this now works with the latest pandoc version.

For some reason, the order of items in the metainfo is no longer stable. Therefore, the tests comparing the parsed card vs original card failed. Now metainfo is excluded when comparing the cards. The metainfo is now checked separately, in a way that disregards the order.

It contained remnants from an old test that has been removed since then.

BenjaminBossan · 2023-01-17T16:49:20Z

@skops-dev/maintainers I think this is now ready for review. For the full description, please look at the original comment.

More as a meta comment, I know this will probably not be an easy review. However, I think since this is more of a "side feature" and doesn't touch the skops "core", it is okay if it isn't perfect. There should be no problem making changes later on without breaking BC, e.g. by supporting a bigger subset of markdown features.

The testing is done in a kinda simple fashion, where I take a model card markdown, parse it, dump it again into markdown, then compare the diff to the original card. If the diff corresponds to the expected diff, the test passes. The test module should hopefully contain enough comments to explain these steps.

Was added to v0.4 but that version is already released.

adrinjalali

Just a few nits, otherwise I think this is good to go! Awesome work!

.github/workflows/build-test.yml

docs/model_card.rst

skops/card/_markup.py

adrinjalali · 2023-01-19T16:20:21Z

skops/card/_parser.py

+            # not illegal in markdown, but we don't handle it yet.
+            raise ValueError(
+                "Trying to add content but there is no current section, "
+                "this is probably a bug, please open an issue on GitHub"


Suggested change

"this is probably a bug, please open an issue on GitHub"

"this is probably a bug, please open an issue on GitHub."

not sure why you like to miss the dots at the end of exception messages :D

Not sure, probably because Python error messages also omit the period when it's only one sentence?

adrinjalali · 2023-01-19T16:27:46Z

skops/card/_parser.py

+def _check_version_greater_equal(
+    version: Sequence[int], min_version: Sequence[int]
+) -> None:
+    """Very bad version comparison function to ensure that the first version is


why not use packaging.version instead?

We had a similar discussion in skorch: skorch-dev/skorch#887

We decided not to use packaging because we didn't want to have a dependency on it. It seems that in skops, it is available (probably a transient dependency), but it's not an explicit requirement. If you are okay with explicitly depending on packaging, I can make the change.

setuptools depends on it, I don't think packaging is a heavy dependency, I'm happy to include it as an explicit dep.

BenjaminBossan

@adrinjalali I addressed your comments, pls review again.

.github/workflows/build-test.yml

docs/model_card.rst

BenjaminBossan · 2023-01-20T10:38:02Z

skops/card/_parser.py

+            # not illegal in markdown, but we don't handle it yet.
+            raise ValueError(
+                "Trying to add content but there is no current section, "
+                "this is probably a bug, please open an issue on GitHub"


Not sure, probably because Python error messages also omit the period when it's only one sentence?

BenjaminBossan · 2023-01-20T10:45:54Z

skops/card/_parser.py

+def _check_version_greater_equal(
+    version: Sequence[int], min_version: Sequence[int]
+) -> None:
+    """Very bad version comparison function to ensure that the first version is


We had a similar discussion in skorch: skorch-dev/skorch#887

We decided not to use packaging because we didn't want to have a dependency on it. It seems that in skops, it is available (probably a transient dependency), but it's not an explicit requirement. If you are okay with explicitly depending on packaging, I can make the change.

Add packaging as an explicit dependency. Kinda arbitrarily set min version to be 17.0, which was released 5 years ago.

BenjaminBossan · 2023-01-20T14:20:59Z

@adrinjalali I went with your apt install suggestion, but that has a very old version of pandoc (2 years), which creates problems with the parse results. I could try to make it work, but honestly I don't think it's worth the effort. There might be ppa's with newer pandoc versions, but at that point we could as well revert back to install from github releases, as is recommended, no?

(btw, yesterday pandoc 3.0 was released, I hope there is no breakage D:)

adrinjalali · 2023-01-23T11:37:15Z

So what I'm concerned about, is that when users are recommended to install a .deb file, then the package doesn't get upgraded. That's why installing from a ppa would be much preferred.

But I guess we don't want to deal with that, and that only applies to ubuntu. So I'm fine with going back to what you had in the CI.

Apt repo is very outdated.

BenjaminBossan · 2023-01-23T13:08:10Z

So what I'm concerned about, is that when users are recommended to install a .deb file, then the package doesn't get upgraded. That's why installing from a ppa would be much preferred.

Yes, I totally agree. However, I feel this is more of an issue with pandoc distribution in general and not so much for us. For CI, I think we can live with the deb solution.

We could consider recommending a different way of installing pandoc in our docs, but I'm not sure if we should deviate from the officially recommended method.

Figure was added in Pandoc v3.0 and is more complex than Image, but also more powerful. At the moment, we treat Figure just as Image.

adrinjalali

WDYT of putting # pragma: no cover for lines which are not supposed to be covered?

Otherwise LGTM.

adrinjalali · 2023-01-23T14:17:25Z

skops/card/_markup.py

+        content = f"![{caption}]({dest})"
+        return content
+
+    def _figure(self, value) -> str:


seems like the whole _figure is not tested.

Yep, Figure is a new type that was added in pandoc 3.0. I didn't want to switch to 3.0 for CI because it is freshly released, so probably almost no one is using it. And having pandoc 2.x + 3.x in the CI matrix seemed like overkill to me.

BenjaminBossan · 2023-01-23T17:06:09Z

WDYT of putting # pragma: no cover for lines which are not supposed to be covered?

In general, that SGTM.

Just for me understanding: The idea would be that if we know that a line is not covered, we add this so that codecov et al don't complain about this line, and in order for us to signal that we are aware that this line is not tested? Would that be for all of skops or for this feature specifically?

I think there is a danger of using the pragma: IIUC, if we do end up covering the line (even if unintentionally) by adding a test, we would not get an error that the line is covered now even though we "claim" it's not covered. Then, if someone else comes and, say, modifies the test so that the line no longer is covered, we would not get a warning about decreased coverage, right? So I feel like using the pramga would weaken the reason of tracking code coverage in the first place.

(This is on contrast, e.g., to using pytest with xfail, were one can indicate that the xfail is strict and thus it's hard to unintentionally make a test pass without noticing.)

adrinjalali · 2023-01-24T09:25:12Z

If we write a test that covers those lines, we'd know, right? They're mostly raising exceptions which we'd be catching in the tests, and therefore we'd be removing the pragma.

To me this is very different than xfail. This is about us not necessarily caring about a line being covered, or that we know it's not covered.

The point of codecov is not to make sure every single line is covered. It's about us knowing if things which we should be testing are tested or not, and it's doing that for us.

Those are things that don't really make sense to be tested.

- test error when trying to use markup != markdown - test no version check for pandoc - test min pandoc version too low - test pandoc not installed - test pandoc version cannot be determined

BenjaminBossan · 2023-01-24T11:43:04Z

@adrinjalali I added pragmas where it makes sense and tests for all the remaining lines that were not tested.

If we write a test that covers those lines, we'd know, right? They're mostly raising exceptions which we'd be catching in the tests, and therefore we'd be removing the pragma.

Yes, for these specific lines, that's true.

To me this is very different than xfail. This is about us not necessarily caring about a line being covered, or that we know it's not covered.

I only brought that up in the sense that it has the strict option, which would be nice to have for pragma: no cover as well.

BenjaminBossan added 29 commits October 25, 2022 16:47

Make tests pass

166e33d

Some refactoring to clean up things, rework repr, make repr tests pass.

Add tests for new functionalities and docstrings

a90e3a5

Adjust tests to work with older sklearn versions

fcbd0df

Adjust examples to new card, more docs

4f10f80

Also some better type annotations.

Continue fixing tests

0f5df2e

Try fixing Windows error by specifying encoding

6212d67

Adjust doctest: confusion matrix not stored in cwd

80d218d

Increase test coverage

c6c8042

Try fixing test failure on Windows

e7a2ae3

Replace old by new Card implementation

2a6aab0

Merge branch 'main' into alternative-model-card-implementation

ce16019

Address reviewer comments

00b7665

- Remove noise from docstring example - Add the comma after model repr - Add docstrings to private methods

Add TODO notes for when Python 3.7 is dropped

8558f26

Make _add_single return the Section

772a659

This can be useful, because otherwise it takes a bit of effort to retrieve the latest section.

Error when calling add_metric w/ invalid template

26892d6

Merge branch 'alternative-model-card-implementation' into parse-model…

9646103

…-cards-pandoc

Add a bunch of more markup support

9744998

Now supports: - Space - Strong - Emph - Strikeout - Subscript - Superscript - Plain - Str - RawInline - RawBlock - SoftBreak - LineBreak - Para - Header - Image - CodeBlock - Code - Table - Div - Link - BulletList - Quoted

Rename Markdown class's methods, make private

5e5c1a3

Merge branch 'main' into parse-model-cards-pandoc

1ff15be

Add version check for pandoc

8979751

New table parsing code

e8d6f61

There was a change in pandoc, this now works with the latest pandoc version.

BenjaminBossan mentioned this pull request Dec 16, 2022

Parser that allows to load dumped model cards BenjaminBossan/skops#2

Closed

BenjaminBossan added 2 commits January 17, 2023 16:40

Fix incorrect link in docs

5da3397

Clean up test_parser.py

2cb998e

It contained remnants from an old test that has been removed since then.

BenjaminBossan marked this pull request as ready for review January 17, 2023 16:44

Move changes entry to v0.5

1922bc1

Was added to v0.4 but that version is already released.

adrinjalali reviewed Jan 19, 2023

View reviewed changes

BenjaminBossan added 2 commits January 20, 2023 11:30

Merge branch 'main' into parse-model-cards-pandoc

e684c78

Reviewer comments: wording, period in err msg

709e079

BenjaminBossan commented Jan 20, 2023

View reviewed changes

BenjaminBossan added 4 commits January 20, 2023 14:34

Use packaging Version for version comparison

08f44c4

Add packaging as an explicit dependency. Kinda arbitrarily set min version to be 17.0, which was released 5 years ago.

Install pandoc with apt in CI

7c416c4

Investigate pandoc version issue

b855e50

Lower min required pandoc version to 2.9.0

2da5ea6

BenjaminBossan added 2 commits January 23, 2023 14:04

Revert pandoc install to use gh + deb

6d1f288

Apt repo is very outdated.

Merge branch 'main' into parse-model-cards-pandoc

d6d8e0a

BenjaminBossan added 2 commits January 23, 2023 14:32

Add Figure type to parsing

fb46a76

Figure was added in Pandoc v3.0 and is more complex than Image, but also more powerful. At the moment, we treat Figure just as Image.

Merge branch 'main' into parse-model-cards-pandoc

04c5756

adrinjalali approved these changes Jan 23, 2023

View reviewed changes

BenjaminBossan added 2 commits January 24, 2023 12:17

Add pragmas for lines that are covered

30d3e5c

Those are things that don't really make sense to be tested.

Add more unit tests, get to 100% coverage

de1837e

- test error when trying to use markup != markdown - test no version check for pandoc - test min pandoc version too low - test pandoc not installed - test pandoc version cannot be determined

Fix incorrect docstring

6472ebf

adrinjalali changed the title ~~Parse model cards pandoc~~ FEAT Parse model cards with pandoc Jan 24, 2023

adrinjalali merged commit a860a12 into skops-dev:main Jan 24, 2023

BenjaminBossan deleted the parse-model-cards-pandoc branch January 24, 2023 13:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEAT Parse model cards with pandoc #257

FEAT Parse model cards with pandoc #257

BenjaminBossan commented Dec 16, 2022 •

edited

Loading

BenjaminBossan commented Jan 17, 2023

adrinjalali left a comment

adrinjalali Jan 19, 2023

BenjaminBossan Jan 20, 2023

adrinjalali Jan 19, 2023

BenjaminBossan Jan 20, 2023

adrinjalali Jan 20, 2023

BenjaminBossan left a comment

BenjaminBossan Jan 20, 2023

BenjaminBossan Jan 20, 2023

BenjaminBossan commented Jan 20, 2023

adrinjalali commented Jan 23, 2023

BenjaminBossan commented Jan 23, 2023

adrinjalali left a comment

adrinjalali Jan 23, 2023

BenjaminBossan Jan 23, 2023

BenjaminBossan commented Jan 23, 2023

adrinjalali commented Jan 24, 2023

BenjaminBossan commented Jan 24, 2023 •

edited

Loading

	"this is probably a bug, please open an issue on GitHub"
	"this is probably a bug, please open an issue on GitHub."

FEAT Parse model cards with pandoc #257

FEAT Parse model cards with pandoc #257

Conversation

BenjaminBossan commented Dec 16, 2022 • edited Loading

Description

Comment

Implementation

TODOs

BenjaminBossan commented Jan 17, 2023

adrinjalali left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BenjaminBossan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BenjaminBossan commented Jan 20, 2023

adrinjalali commented Jan 23, 2023

BenjaminBossan commented Jan 23, 2023

adrinjalali left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BenjaminBossan commented Jan 23, 2023

adrinjalali commented Jan 24, 2023

BenjaminBossan commented Jan 24, 2023 • edited Loading

BenjaminBossan commented Dec 16, 2022 •

edited

Loading

BenjaminBossan commented Jan 24, 2023 •

edited

Loading