Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roundtrip test fails on TK2023 EML files #3

Open
praseodym opened this issue Jan 8, 2024 · 5 comments
Open

Roundtrip test fails on TK2023 EML files #3

praseodym opened this issue Jan 8, 2024 · 5 comments

Comments

@praseodym
Copy link
Contributor

praseodym commented Jan 8, 2024

When running the roundtrip test on the TK2023 EML files, two files fail the test:

  • Telling_TK2023_gemeente_Oisterwijk.eml.xml fails due to timestamp with milliseconds (2023-11-23T19:38:38.000) in the original EML file, whereas the milliseconds part (.000) is not in the re-serialized file. Fixed by Custom datetime converter #4.
  • Telling_TK2023_NBSB.eml.xml fails due to a parsing error, this EML file is missing the required <Cast> element under the <ReportingUnitVotes> element.
Complete `pytest` log
================================================== test session starts ===================================================
platform darwin -- Python 3.12.1, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/mark/Developer/kiesraad/pyeml_bindings
collected 387 items                                                                                                      

tests/test_roundtrip.py .......................................................................................... [ 23%]
.................................................................................................................. [ 52%]
.......................................F..............................................F........................... [ 82%]
.....................................................................                                              [100%]

======================================================== FAILURES ========================================================
____ test_roundtrip[parser243-serializer243-print-data/Gemeente tellingen/Telling_TK2023_gemeente_Oisterwijk.eml.xml] ____

parser = XmlParser(config=<xsdata.formats.dataclass.parsers.config.ParserConfig object at 0x10744e140>, context=<xsdata.formats....nl/extensions', 'xal': 'urn:oasis:names:tc:ciq:xsdschema:xAL:2.0', 'xnl': 'urn:oasis:names:tc:ciq:xsdschema:xNL:2.0'})
serializer = XmlSerializer(config=<xsdata.formats.dataclass.serializers.config.SerializerConfig object at 0x1073ffe60>, context=<xs...XmlContext object at 0x10744e020>, writer=<class 'xsdata.formats.dataclass.serializers.writers.native.XmlEventWriter'>)
reporter = <built-in function print>, file = 'data/Gemeente tellingen/Telling_TK2023_gemeente_Oisterwijk.eml.xml'

    @pytest.mark.parametrize(
        "parser, serializer, reporter, file",
        test_cases,
    )
    def test_roundtrip(parser, serializer, reporter, file):
        name = Path(file).name
        if p_110a.match(name):
            assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml110a)
        elif p_230.match(name):
            assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml230)
        elif p_510.match(name):
>           assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml510)
E           AssertionError: assert False
E            +  where False = parsing_roundtrip_same(XmlParser(config=<xsdata.formats.dataclass.parsers.config.ParserConfig object at 0x10744e140>, context=<xsdata.formats....nl/extensions', 'xal': 'urn:oasis:names:tc:ciq:xsdschema:xAL:2.0', 'xnl': 'urn:oasis:names:tc:ciq:xsdschema:xNL:2.0'}), XmlSerializer(config=<xsdata.formats.dataclass.serializers.config.SerializerConfig object at 0x1073ffe60>, context=<xs...XmlContext object at 0x10744e020>, writer=<class 'xsdata.formats.dataclass.serializers.writers.native.XmlEventWriter'>), <built-in function print>, 'data/Gemeente tellingen/Telling_TK2023_gemeente_Oisterwijk.eml.xml', Eml510)

tests/test_roundtrip.py:57: AssertionError
-------------------------------------------------- Captured stdout call --------------------------------------------------
text: '2023-11-23T19:38:38.000' != '2023-11-23T19:38:38'
children 3 do not match: {http://www.kiesraad.nl/extensions}CreationDateTime
___________ test_roundtrip[parser290-serializer290-print-data/Gemeente tellingen/Telling_TK2023_NBSB.eml.xml] ____________

parser = XmlParser(config=<xsdata.formats.dataclass.parsers.config.ParserConfig object at 0x10744e140>, context=<xsdata.formats....nl/extensions', 'xal': 'urn:oasis:names:tc:ciq:xsdschema:xAL:2.0', 'xnl': 'urn:oasis:names:tc:ciq:xsdschema:xNL:2.0'})
serializer = XmlSerializer(config=<xsdata.formats.dataclass.serializers.config.SerializerConfig object at 0x1073ffe60>, context=<xs...XmlContext object at 0x10744e020>, writer=<class 'xsdata.formats.dataclass.serializers.writers.native.XmlEventWriter'>)
reporter = <built-in function print>, file = 'data/Gemeente tellingen/Telling_TK2023_NBSB.eml.xml'

    @pytest.mark.parametrize(
        "parser, serializer, reporter, file",
        test_cases,
    )
    def test_roundtrip(parser, serializer, reporter, file):
        name = Path(file).name
        if p_110a.match(name):
            assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml110a)
        elif p_230.match(name):
            assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml230)
        elif p_510.match(name):
>           assert parsing_roundtrip_same(parser, serializer, reporter, file, Eml510)

tests/test_roundtrip.py:57: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_roundtrip.py:18: in parsing_roundtrip_same
    parsed = parser.from_string(xml_in_string, type)
.venv/lib/python3.12/site-packages/xsdata/formats/bindings.py:22: in from_string
    return self.from_bytes(source.encode(), clazz)
.venv/lib/python3.12/site-packages/xsdata/formats/bindings.py:26: in from_bytes
    return self.parse(io.BytesIO(source), clazz)
.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/bases.py:48: in parse
    result = handler.parse(source)
.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/handlers/native.py:54: in parse
    return self.process_context(ctx)
.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/handlers/native.py:71: in process_context
    self.parser.end(
.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/bases.py:146: in end
    return item.bind(qname, text, tail, objects)
.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/nodes/element.py:81: in bind
    obj = self.config.class_factory(self.meta.clazz, params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'pyeml_bindings.mod_510_count_kiesraad_strict.ReportingUnitVotes'>
params = {'rejected_votes': [ReportingUnitVotes.RejectedVotes(value=2, reason=None, reason_code=<RejectedVotesReasonCode.ONGELD...er=None, referendum_option_identifier=None, valid_votes=2, value=None, category=None), ...], 'total_counted': 417, ...}

    def default_class_factory(cls: Type[T], params: Dict) -> T:
>       return cls(**params)  # type: ignore
E       TypeError: ReportingUnitVotes.__init__() missing 1 required keyword-only argument: 'cast'

.venv/lib/python3.12/site-packages/xsdata/formats/dataclass/parsers/config.py:7: TypeError
==================================================== warnings summary ====================================================
.venv/lib/python3.12/site-packages/formencode/validators.py:8
  /Users/mark/Developer/kiesraad/pyeml_bindings/.venv/lib/python3.12/site-packages/formencode/validators.py:8: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
    import cgi

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================ short test summary info =================================================
FAILED tests/test_roundtrip.py::test_roundtrip[parser243-serializer243-print-data/Gemeente tellingen/Telling_TK2023_gemeente_Oisterwijk.eml.xml] - AssertionError: assert False
FAILED tests/test_roundtrip.py::test_roundtrip[parser290-serializer290-print-data/Gemeente tellingen/Telling_TK2023_NBSB.eml.xml] - TypeError: ReportingUnitVotes.__init__() missing 1 required keyword-only argument: 'cast'
================================== 2 failed, 385 passed, 1 warning in 541.37s (0:09:01) ==================================
chrismostert added a commit that referenced this issue Jan 9, 2024
@chrismostert
Copy link
Contributor

Regarding the first point, this is caused by the way xsdata handles datetime formatting. I've tried monkey patching this formatting and this fixes the issue, but it feels a bit hacky. I guess the other option is to make the change in the xsdata package, but xsdata handles dates and times according to ISO 8601 which does not require the miliseconds (or at least from what I can see).

Regarding the second point, that EML files also fails to validate against the original XSD, so I feel it's fair that these bindings throw the same error. Perhaps this could be handled in a separate test case?

$ xmllint --schema 510-count-kiesraad-strict.xsd Telling_TK2023_NBSB.eml.xml --noout
Telling_TK2023_NBSB.eml.xml:1: element TotalCounted: Schemas validity error : Element '{urn:oasis:names:tc:evs:schema:eml}TotalCounted': This element is not expected. Expected is one of ( {urn:oasis:names:tc:evs:schema:eml}Selection, {urn:oasis:names:tc:evs:schema:eml}Cast ).
Telling_TK2023_NBSB.eml.xml:1: element TotalCounted: Schemas validity error : Element '{urn:oasis:names:tc:evs:schema:eml}TotalCounted': This element is not expected. Expected is one of ( {urn:oasis:names:tc:evs:schema:eml}Selection, {urn:oasis:names:tc:evs:schema:eml}Cast ).
Telling_TK2023_NBSB.eml.xml:1: element TotalCounted: Schemas validity error : Element '{urn:oasis:names:tc:evs:schema:eml}TotalCounted': This element is not expected. Expected is one of ( {urn:oasis:names:tc:evs:schema:eml}Selection, {urn:oasis:names:tc:evs:schema:eml}Cast ).
Telling_TK2023_NBSB.eml.xml:1: element TotalCounted: Schemas validity error : Element '{urn:oasis:names:tc:evs:schema:eml}TotalCounted': This element is not expected. Expected is one of ( {urn:oasis:names:tc:evs:schema:eml}Selection, {urn:oasis:names:tc:evs:schema:eml}Cast ).
Telling_TK2023_NBSB.eml.xml fails to validate

@praseodym
Copy link
Contributor Author

Thanks for the clarification on the first point. It looks like xsData decides to omit milliseconds in the timestamp when the millisecond part is zero, which is a shortsighted way of dealing with it (a bit like saying 1.000 = 1, losing significant figures). Monkeypatching definitely feels hacky. Maybe we can raise an issue with xsData upstream to come up with a better solution?

As for the second point, I'd say we have to figure out why this EML file does not follow the XSD (i.e. how can the OSV2020 software generate an invalid EML file?). I'd say this library has to be able to parse all EML files that the Kiesraad publishes, so skipping files that we deem invalid is not the right way to go.

@chrismostert
Copy link
Contributor

I agree that fixing the data serialization issue upstream feels like a better solution but we'll have to see if they agree with our point of view regarding the formatting. I'll raise an issue!

I don't fully agree with your second point, however. I agree that OSV2020 should always generate valid EML files, but for this election this unfortunately was not the case for the NBSB file. This should be fixed for the following elections.

One of the goals of this package is to give developers guarantees about the underlying data in terms of fields being present, being of the correct type etc. by strictly following the EML_NL specification so that you can be sure that if the data loads, it is complete and correct.

If we choose to relax these data validation rules in this package we lose some of these guarantees, which in my opinion goes against one of the goals of this package. I'd be more in favor of fixing the issue at the data level (i.e. fixing the EML file so that it complies with the standard). What do you think?

@praseodym
Copy link
Contributor Author

praseodym commented Jan 15, 2024

I agree that fixing the data serialization issue upstream feels like a better solution but we'll have to see if they agree with our point of view regarding the formatting. I'll raise an issue!

Sounds good!

I don't fully agree with your second point, however. I agree that OSV2020 should always generate valid EML files, but for this election this unfortunately was not the case for the NBSB file. This should be fixed for the following elections.

One of the goals of this package is to give developers guarantees about the underlying data in terms of fields being present, being of the correct type etc. by strictly following the EML_NL specification so that you can be sure that if the data loads, it is complete and correct.

If we choose to relax these data validation rules in this package we lose some of these guarantees, which in my opinion goes against one of the goals of this package. I'd be more in favor of fixing the issue at the data level (i.e. fixing the EML file so that it complies with the standard). What do you think?

That depends on whether the field should be required from a functional perspective. If keeping the field required in the XSD means we'd in some cases have to fill it with some kind of zero or null value, I think it would be better to make the field optional instead. And to clarify, I meant that we make it optional in the EML_NL standard and not just in this library.

If the field can always be filled with useful information, then it should have been filled by the software that created the EML file and that software should be fixed instead.

@chrismostert
Copy link
Contributor

That depends on whether the field should be required from a functional perspective. If keeping the field required in the XSD means we'd in some cases have to fill it with some kind of zero or null value, I think it would be better to make the field optional instead. And to clarify, I meant that we make it optional in the EML_NL standard and not just in this library.

If the field can always be filled with useful information, then it should have been filled by the software that created the EML file and that software should be fixed instead.

Ah, thanks for the clarification! I thought you suggested for this package to deviate from the standard but I fully agree that we should look into if this field should be made optional since the EML_NL standard has a different interpretation of this field. We (EML_NL) use it to indicate the amount of eligible voters, whereas the original EML standard intended it to be the total amount of cast votes (valid and invalid), see 8.24.1. It seems to me that there can always be a total amount of cast votes (0 is also a total), while we do not necessarily always know the amount of eligible voters for a given reporting unit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants