Acceptance criteria - Kingfisher Collect #29

Open
hrubyjan opened this issue May 4, 2021 · 12 comments
Labels
component: orchestration operations Actions to be performed by administrators in the normal operation of the system

Comments

@hrubyjan

hrubyjan commented May 4, 2021

At the end of each data processing phase, we should evaluate whether it ended well, whether something looks suspicious, or whether the phase failed.
For the collect phase, define criteria that will:
a) prevent a dataset from being published in the data registry

  • we shouldn't be too defensive; we should only try to detect obvious problems

b) raise a warning, but not prevent the dataset from being published

  • this should notify the admin that there is an issue worth inspecting

We should not insist on having criteria if we cannot find meaningful rules.
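
The two kinds of criteria above could be modelled as rules with a severity. The following is a minimal sketch, not existing Kingfisher or data registry code: `Severity`, `Rule`, `evaluate` and the two example rules are hypothetical, and the stats dictionary is assumed to hold Scrapy crawl statistics such as `finish_reason` and `log_count/ERROR`.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    WARNING = "warning"    # (b) notify the admin, but still publish
    BLOCKING = "blocking"  # (a) prevent publication


@dataclass(frozen=True)
class Rule:
    name: str
    severity: Severity
    # Returns True when the rule is violated for the given crawl stats.
    violated: Callable[[dict], bool]


def evaluate(stats: dict, rules: list[Rule]) -> tuple[str, list[str]]:
    """Return ("ok" | "warning" | "failed", names of the violated rules)."""
    hits = [rule for rule in rules if rule.violated(stats)]
    names = [rule.name for rule in hits]
    if any(rule.severity is Severity.BLOCKING for rule in hits):
        return "failed", names
    if hits:
        return "warning", names
    return "ok", names


# Illustrative rules only (assumptions, not an agreed policy).
RULES = [
    Rule("crawl did not finish", Severity.BLOCKING,
         lambda s: s.get("finish_reason") != "finished"),
    Rule("errors were logged", Severity.WARNING,
         lambda s: s.get("log_count/ERROR", 0) > 0),
]
```

Keeping the rules in a plain list makes it easy to start with very few of them and add more only once a meaningful rule is found.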

@jpmckinney
Member

jpmckinney commented May 12, 2021

scrapy_log_file.py needs to be extracted from https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/scrapy_log_file.py to a small library.

Then, we can use it to apply a policy. Here's a sample policy: https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/crawl.py#L136-L169

Related: open-contracting-archive/kingfisher-archive#44

We want this as a library, so that it can also be used by Kingfisher Collect. open-contracting/kingfisher-collect#531

@hrubyjan
Author

We can test on:

  • Indonesia bandung
  • Kyrgyzstan
  • Mexico inai portal
  • Mexico quien es quien

These all went wrong in the scrape phase, so the task should fail and the process task should not start.

@jpmckinney
Member

If you update Collect, Mexico quien es quien will work again :)

@jpmckinney
Member

Also, Mexico INAI portal no longer exists in Collect (if you update it).

@jpmckinney jpmckinney changed the title from "Acceptance criteria - collect" to "Acceptance criteria - Kingfisher Collect" on Aug 28, 2021
@jpmckinney
Member

@hrubyjan Where are the scrapyd log files?

@jpmckinney
Member

Assigning only for the last question for now.

@hrubyjan
Author

The job context contains a reference to the corresponding log. For example, you can run the following command to get the log for scraping the Kyrgyzstan data:
curl http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log

{
    "job_id": "cadd2904064011ec95d5a8a159689b50",
    "spider": "kyrgyzstan",
    "pelican_id": 1104,
    "process_id": "477",
    "scrapy_log": "http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log",
    "process_id_pelican": 478,
    "pelican_dataset_name": "kyrgyzstan_2021-08-26T07:39:50_212",
    "process_data_version": "2021-08-26T07:39:50"
}
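
For scripted checks, the log URL can be read from the job context instead of being typed by hand. This is a sketch using only the Python standard library, assuming the job context is available as a JSON string shaped like the example above; `scrapy_log_url` and `fetch_scrapy_log` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def scrapy_log_url(job_context: str) -> str:
    """Return the "scrapy_log" URL stored in a JSON job context."""
    return json.loads(job_context)["scrapy_log"]


def fetch_scrapy_log(job_context: str, timeout: float = 10.0) -> str:
    """Download the log text from Scrapyd (needs network access to that host)."""
    with urlopen(scrapy_log_url(job_context), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Shortened copy of the job context shown above.
context = """{
    "job_id": "cadd2904064011ec95d5a8a159689b50",
    "spider": "kyrgyzstan",
    "scrapy_log": "http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log"
}"""
print(scrapy_log_url(context))
```

Calling `fetch_scrapy_log(context)` only works where the Scrapyd host in the URL is reachable, for example on the server itself when the URL points at localhost.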

@hrubyjan
Author

I'll add this information to the Admin guide.

@jakubkrafka jakubkrafka removed their assignment Aug 31, 2021
@hrubyjan hrubyjan removed their assignment Aug 31, 2021
@jpmckinney
Member

Container files are also in the overlay2 directory.

@jpmckinney
Member

Can also check the dropped items statistic (following idea from open-contracting/kingfisher-collect#1055)
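
As a rough illustration of the dropped items check: Scrapy records `item_dropped_count` and `item_scraped_count` in the statistics it pretty-prints at the end of a crawl log, after "Dumping Scrapy stats:". The sketch below pulls integer-valued statistics out of that dump with a regular expression and computes the share of dropped items. The function names and the sample log are made up for this example, and what ratio should count as suspicious is left open.

```python
import re

# Matches integer-valued entries such as "'item_scraped_count': 95," in the
# stats dictionary; floats, strings and datetime values are skipped.
STAT_PATTERN = re.compile(r"'([^']+)': (\d+)[,}]")


def integer_stats(log_text: str) -> dict:
    """Collect integer statistics from the end-of-crawl stats dump, if any."""
    _, found, dump = log_text.partition("Dumping Scrapy stats:")
    if not found:
        return {}
    return {key: int(value) for key, value in STAT_PATTERN.findall(dump)}


def dropped_ratio(stats: dict) -> float:
    """Share of items that were dropped, out of dropped plus scraped items."""
    dropped = stats.get("item_dropped_count", 0)
    scraped = stats.get("item_scraped_count", 0)
    total = dropped + scraped
    return dropped / total if total else 0.0


# Made-up log excerpt in the shape of a Scrapy stats dump.
sample_log = """2021-08-26 07:45:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 130,
 'elapsed_time_seconds': 12.5,
 'finish_reason': 'finished',
 'item_dropped_count': 5,
 'item_scraped_count': 95}
2021-08-26 07:45:00 [scrapy.core.engine] INFO: Spider closed (finished)
"""
stats = integer_stats(sample_log)
```

A regular expression is used here because the dump can contain `datetime` values, which rules out a plain literal parser; a dedicated log-file library would be the more robust choice.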

@jpmckinney jpmckinney added the operations label (Actions to be performed by administrators in the normal operation of the system) and removed the devops label on Apr 27, 2024
@jpmckinney
Member

jpmckinney commented Dec 11, 2024

Re: notifications, we can have manageprocess print output if the job has warnings, or if the job failed due to failing whatever policy we come up with (e.g. error_rate > 0.5 ...).

To start, it's okay to not have any policy (other than no files to process, which already causes the process task to fail), and we can later decide on a policy based on what warnings we observe.
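
To make the notification idea concrete, here is a small sketch of output that is empty unless the job failed or has warnings, so that a command only prints when there is something to inspect. Everything in it is an assumption for illustration: the function names, the message wording, and in particular the definition of `error_rate` as errors divided by errors plus items. The 0.5 threshold is just the example value mentioned above.

```python
from __future__ import annotations


def error_rate(item_count: int, error_count: int) -> float:
    """Assumed definition: errors as a share of all results (items + errors)."""
    total = item_count + error_count
    return error_count / total if total else 0.0


def job_report(item_count: int, error_count: int, warnings: list[str],
               max_error_rate: float = 0.5) -> str:
    """Return text for the admin, or "" when there is nothing to report."""
    lines = []
    rate = error_rate(item_count, error_count)
    if item_count == 0:
        # The one failure condition that is already in place.
        lines.append("FAILED: no files to process")
    elif rate > max_error_rate:
        lines.append(f"FAILED: error_rate {rate:.2f} > {max_error_rate}")
    lines.extend(f"WARNING: {warning}" for warning in warnings)
    return "\n".join(lines)


report = job_report(item_count=10, error_count=30, warnings=["items were dropped"])
if report:
    print(report)
```

Starting with no policy then amounts to never passing a threshold check and only collecting warnings, which can later inform what the threshold should be.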
