Add new features #11
In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see logic at:
This might also help decide which of the new methods is most worthwhile to implement. For example, we might decide to add new methods related to "insufficiently clean".
Do you mean it should be using something else? (1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context and then evaluating. In principle, true duplicates are not necessarily a problem. What do you think? It looks like 5-6 is probably the heaviest. Other than that, maybe we only need to do (1), in this package? Once the code is ready, we can parse the existing logs and see whether we want to add any rules to open-contracting/data-registry#29
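A minimal sketch of what "decide a threshold" could look like, using the standard Scrapy stats keys discussed in this thread. The `duplicate_request_share` helper and the 0.05 cutoff are illustrative assumptions, not decided values:

```python
# Hypothetical sketch: flag a crawl whose share of duplicate requests
# exceeds a threshold. The stats keys are the standard Scrapy ones;
# the 0.05 threshold is an assumption, not a decided value.
def duplicate_request_share(stats):
    """Return the ratio of filtered duplicate requests to all requests."""
    filtered = stats.get("dupefilter/filtered", 0)
    requests = stats.get("downloader/request_count", 0)
    if not requests:
        return 0.0
    return filtered / requests

stats = {"dupefilter/filtered": 12, "downloader/request_count": 400}
share = duplicate_request_share(stats)
print(share)  # 0.03
too_many_duplicates = share > 0.05  # threshold still to be decided
```

The value could first be stored in the job context (as suggested above) and only later promoted to an acceptance rule, once real crawls show what a reasonable threshold is.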
(1) ok
We should use this information to report the issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner. (7)(8)(9) ok
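One way to differentiate error types, sketched as a hypothetical enum (the reason names and messages are illustrative, not an existing Kingfisher API):

```python
# Hypothetical sketch: tag each FileError with a reason so that invalid
# JSON can be reported to the partner differently from a failed request.
from enum import Enum

class ErrorReason(Enum):
    HTTP_ERROR = "http_error"      # the request itself failed (4xx/5xx)
    INVALID_JSON = "invalid_json"  # the response body is not parseable JSON
    OTHER = "other"

def describe(reason):
    """Return a partner-facing description for an error reason."""
    messages = {
        ErrorReason.HTTP_ERROR: "The server returned an error response.",
        ErrorReason.INVALID_JSON: "The response is not valid JSON.",
    }
    return messages.get(reason, "An unexpected error occurred.")

print(describe(ErrorReason.INVALID_JSON))  # The response is not valid JSON.
```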
Aha, (2) is maybe more a programming error, whereas (1) is a publisher error. It can be reported for information. Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating
For (6), we can use logparser, as described at #14 (comment)
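For our purposes, what logparser recovers can also be sketched with the standard library: pull individual counters out of the stats dict that Scrapy dumps at the end of the crawl log. Extracting single integers by regex sidesteps evaluating the whole dict, which contains non-literal values like datetimes. The helper name and sample log are illustrative:

```python
import re

def stat_from_log(log_text, key):
    """Return an integer stat from the 'Dumping Scrapy stats' block, or 0."""
    match = re.search(r"'%s': (\d+)" % re.escape(key), log_text)
    return int(match.group(1)) if match else 0

# Abridged example of the stats block Scrapy logs at the end of a crawl.
log_tail = """
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 400,
 'dupefilter/filtered': 12,
 'item_scraped_count': 388}
"""
print(stat_from_log(log_tail, "dupefilter/filtered"))  # 12
```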
Hmm, for dropped items, looking at Kingfisher Collect, I see we drop items in 3 different scenarios:
So, the Should we just remove
Yes, makes sense
The library currently has:

- `File` vs `FileError`

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, open-contracting/kingfisher-collect#531, open-contracting/kingfisher-collect#1058 and open-contracting/kingfisher-collect#1055, I think we need to add:

- `item_dropped_count` vs `item_scraped_count`
- `dupefilter/filtered` vs `downloader/request_count` (https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests)
- `invalid_json_count` vs `item_scraped_count`
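A sketch of how the proposed comparisons could look as properties on a ScrapyLogFile-like class. The class name, property names, and the `kingfisher/…` stats keys are illustrative assumptions; the real class would read these counts out of the parsed log:

```python
# Hypothetical sketch of the proposed "X vs Y" comparisons as ratios.
class CrawlQuality:
    def __init__(self, stats):
        self.stats = stats

    def _share(self, numerator, denominator):
        total = self.stats.get(denominator, 0)
        return self.stats.get(numerator, 0) / total if total else 0.0

    @property
    def drop_rate(self):  # item_dropped_count vs item_scraped_count
        return self._share("item_dropped_count", "item_scraped_count")

    @property
    def duplicate_rate(self):  # dupefilter/filtered vs downloader/request_count
        return self._share("dupefilter/filtered", "downloader/request_count")

    @property
    def invalid_json_rate(self):  # invalid_json_count vs item_scraped_count
        return self._share("invalid_json_count", "item_scraped_count")

quality = CrawlQuality({"item_dropped_count": 4, "item_scraped_count": 400})
print(quality.drop_rate)  # 0.01
```

Each ratio could then be checked against its own threshold when deciding whether to accept a collection.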
@jpmckinney any others? Do you agree with implementing all of them?