
Add new features #11

Closed
yolile opened this issue Apr 24, 2024 · 6 comments · Fixed by #14
Labels
enhancement New feature or request

Comments


yolile commented Apr 24, 2024

The library currently has:

  • crawl_time
  • is_finished
  • item_counts
  • spider_arguments
  • is_complete
  • error_rate (but using File vs FileError)

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, open-contracting/kingfisher-collect#531, open-contracting/kingfisher-collect#1058 and open-contracting/kingfisher-collect#1055 I think we need to add:

  1. drop_rate: item_dropped_count vs item_scraped_count
  2. duplicate_request_rate: dupefilter/filtered vs downloader/request_count https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
  3. invalid_json_rate: invalid_json_count vs item_scraped_count
  4. finish_reason: cancelled, finished, etc as documented https://kingfisher-collect.readthedocs.io/en/latest/logs.html#check-the-reason-for-closing-the-spider
  5. errors_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  6. errors_list: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  7. response_status_counts: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-response-status-codes
  8. exception_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-downloader-exceptions
  9. retry_times_reached: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-requests-for-which-the-maximum-number-of-retries-was-reached
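
As a rough illustration of where these numbers come from, here is a minimal sketch that derives the proposed rates and counts from a Scrapy crawler stats dict (e.g. logparser["crawler_stats"]). The key names follow Scrapy's built-in stats plus the invalid_json_count stat referenced above; this is not the ScrapyLogFile API, and errors_list (items 5-6) comes from the log messages rather than the stats, so it is not derived here.

```python
# Sketch only: derive the proposed metrics from a Scrapy crawler stats dict,
# e.g. logparser["crawler_stats"]. Key names are Scrapy's built-in stats plus
# the custom invalid_json_count stat mentioned above; denominators follow the
# pairs listed in the items above.
def proposed_metrics(stats: dict) -> dict:
    scraped = stats.get("item_scraped_count", 0)
    requests = stats.get("downloader/request_count", 0)
    return {
        "drop_rate": stats.get("item_dropped_count", 0) / scraped if scraped else 0.0,
        "duplicate_request_rate": stats.get("dupefilter/filtered", 0) / requests if requests else 0.0,
        "invalid_json_rate": stats.get("invalid_json_count", 0) / scraped if scraped else 0.0,
        "finish_reason": stats.get("finish_reason"),
        "errors_count": stats.get("log_count/ERROR", 0),
        "response_status_counts": {
            key.rsplit("/", 1)[-1]: value  # e.g. {"200": 1234, "500": 3}
            for key, value in stats.items()
            if key.startswith("downloader/response_status_count/")
        },
        "exception_count": stats.get("downloader/exception_count", 0),
        "retry_times_reached": stats.get("retry/max_reached", 0),
    }
```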

@jpmckinney any others? Do you agree with implementing all of them?

yolile added the enhancement label Apr 24, 2024

jpmckinney commented Apr 25, 2024

In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see logic at:

This might also help decide which of the new methods is most worthwhile to implement.

For example, we might decide to add new methods related to "insufficiently clean".


error_rate (but using File vs FileError)

Do you mean it should be using something else?

(1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context, and then evaluate. In principle, true duplicates are not necessarily a problem.
(2) Where is duplicate request rate from?
(3) Maybe we should change the middleware to yield a FileError? That way, invalid JSON gets counted in the error rate. (Invalid JSON does seem like an error.)
(4) I think is_finished is all that matters. The reason itself can be accessed as logparser["crawler_stats"] – we don't need a new method for a simple access.
(5-6) I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.
(7) In principle, these should lead to FileError items. I think these might just be for reporting (e.g. logreport prints those out from logparser["crawler_stats"], without needing a new method in this package).
(8) I think this can be covered with 5-6.
(9) This is more to debug retries. In terms of collection quality, URLs that reach the max retries end up being FileErrors. Maybe we just report the retry/max_reached and let the user decide whether to investigate.

What do you think?

It looks like 5-6 is probably the heaviest. Other than that, maybe we only need to do (1), in this package? Once the code is ready, we can parse the existing logs and see if we want to add any rules to open-contracting/data-registry#29
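
As a rough sketch of the "store the value in the job context, then evaluate against a threshold" idea for (1): the threshold value and the names below are placeholders, not an agreed policy or an existing API.

```python
# Sketch only: store the rate in the job context, then evaluate it against a
# threshold. The 0.05 value and all names here are hypothetical.
DROP_RATE_THRESHOLD = 0.05  # placeholder; the real threshold is still to be decided


def evaluate_drop_rate(stats: dict, job_context: dict) -> bool:
    scraped = stats.get("item_scraped_count", 0)
    rate = stats.get("item_dropped_count", 0) / scraped if scraped else 0.0
    job_context["drop_rate"] = rate  # store first, decide on acceptance later
    return rate <= DROP_RATE_THRESHOLD
```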


yolile commented Apr 25, 2024

(1) ok
(2) I've edited the issue, but from https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
(4) Ok

(3) and
(5-6) I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.

We should use this information to report issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner.

(7)(8)(9) ok

@jpmckinney

Aha - (2) is maybe more a programming error, whereas (1) is a publisher error. It can be reported for information.

Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating error_rate, we should include invalid JSON.
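
A minimal sketch of an error_rate that counts invalid JSON alongside FileError items follows; the counts are passed in explicitly because the exact stat names used by ScrapyLogFile are not shown in this thread, and this is only one possible formulation.

```python
# Sketch only: one possible error_rate that treats invalid JSON as an error.
# Counts are passed in explicitly, since the exact stat names are not shown here.
def error_rate(file_count: int, file_error_count: int, invalid_json_count: int) -> float:
    errors = file_error_count + invalid_json_count
    total = file_count + file_error_count + invalid_json_count
    return errors / total if total else 0.0
```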

@jpmckinney

For 6 we can use logparser as described at #14 (comment)
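
A rough sketch of pulling the error counts and messages (items 5-6) from a crawl log with the logparser package: crawler_stats is the key referenced earlier in this thread, while the log_categories layout is an assumption to verify against logparser's documentation.

```python
# Sketch only: read error counts and messages from a crawl log via logparser.
# crawler_stats is referenced earlier in this thread; the log_categories layout
# is an assumption to check against logparser's documentation.
from logparser import parse

with open("spider.log") as f:  # path is illustrative
    data = parse(f.read())

stats = data["crawler_stats"]
errors_count = stats.get("log_count/ERROR", 0)
error_messages = data.get("log_categories", {}).get("error_logs", {}).get("details", [])
```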


yolile commented Dec 12, 2024

Hmm, for dropped items, looking at Kingfisher Collect, I see we drop items in 3 different scenarios:

  • Sample (not relevant)
  • Duplicates (not a reason for rejecting a collection for the registry but good for reporting to the partner)
  • Invalid JSONs (a good reason for rejecting a collection, and already included in error_rate)

So drop_rate won't be used for rejecting a collection, and a single rate that mixes duplicates and invalid JSONs is not useful for reporting.

Should we just remove drop_rate and check item_dropped_count and invalid_json_count individually as needed in open-contracting/kingfisher-collect#531 instead?
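
A minimal sketch of checking the two counts individually, as suggested, instead of a combined drop_rate; the stat names follow the Scrapy and Kingfisher Collect stats referenced above.

```python
# Sketch only: check the two counts individually instead of a combined drop_rate.
def counts_for_review(stats: dict) -> dict:
    return {
        # duplicates (and samples): worth reporting to the partner, not a rejection reason
        "item_dropped_count": stats.get("item_dropped_count", 0),
        # invalid JSON: already reflected in the error rate used for rejection
        "invalid_json_count": stats.get("invalid_json_count", 0),
    }
```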

@jpmckinney

Yes, makes sense
