
Add new features #11

Closed
yolile opened this issue Apr 24, 2024 · 6 comments · Fixed by #14
Labels
enhancement New feature or request

Comments


yolile commented Apr 24, 2024

The library currently has:

  • crawl_time
  • is_finished
  • item_counts
  • spider_arguments
  • is_complete
  • error_rate (but using File vs FileError)

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, open-contracting/kingfisher-collect#531, open-contracting/kingfisher-collect#1058 and open-contracting/kingfisher-collect#1055 I think we need to add:

  1. drop_rate: item_dropped_count vs item_scraped_count
  2. duplicate_request_rate: dupefilter/filtered vs downloader/request_count https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
  3. invalid_json_rate: invalid_json_count vs item_scraped_count
  4. finish_reason: cancelled, finished, etc as documented https://kingfisher-collect.readthedocs.io/en/latest/logs.html#check-the-reason-for-closing-the-spider
  5. errors_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  6. errors_list: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  7. response_status_counts: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-response-status-codes
  8. exception_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-downloader-exceptions
  9. retry_times_reached: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-requests-for-which-the-maximum-number-of-retries-was-reached
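
As a rough illustration of where these numbers come from, here is a minimal sketch that derives the proposed rates and counts from a Scrapy crawler stats dict (e.g. logparser["crawler_stats"]). The key names follow Scrapy's built-in stats plus the invalid_json_count stat referenced above; this is not the ScrapyLogFile API, and errors_list (items 5-6) comes from the log messages rather than the stats, so it is not derived here.

```python
# Sketch only: derive the proposed metrics from a Scrapy crawler stats dict,
# e.g. logparser["crawler_stats"]. Key names are Scrapy's built-in stats plus
# the custom invalid_json_count stat mentioned above; denominators follow the
# pairs listed in the items above.
def proposed_metrics(stats: dict) -> dict:
    scraped = stats.get("item_scraped_count", 0)
    requests = stats.get("downloader/request_count", 0)
    return {
        "drop_rate": stats.get("item_dropped_count", 0) / scraped if scraped else 0.0,
        "duplicate_request_rate": stats.get("dupefilter/filtered", 0) / requests if requests else 0.0,
        "invalid_json_rate": stats.get("invalid_json_count", 0) / scraped if scraped else 0.0,
        "finish_reason": stats.get("finish_reason"),
        "errors_count": stats.get("log_count/ERROR", 0),
        "response_status_counts": {
            key.rsplit("/", 1)[-1]: value  # e.g. {"200": 1234, "500": 3}
            for key, value in stats.items()
            if key.startswith("downloader/response_status_count/")
        },
        "exception_count": stats.get("downloader/exception_count", 0),
        "retry_times_reached": stats.get("retry/max_reached", 0),
    }
```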

@jpmckinney any others? Do you agree with implementing all of them?

yolile added the enhancement label Apr 24, 2024

jpmckinney commented Apr 25, 2024

In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see logic at:

This might also help decide which of the new methods is most worthwhile to implement.

For example, we might decide to add new methods related to "insufficiently clean".


error_rate (but using File vs FileError)

Do you mean it should be using something else?

(1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context, and then evaluate. In principle, true duplicates are not necessarily a problem.
(2) Where is duplicate request rate from?
(3) Maybe we should change the middleware to yield a FileError? That way, invalid JSON gets counted in the error rate. (Invalid JSON does seem like an error.)
(4) I think is_finished is all that matters. The reason itself can be accessed as logparser["crawler_stats"] – we don't need a new method for a simple access.
(5-6) I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.
(7) In principle, these should lead to FileError items. I think these might just be for reporting (e.g. logreport prints those out from logparser["crawler_stats"], without needing a new method in this package).
(8) I think this can be covered with 5-6.
(9) This is more to debug retries. In terms of collection quality, URLs that reach the max retries end up being FileErrors. Maybe we just report the retry/max_reached and let the user decide whether to investigate.

What do you think?

It looks like 5-6 is probably the heaviest. Other than that, maybe we only need to do (1), in this package? Once the code is ready, we can parse the existing logs and see if we want to add any rules to open-contracting/data-registry#29
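
As a rough sketch of the "store the value in the job context, then evaluate against a threshold" idea for (1): the threshold value and the names below are placeholders, not an agreed policy or an existing API.

```python
# Sketch only: store the rate in the job context, then evaluate it against a
# threshold. The 0.05 value and all names here are hypothetical.
DROP_RATE_THRESHOLD = 0.05  # placeholder; the real threshold is still to be decided


def evaluate_drop_rate(stats: dict, job_context: dict) -> bool:
    scraped = stats.get("item_scraped_count", 0)
    rate = stats.get("item_dropped_count", 0) / scraped if scraped else 0.0
    job_context["drop_rate"] = rate  # store first, decide on acceptance later
    return rate <= DROP_RATE_THRESHOLD
```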


yolile commented Apr 25, 2024

(1) ok
(2) I've edited the issue, but from https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
(4) Ok

(3) and
(5-6) I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.

We should use this information to report issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner.

(7)(8)(9) ok

@jpmckinney

Aha - (2) is maybe more a programming error, whereas (1) is a publisher error. It can be reported for information.

Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating error_rate, we should include invalid JSON.
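
A minimal sketch of an error_rate that counts invalid JSON alongside FileError items follows; the counts are passed in explicitly because the exact stat names used by ScrapyLogFile are not shown in this thread, and this is only one possible formulation.

```python
# Sketch only: one possible error_rate that treats invalid JSON as an error.
# Counts are passed in explicitly, since the exact stat names are not shown here.
def error_rate(file_count: int, file_error_count: int, invalid_json_count: int) -> float:
    errors = file_error_count + invalid_json_count
    total = file_count + file_error_count + invalid_json_count
    return errors / total if total else 0.0
```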

@jpmckinney

For 6 we can use logparser as described at #14 (comment)
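
A rough sketch of pulling the error counts and messages (items 5-6) from a crawl log with the logparser package: crawler_stats is the key referenced earlier in this thread, while the log_categories layout is an assumption to verify against logparser's documentation.

```python
# Sketch only: read error counts and messages from a crawl log via logparser.
# crawler_stats is referenced earlier in this thread; the log_categories layout
# is an assumption to check against logparser's documentation.
from logparser import parse

with open("spider.log") as f:  # path is illustrative
    data = parse(f.read())

stats = data["crawler_stats"]
errors_count = stats.get("log_count/ERROR", 0)
error_messages = data.get("log_categories", {}).get("error_logs", {}).get("details", [])
```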


yolile commented Dec 12, 2024

Hmm, for dropped items, looking at Kingfisher Collect, I see we drop items in 3 different scenarios:

  • Sample (not relevant)
  • Duplicates (not a reason for rejecting a collection for the registry but good for reporting to the partner)
  • Invalid JSONs (a good reason for rejecting a collection, and already included in error_rate)

So drop_rate won't be used for rejecting a collection, and a single rate that mixes duplicates and invalid JSONs is not useful for reporting.

Should we just remove drop_rate and check item_dropped_count and invalid_json_count individually as needed in open-contracting/kingfisher-collect#531 instead?
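
A minimal sketch of checking the two counts individually, as suggested, instead of a combined drop_rate; the stat names follow the Scrapy and Kingfisher Collect stats referenced above.

```python
# Sketch only: check the two counts individually instead of a combined drop_rate.
def counts_for_review(stats: dict) -> dict:
    return {
        # duplicates (and samples): worth reporting to the partner, not a rejection reason
        "item_dropped_count": stats.get("item_dropped_count", 0),
        # invalid JSON: already reflected in the error rate used for rejection
        "invalid_json_count": stats.get("invalid_json_count", 0),
    }
```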

@jpmckinney

Yes, makes sense
