Scraper does not properly filter-out private videos #362

Closed
benoit74 opened this issue Oct 14, 2024 · 1 comment · Fixed by #363

@benoit74
Collaborator

https://farm.openzim.org/recipes/cest-pas-sorcier_fr_astronomie has failed twice in a row with exit code 139, on two different workers (both beefy and otherwise "empty").

I will try to investigate locally.
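
For context on the exit code: shells and container runtimes conventionally report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV). A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of the scraper.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit code.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

A segmentation fault points at native code (here, most likely libzim) rather than at a Python exception, which would normally produce a traceback and a small exit status instead.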

@benoit74 benoit74 added the bug label Oct 14, 2024
@benoit74 benoit74 self-assigned this Oct 14, 2024
@benoit74 benoit74 changed the title from "Exit code 139" to "Scraper tries to delete a video which is currently being added to the ZIM, causing exit code 139" Oct 14, 2024
@benoit74
Collaborator Author

The problem is nasty.

In this playlist, we have a video (ID FbK-FPwSAFQ) which is now private, but we still have a cached copy of it in S3, so it probably became private only "recently".

The current scraper logic is (significantly simplified):

  • fetch all playlists we have to download
  • for every playlist, fetch its items (videos)
  • download these videos (preferably from S3 cache) and add them directly to the ZIM to save disk space
  • get channel details for every video successfully downloaded
  • filter out videos which failed to download or which do not have accessible channel details (hence private)
  • create JSON (was HTML) for navigating to videos which have been kept
  • delete videos which have not been kept
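
The steps above can be modelled with a small in-memory sketch. This is not the scraper's code: the `Video` class, the `run_current_logic` function, and the rule "a download succeeds if the video is cached in S3 or is still public" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Video:
    video_id: str
    in_s3_cache: bool  # a cached copy exists in S3
    is_private: bool   # channel details are no longer accessible


def run_current_logic(videos: list[Video]) -> dict[str, list[str]]:
    """Simulate the simplified flow and report which videos end up where."""
    # Download (cache first) and add straight to the ZIM.
    # Assumption: a download succeeds if cached or still public.
    added_to_zim = [v.video_id for v in videos if v.in_s3_cache or not v.is_private]

    # Keep only downloaded videos whose channel details are accessible.
    kept = [
        v.video_id
        for v in videos
        if v.video_id in added_to_zim and not v.is_private
    ]

    # Delete everything that has not been kept.
    to_delete = [v.video_id for v in videos if v.video_id not in kept]

    # Videos that are both handed to the ZIM and scheduled for deletion.
    conflict = [vid for vid in to_delete if vid in added_to_zim]
    return {
        "added_to_zim": added_to_zim,
        "kept": kept,
        "to_delete": to_delete,
        "conflict": conflict,
    }


result = run_current_logic(
    [
        Video("FbK-FPwSAFQ", in_s3_cache=True, is_private=True),
        Video("some-public-video", in_s3_cache=False, is_private=False),  # hypothetical ID
    ]
)
print(result["conflict"])
```

Any ID that lands in `conflict` has been added to the ZIM and is also scheduled for deletion, which is the combination a cached-but-private video produces.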

The problem with this video is that its download succeeds (it is present in the S3 cache), so it is added to the ZIM. It is then filtered out because it is private, and the scraper tries to delete the video while libzim is still adding it to the ZIM (this is an async task in libzim), which causes the exit code 139.

I think we should reconsider the cleanup procedure so that it deletes only videos which have not been successfully downloaded.
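
A minimal sketch of what such a cleanup rule could look like, assuming the scraper tracks the set of successfully downloaded video IDs. The function name `videos_to_clean_up` and the ID `failed-video` are hypothetical.

```python
def videos_to_clean_up(all_ids: list[str], downloaded_ids: set[str]) -> list[str]:
    """Return only the IDs whose download did not succeed.

    Videos that were downloaded (and therefore handed to the ZIM creator)
    are never returned, whatever later filtering decides about them.
    """
    return [vid for vid in all_ids if vid not in downloaded_ids]


stale = videos_to_clean_up(
    all_ids=["FbK-FPwSAFQ", "failed-video"],
    downloaded_ids={"FbK-FPwSAFQ"},
)
print(stale)
```

With this rule the cached private video is left alone during cleanup, so nothing is deleted while it may still be in the process of being added to the ZIM.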

And we should also avoid adding to the ZIM a private video which will then be inaccessible (but still consuming space, and probably causing copyright problems).
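
One possible shape for this, sketched under assumptions: check accessibility before downloading, so a private video never reaches the ZIM even when a cached copy exists. The `is_accessible` callback is hypothetical and stands in for whatever check the scraper would use (for example, whether channel details can be fetched).

```python
from typing import Callable, Iterable


def select_videos_to_add(
    video_ids: Iterable[str],
    is_accessible: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split video IDs into those to download and add, and those to skip."""
    to_add: list[str] = []
    skipped: list[str] = []
    for vid in video_ids:
        if is_accessible(vid):
            to_add.append(vid)
        else:
            skipped.append(vid)
    return to_add, skipped


# Stand-in for a real accessibility check: a fixed set of private IDs.
private_ids = {"FbK-FPwSAFQ"}
to_add, skipped = select_videos_to_add(
    ["FbK-FPwSAFQ", "some-public-video"],  # second ID is hypothetical
    is_accessible=lambda vid: vid not in private_ids,
)
print(to_add, skipped)
```

Filtering before the download step means the S3 cache is only consulted for videos that are still public, at the cost of one extra lookup per video up front.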

@benoit74 benoit74 changed the title from "Scraper tries to delete a video which is currently being added to the ZIM, causing exit code 139" to "Scraper does not properly filter-out private videos" Oct 14, 2024
@benoit74 benoit74 added this to the 3.2.1 milestone Oct 14, 2024