Scraper does not properly filter-out private videos #362

benoit74 · 2024-10-14T08:11:42Z

https://farm.openzim.org/recipes/cest-pas-sorcier_fr_astronomie has failed twice in a row with exit code 139, on two different workers (both beefy and "empty").

I will try to investigate locally.

benoit74 · 2024-10-14T09:39:15Z

The problem is nasty.

In this playlist, one video (ID FbK-FPwSAFQ) is now private, but we still have a cached copy of it in S3, so it probably became private only "recently".

Currently, the scraper logic is (significantly simplified):

- fetch all playlists we have to download
- for every playlist, fetch its items (videos)
- download these videos (preferably from the S3 cache) and add them directly to the ZIM to save disk space
- get the channel details of every video successfully downloaded
- filter out videos which failed to download or which do not have accessible channel details (hence private)
- create JSON (was HTML) for navigating to the videos which have been kept
- delete videos which have not been kept

The problem with this video is that it downloads successfully (it is present in the S3 cache) and is hence added to the ZIM. It is then filtered out because it is private, so the scraper tries to delete the video while libzim is still adding it to the ZIM (this is an async task in libzim), which causes the exit code 139.
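For context, shells conventionally report a process killed by signal N as exit code 128 + N, so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch to decode such a code, assuming a POSIX-style convention (the helper name `describe_exit_code` is hypothetical and not part of the scraper):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description."""
    # Shells report "killed by signal N" as 128 + N.
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # No signal with that number on this platform.
            return f"killed by unknown signal {code - 128}"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

On Linux this reports `killed by SIGSEGV`, which is consistent with a file being removed while native code is still reading it.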

I think we should reconsider the cleanup procedure so that it really deletes only videos which have not been successfully downloaded.
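A rough sketch of what such a cleanup could look like, assuming each video gets its own working directory named after its ID. The function name, the directory layout and the way IDs are tracked are all assumptions for illustration, not the scraper's actual code:

```python
import shutil
from pathlib import Path


def cleanup_failed_downloads(
    video_dir: Path, all_ids: set[str], succeeded_ids: set[str]
) -> list[str]:
    """Delete leftovers only for videos whose download did not succeed.

    Videos in succeeded_ids are never touched, since their files may
    still be in use by the (asynchronous) ZIM creator.
    """
    removed = []
    for video_id in sorted(all_ids - succeeded_ids):
        leftover = video_dir / video_id  # assumed layout: one folder per video
        if leftover.is_dir():
            shutil.rmtree(leftover)
            removed.append(video_id)
    return removed
```

The point of the design is that the set of files to delete is derived from download failures only, never from later filtering decisions.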

We should also avoid adding a private video to the ZIM: it would then be inaccessible, yet still consume space and probably cause copyright problems.
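One way to do this is to drop non-public items before any download starts. The sketch below assumes each playlist item is held as a dict carrying the `status.privacyStatus` field returned by the YouTube Data API v3. The function name and the sample data (other than the video ID from this issue) are hypothetical, and whether unlisted videos should be kept is a policy choice this sketch does not settle:

```python
def keep_public_videos(items: list[dict]) -> list[dict]:
    """Return only the items whose privacy status is 'public'.

    Items with a missing status are dropped as well, so nothing of
    unknown visibility reaches the download step.
    """
    return [
        item
        for item in items
        if item.get("status", {}).get("privacyStatus") == "public"
    ]


# Illustrative data: only the first ID comes from this issue.
playlist_items = [
    {"id": "FbK-FPwSAFQ", "status": {"privacyStatus": "private"}},
    {"id": "hypothetical-public-id", "status": {"privacyStatus": "public"}},
]
to_download = keep_public_videos(playlist_items)
```

With this ordering, a private video is never fetched from the S3 cache and never handed to libzim, so there is nothing to delete afterwards.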

benoit74 added the bug label Oct 14, 2024

benoit74 self-assigned this Oct 14, 2024

benoit74 mentioned this issue Oct 14, 2024

New request: (A few playlists) of Youtube Channel "C'est pas sorcier" openzim/zim-requests#1182

Closed

benoit74 changed the title ~~Exit code 139~~ Scraper tries to delete a video which is currently being added to the ZIM, causing exit code 139 Oct 14, 2024

benoit74 changed the title ~~Scraper tries to delete a video which is currently being added to the ZIM, causing exit code 139~~ Scraper does not properly filter-out private videos Oct 14, 2024

benoit74 mentioned this issue Oct 14, 2024

Filter-out non-public videos and properly cleanup unsuccessful videos #363

Merged

benoit74 added this to the 3.2.1 milestone Oct 14, 2024

benoit74 closed this as completed in #363 Oct 17, 2024
