Deduplicate ZIM tag values #156

Closed
benoit74 opened this issue Apr 19, 2024 · 6 comments · Fixed by #175

Comments

@benoit74
Collaborator

benoit74 commented Apr 19, 2024

When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake.

https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93

@benoit74 benoit74 added enhancement New feature or request good first issue Good for newcomers labels Apr 19, 2024
@dan-niles

@benoit74 I'd like to work on this.

One possible solution is to convert the list into a set and back to a list again so that duplicates will be removed.

self.tags = list(set([*self.tags, "_category:ted", "ted", "_videos:yes"]))

WDYT?
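
One caveat with the set round-trip above: a `set` does not keep insertion order, so the resulting tag order can differ from the input order. A minimal sketch of an order-preserving alternative, assuming tags are plain strings; `deduplicate_tags` and `user_tags` are hypothetical names used only for illustration, not existing scraper or scraperlib code:

```python
def deduplicate_tags(tags):
    """Return tags with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(tags))


# hypothetical user-supplied tags that already contain two of the defaults
user_tags = ["ted", "science", "_videos:yes"]
tags = deduplicate_tags([*user_tags, "_category:ted", "ted", "_videos:yes"])
print(tags)  # ['ted', 'science', '_videos:yes', '_category:ted']
```

Keeping the order stable makes the resulting metadata reproducible between runs, which the set-based version does not guarantee.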

@rgaudin
Member

rgaudin commented Apr 19, 2024

Should probably be done in scraperlib

@benoit74
Collaborator Author

> Should probably be done in scraperlib

Agreed, let's transfer the issue.

@dan-niles yes, that's the idea, but it should be done in scraperlib so that it benefits all scrapers. Are you still interested?

@benoit74 benoit74 transferred this issue from openzim/ted Apr 19, 2024
@dan-niles

dan-niles commented Apr 20, 2024

@benoit74 Sure, I'm up for it.
I think we can remove the duplicates inside the config_metadata method in the scraperlib code.

I noticed that some scrapers, like ted and youtube, use the make_zim_file function from scraperlib, which initializes a Creator object and calls the config_metadata method.
Others, like warc2zim and kolibri, initialize a Creator object and call the config_metadata method directly.

Since all of these scrapers eventually end up calling the config_metadata method, I think if we do the deduplication there, we only have to update one place. What do you think?
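
To illustrate the idea, here is a rough sketch of deduplicating inside a single metadata entry point. `SketchCreator` and `normalize_tags` are hypothetical stand-ins, not the real scraperlib `Creator` or its actual `config_metadata` signature; the sketch also assumes tags may arrive either as a list or as a semicolon-separated string:

```python
from __future__ import annotations

from collections.abc import Iterable


def normalize_tags(tags: Iterable[str] | str) -> list[str]:
    """Strip whitespace, drop empty tags, and remove duplicates in order."""
    if isinstance(tags, str):
        # assumption: a string value holds semicolon-separated tags
        tags = tags.split(";")
    stripped = (tag.strip() for tag in tags)
    return list(dict.fromkeys(tag for tag in stripped if tag))


class SketchCreator:
    """Hypothetical stand-in for a ZIM creator; NOT scraperlib's Creator."""

    def __init__(self) -> None:
        self.metadata: dict[str, str] = {}

    def config_metadata(
        self, *, Tags: Iterable[str] | str = "", **extras: str
    ) -> SketchCreator:
        # every caller passes through here, so tags are cleaned in one place
        self.metadata.update(extras)
        self.metadata["Tags"] = ";".join(normalize_tags(Tags))
        return self


creator = SketchCreator().config_metadata(
    Tags=["ted", "_videos:yes", "ted ", "_category:ted"], Title="TED"
)
print(creator.metadata["Tags"])  # ted;_videos:yes;_category:ted
```

With this shape, both a helper that wraps the creator and a scraper that calls `config_metadata` directly get the same deduplicated result.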

@benoit74
Collaborator Author

Yep, this makes sense. Good observations!

@benoit74
Collaborator Author

Strongly related to #164; the two should be implemented together.

@benoit74 benoit74 added this to the 3.4.0 milestone Jun 11, 2024
@benoit74 benoit74 modified the milestones: 3.4.0, 3.5.0 Jun 20, 2024
@benoit74 benoit74 assigned benoit74 and unassigned dan-niles Jun 28, 2024
@benoit74 benoit74 modified the milestones: 4.1.0, 4.0.0, 3.5.0 Jul 10, 2024
@benoit74 benoit74 modified the milestones: 3.5.0, 4.0.0 Jul 30, 2024

3 participants