Deduplicate ZIM tag values #156

benoit74 · 2024-04-19T08:31:16Z

When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake.

https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93

dan-niles · 2024-04-19T11:53:59Z

@benoit74 I'd like to work on this.

One possible solution is to convert the list into a set and back to a list again so that duplicates will be removed.

self.tags = list(set([*self.tags, "_category:ted", "ted", "_videos:yes"]))
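One caveat with the set round-trip is that a set does not keep insertion order, so the resulting tag order can change. A minimal sketch of an order-preserving variant, assuming a hypothetical helper named `deduplicate_tags` (not an existing scraperlib function) and made-up user tags:

```python
def deduplicate_tags(tags):
    """Return the tags without duplicates, keeping first-seen order.

    Hypothetical helper for illustration; not part of scraperlib.
    """
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(tags))


# Made-up user-supplied tags that already contain "ted" twice
user_tags = ["ted", "science", "ted"]
tags = deduplicate_tags([*user_tags, "_category:ted", "ted", "_videos:yes"])
print(tags)  # ['ted', 'science', '_category:ted', '_videos:yes']
```

Keeping the order stable should make the resulting metadata reproducible between runs, which the set-based version does not guarantee.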

WDYT?

rgaudin · 2024-04-19T12:16:47Z

Should probably be done in scraperlib

benoit74 · 2024-04-19T12:24:46Z

> Should probably be done in scraperlib

Agreed, let's transfer the issue.

@dan-niles yes, that's the idea, but it should be done in scraperlib so that it benefits all scrapers. Are you still interested?

dan-niles · 2024-04-20T05:27:38Z

@benoit74 Sure, I'm up for it.
I think we can remove the duplicates inside the config_metadata method in the scraperlib code.

I noticed that some scrapers, like ted and youtube, use the make_zim_file function from scraperlib, which initializes a Creator object and calls the config_metadata method.
Others, like warc2zim and kolibri, initialize a Creator object and call the config_metadata method directly.

Since all of these scrapers eventually end up calling the config_metadata method, I think that if we do the deduplication there, we only have to change one place. What do you think?
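To illustrate the idea, here is a rough sketch of both call paths ending up in the same deduplication step. The `SketchCreator` class and `make_zim_file_sketch` function are hypothetical stand-ins, not the real scraperlib code, and the `Tags` keyword and the `;` separator are assumptions about how tags are passed:

```python
class SketchCreator:
    """Hypothetical stand-in for scraperlib's Creator; not the real class."""

    def __init__(self):
        self.metadata = {}

    def config_metadata(self, **metadata):
        # Assumption: tags arrive under a "Tags" key, either as a
        # ";"-separated string or as an iterable of strings.
        tags = metadata.get("Tags")
        if tags is not None:
            if isinstance(tags, str):
                tags = tags.split(";")
            # Trim whitespace and drop empty entries before deduplicating
            cleaned = [tag.strip() for tag in tags if tag.strip()]
            # dict.fromkeys removes duplicates and keeps first-seen order
            metadata["Tags"] = list(dict.fromkeys(cleaned))
        self.metadata.update(metadata)
        return self


def make_zim_file_sketch(**metadata):
    """Hypothetical helper path, as used by scrapers such as ted and youtube."""
    return SketchCreator().config_metadata(**metadata)


# Path 1: through the helper function
via_helper = make_zim_file_sketch(Tags=["ted", "_category:ted", "ted"])
# Path 2: calling config_metadata directly, as warc2zim and kolibri do
direct = SketchCreator().config_metadata(Tags="wikipedia;wikipedia; _videos:yes")
```

Because both paths go through `config_metadata`, the scrapers themselves would not need to change.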

benoit74 · 2024-04-30T12:39:18Z

Yep, this makes sense. Good observations!

benoit74 · 2024-06-11T11:28:19Z

Strongly related to #164, should be implemented together

benoit74 added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Apr 19, 2024

benoit74 transferred this issue from openzim/ted Apr 19, 2024

benoit74 assigned dan-niles Apr 30, 2024

benoit74 mentioned this issue Jun 11, 2024

Add utility function to compute ZIM Tags #164

Closed

benoit74 added this to the 3.4.0 milestone Jun 11, 2024

benoit74 modified the milestones: 3.4.0, 3.5.0 Jun 20, 2024

benoit74 mentioned this issue Jun 28, 2024

Enhance tags manipulation #175

Merged

benoit74 assigned benoit74 and unassigned dan-niles Jun 28, 2024

benoit74 closed this as completed in #175 Jul 4, 2024

benoit74 modified the milestones: 4.1.0, 4.0.0, 3.5.0 Jul 10, 2024

benoit74 modified the milestones: 3.5.0, 4.0.0 Jul 30, 2024
