Duplicate Detection

This repository contains the source code and instructions for the duplicate detection service deployed in X5GON.

This file contains the script that updates the X5GON DB by setting a newly created boolean column, duplicate, to TRUE for documents that contain exactly the same values.

8542 exact duplicate materials were found, with 3230 distinct values, which implies that 5312 documents (8542 - 3230) can be disregarded as duplicates.
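
The flagging step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the helper name `flag_exact_duplicates`, the use of a SHA-256 hash to group identical values, and the choice to keep the first-seen material of each group unflagged are all assumptions. Keeping one document per group of identical values is what gives the 8542 - 3230 = 5312 count.

```python
import hashlib
from collections import defaultdict


def flag_exact_duplicates(materials):
    """Return {material_id: bool}; True marks a redundant exact duplicate.

    `materials` is an iterable of (material_id, text) pairs.
    Hypothetical helper: the first material seen in each group of
    identical values is kept (False), the others are flagged (True).
    """
    groups = defaultdict(list)
    for material_id, text in materials:
        # Hash the raw value so that only byte-identical texts share a group.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(material_id)

    flags = {}
    for ids in groups.values():
        flags[ids[0]] = False          # canonical copy (assumption)
        for duplicate_id in ids[1:]:
            flags[duplicate_id] = True  # redundant copies
    return flags


# Made-up sample data, for illustration only.
sample = [
    (1, "intro to algebra"),
    (2, "graph theory"),
    (3, "intro to algebra"),
    (4, "intro to algebra"),
]
flags = flag_exact_duplicates(sample)
```

In this sample, materials 1, 3 and 4 share one value, so two of the three are flagged. The resulting flags could then be written to the duplicate column by whatever DB client the project uses.
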
This file was used to detect all the duplicate clusters in the X5GON DB, using TF (Term Frequency) and WIKI as the metrics that determine whether a pair of documents is a duplicate.

A document pair was considered a duplicate when TF > 0.85 and WIKI > 0.95.
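
The threshold rule above can be sketched as a small predicate. This README does not define how the TF and WIKI scores are computed, so the code below makes assumptions: `tf_cosine` treats the TF score as the cosine similarity between raw term-frequency vectors, and the WIKI score is taken as a precomputed number between 0 and 1. The function and constant names are hypothetical.

```python
import math
from collections import Counter

TF_THRESHOLD = 0.85    # from the README
WIKI_THRESHOLD = 0.95  # from the README


def tf_cosine(text_a, text_b):
    """Cosine similarity between term-frequency vectors (assumed TF metric)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Counter returns 0 for terms that are missing from tf_b.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_duplicate(tf_score, wiki_score):
    """Both scores must strictly exceed their thresholds."""
    return tf_score > TF_THRESHOLD and wiki_score > WIKI_THRESHOLD
```

Requiring both conditions means a pair with near-identical wording but a low WIKI score, or the reverse, is not reported as a duplicate.
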

This contains the results obtained using the method above: the material ID of each document together with the material IDs of its detected duplicates. These results were used to plot the following graph for visual analysis.

This interactive graph was used to evaluate the results produced by the method above. Each dot represents a document, and each cluster represents a set of duplicate documents.
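
One way to turn detected duplicate pairs into the clusters shown in such a graph is to take connected components. The sketch below does this with a small union-find structure. It is not the code in graph_draw.py: the function name `cluster_duplicates` and the material IDs are made up, and treating duplication as transitive (if A duplicates B and B duplicates C, all three form one cluster) is an assumption.

```python
from collections import defaultdict


def cluster_duplicates(pairs):
    """Group material IDs into clusters, given (id_a, id_b) duplicate pairs."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for id_a, id_b in pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical material IDs, for illustration only.
pairs = [(101, 102), (102, 103), (200, 201)]
clusters = cluster_duplicates(pairs)
```

Here 101, 102 and 103 end up in one cluster because they are linked through 102, while 200 and 201 form a second cluster.
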

This graph can be generated using the graph_draw.py file. You can also use the IPython notebook for more interactive analysis; it lets you click on a node to open the respective document.

Datasets

Description	Link	Info
Results Dataset	link	Results obtained using the method above: the material ID of each document together with the material IDs of its detected duplicates.
Manually Evaluated Dataset	link	Manual evaluation of the results in the Results Dataset.

TODO

Write the script for the cron job to be run on the X5GON server to update duplicates of future OER materials.
Write the script to update the DB with a new table using the obtained results.

