This repository contains the source code and instructions for the duplicate detection service deployed in X5GON.
-
This file contains the script that updates the X5GON DB: documents that contain exactly the same values are marked as TRUE in a newly created column, duplicate (boolean).
8542 exact duplicate materials were found, covering 3230 distinct values. This implies that 5312 documents (8542 - 3230) can be disregarded as duplicates.
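The counting above can be illustrated with a small sketch. It is not the repository's script: the function name `summarize_exact_duplicates` and the input shape (a dict from material ID to content value) are assumptions made for illustration, and no database access is shown.

```python
from collections import defaultdict


def summarize_exact_duplicates(values_by_id):
    """Count exact duplicates in a {material_id: value} mapping.

    Hypothetical helper, not part of the repository. Returns a tuple:
    (materials sharing a value with at least one other material,
     distinct values among those materials,
     materials that could be disregarded if one copy per value is kept).
    """
    groups = defaultdict(list)
    for material_id, value in values_by_id.items():
        groups[value].append(material_id)

    # Only values that occur more than once form a duplicate group.
    duplicate_groups = [ids for ids in groups.values() if len(ids) > 1]

    in_groups = sum(len(ids) for ids in duplicate_groups)
    distinct_values = len(duplicate_groups)
    redundant = in_groups - distinct_values
    return in_groups, distinct_values, redundant


# Toy data: IDs 1, 2 and 4 share one value, IDs 5 and 6 share another.
sample = {
    1: "intro to algebra",
    2: "intro to algebra",
    3: "graph theory",
    4: "intro to algebra",
    5: "linear models",
    6: "linear models",
}
summary = summarize_exact_duplicates(sample)
```

The same relation gives the figure quoted above: 8542 materials in duplicate groups minus 3230 distinct values leaves 5312 redundant documents.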
-
This file was used to detect all the duplicate clusters in the X5GON DB, using TF (Term Frequency) and WIKI as metrics to determine whether a pair of documents is a duplicate.
A document pair was considered a duplicate when both thresholds were exceeded: TF > 0.85 and WIKI > 0.95.
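As a minimal sketch of the threshold rule, assuming the TF and WIKI scores for a pair have already been computed as numbers (how they are computed is not shown here, and the function name `is_duplicate_pair` is hypothetical):

```python
# Thresholds taken from the text above.
TF_THRESHOLD = 0.85
WIKI_THRESHOLD = 0.95


def is_duplicate_pair(tf_score, wiki_score):
    """Return True only when both scores strictly exceed their thresholds."""
    return tf_score > TF_THRESHOLD and wiki_score > WIKI_THRESHOLD


# Hypothetical scores for three document pairs.
both_high = is_duplicate_pair(0.90, 0.97)    # both thresholds exceeded
wiki_low = is_duplicate_pair(0.90, 0.90)     # WIKI score too low
tf_on_edge = is_duplicate_pair(0.85, 0.99)   # TF equals the threshold, so not exceeded
```

Requiring both conditions means that a high score on one metric alone is not enough to flag a pair.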
These are the results obtained with the method above: the material ID of every document, together with the material IDs of its detected duplicates. The results were used to plot the following graph for visual analysis.
This interactive graph was used to evaluate the results produced by the method above. Each dot represents a document, and each cluster represents a set of duplicate documents.
The graph can be generated with the graph_draw.py file. For more interactive analysis, you can also use the IPython notebook, which lets you click on a node to open the corresponding document.
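One way to read the clusters in such a graph is as connected components: documents are nodes, each detected duplicate pair is an edge, and a cluster is every document reachable through those edges. The sketch below rests on that assumption and does not reproduce graph_draw.py; the function name `duplicate_clusters` and the pair format are hypothetical.

```python
def duplicate_clusters(pairs):
    """Group material IDs into clusters from a list of (id_a, id_b) pairs.

    Uses a simple union-find structure; returns a sorted list of sorted lists.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    for id_a, id_b in pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Toy pairs: 1-2 and 2-3 chain into one cluster, 7-8 form another.
clusters = duplicate_clusters([(1, 2), (2, 3), (7, 8)])
```

Note that under this reading, documents 1 and 3 land in the same cluster even though they were never compared directly, which is worth keeping in mind when evaluating clusters by hand.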
Description | Link | Info |
---|---|---|
Results Dataset | link | Results obtained with the method above: the material ID of every document, together with the material IDs of its detected duplicates |
Manually Evaluated Dataset | link | Manual evaluation of the results above |
- Write the script for the cron job that will run on the X5GON server to update the duplicates as future OER materials are added.
- Write the script to update the DB with a new table built from the obtained results.