feat: tag clustering using ML #673

Merged
merged 7 commits into from
Oct 17, 2023

Conversation

hwelsters
Contributor

Attempts to close comses/planning#125

Squashed commits and solved merge conflicts.

Summary

Perform tag clustering and gazetteering with dedupe.

Features

Tag Clustering is needed for creating the initial canonical list.

Curator commands

Created four new commands covering tag clustering and gazetteering / canonicalization.

1. curator_cluster_tags

This creates TagCluster objects, which can then be edited with curator_edit_clusters.

  • --label - Lets the curator label the training data via the console.
  • --reset - Clustering is normally blocked while an earlier clustering session is incomplete. This argument removes all existing clusters before clustering.
  • --threshold=[number from 0-1] - Sets the model threshold.
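As a rough illustration of how these three options could be parsed, here is a minimal sketch using the standard-library argparse module instead of the actual Django command class. The helper names (threshold_type, build_parser) and the default threshold of 0.5 are assumptions made for this example, not taken from the PR.

```python
import argparse


def threshold_type(value: str) -> float:
    """Parse a threshold and reject values outside the 0-1 range."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the three documented options."""
    parser = argparse.ArgumentParser(prog="curator_cluster_tags")
    parser.add_argument("--label", action="store_true",
                        help="label training data via the console")
    parser.add_argument("--reset", action="store_true",
                        help="remove all existing clusters before clustering")
    # The default of 0.5 is an assumption for illustration only.
    parser.add_argument("--threshold", type=threshold_type, default=0.5,
                        help="model threshold, a number from 0 to 1")
    return parser


args = build_parser().parse_args(["--reset", "--threshold=0.7"])
print(args.label, args.reset, args.threshold)
```

Validating the range in the type callback means an out-of-range value is rejected before any clustering work starts.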

2. curator_edit_clusters

This command lets the user edit clusters and then save the mappings to the database.
While modifying a cluster, there are five options:
(c)hange canonical tag name - Lets you change the name of the canonical tag.
(a)dd tags - Lets you add tags to the cluster.
(r)emove tags - Lets you remove tags from the cluster.
(s)ave - Saves the cluster to the database.
(f)inish - This does not save the cluster. It just means you are done changing it and are moving on. I decided not to autosave since there are cases where the user might want to discard the cluster instead of saving it.
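The menu above can be sketched as a small action dispatcher. This is only an illustration of the save-versus-finish behaviour described here: the Cluster dataclass and apply_action function are hypothetical names, and the saved flag stands in for the real database write.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a TagCluster."""
    canonical_name: str
    tags: set[str] = field(default_factory=set)
    saved: bool = False


def apply_action(cluster: Cluster, action: str, value: str = "") -> bool:
    """Apply one menu action; return True when the curator is finished."""
    if action == "f":
        return True  # finish: move on without saving
    if action == "s":
        cluster.saved = True  # stand-in for writing to the database
        return False
    if action == "c":
        cluster.canonical_name = value
    elif action == "a":
        cluster.tags.add(value)
    elif action == "r":
        cluster.tags.discard(value)
    else:
        raise ValueError(f"unknown action: {action!r}")
    cluster.saved = False  # any edit leaves unsaved changes
    return False


cluster = Cluster("agent based model", {"abm", "agent-based model"})
session = [("a", "agent based modeling"), ("r", "abm"),
           ("c", "agent-based model"), ("f", "")]
for action, value in session:
    if apply_action(cluster, action, value):
        break
```

Because (f)inish never touches the saved flag, a curator can leave a cluster behind without persisting it, which matches the no-autosave decision above.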

3. curator_map_tags

This command attempts to map each tag to a canonical tag if it is not already mapped.

  • --label - Lets the curator label the training data via the console.
  • --threshold=[number from 0-1] - Sets the model threshold.
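To show how a threshold can gate a mapping, the sketch below scores a tag against each canonical tag and accepts the best match only if it reaches the threshold. It uses difflib string similarity as a simple stand-in for the trained dedupe model, so the scores are not comparable to the real ones. The map_tag function and the sample tags are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def map_tag(tag: str, canonical_tags: list[str],
            threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring canonical tag, or None if none reaches the threshold."""
    best_name, best_score = None, 0.0
    for candidate in canonical_tags:
        # Similarity ratio in [0, 1]; a stand-in for the model's match score.
        score = SequenceMatcher(None, tag.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


canonical = ["agent-based model", "netlogo"]
print(map_tag("Agent Based Model", canonical, threshold=0.8))
print(map_tag("python", canonical, threshold=0.8))
```

Raising the threshold leaves more tags unmapped, while lowering it maps more tags at the cost of more wrong matches.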

4. curator_modify_cannon

This command lets the user modify the canonical list.

Tests

Wrote tests using the Django test framework.

hwelsters closed this Oct 6, 2023
hwelsters reopened this Oct 6, 2023
hwelsters mentioned this pull request Oct 7, 2023
@alee
Member

alee commented Oct 11, 2023

tests appear to be failing, possibly due to some missing files / paths?

======================================================================
ERROR: test_search (curator.tests.test_tag_deduplication.TestTagGazetteering)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/code/curator/tests/test_tag_deduplication.py", line 78, in test_search
    clusters = self._search()
  File "/code/curator/tests/test_tag_deduplication.py", line 86, in _search
    tag_clustering = TagGazetteer(search_threshold=0.5)
  File "/code/curator/tag_deduplication.py", line 232, in __init__
    self.deduper.train()
  File "/usr/local/lib/python3.10/dist-packages/dedupe/api.py", line 1213, in train
    self.classifier.fit(self.data_model.distances(examples), y)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 1422, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 858, in evaluate_candidates
    enumerate(candidate_params), enumerate(cv.split(X, y, groups))
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_split.py", line 808, in split
    y = check_array(y, input_name="y", ensure_2d=False, dtype=None)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 967, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

======================================================================
ERROR: test_uncertain_pairs (curator.tests.test_tag_deduplication.TestTagGazetteering)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/code/curator/tests/test_tag_deduplication.py", line 68, in test_uncertain_pairs
    tag_clustering = TagGazetteer(search_threshold=0.5)
  File "/code/curator/tag_deduplication.py", line 232, in __init__
    self.deduper.train()
  File "/usr/local/lib/python3.10/dist-packages/dedupe/api.py", line 1213, in train
    self.classifier.fit(self.data_model.distances(examples), y)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 1422, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py", line 858, in evaluate_candidates
    enumerate(candidate_params), enumerate(cv.split(X, y, groups))
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_split.py", line 808, in split
    y = check_array(y, input_name="y", ensure_2d=False, dtype=None)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 967, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

----------------------------------------------------------------------
Ran 87 tests in 179.453s
FAILED (errors=2)
Destroying test database for alias 'default'...
make: *** [Makefile:144: test] Error 1

@hwelsters
Contributor Author

hwelsters commented Oct 11, 2023

I will look into this. I ran the tests in my environment and they passed, so I think you might be right about the missing files.
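The ValueError in the traceback says the classifier was given zero labeled samples. One way to surface that earlier, with a clearer message, is to check the training data before training. The sketch below assumes the labeled pairs are stored in a JSON file with "match" and "distinct" lists; the function names and that file layout are assumptions for illustration, not the code in this PR.

```python
import json
from pathlib import Path


def validate_training_pairs(labeled: dict) -> dict:
    """Raise a clear error when there are no labeled pairs to train on."""
    total = len(labeled.get("match", [])) + len(labeled.get("distinct", []))
    if total == 0:
        raise ValueError("no labeled pairs available for training")
    return labeled


def load_training_pairs(path: Path) -> dict:
    """Load labeled pairs from disk, failing early if the file is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"training file not found: {path}")
    with path.open(encoding="utf-8") as handle:
        return validate_training_pairs(json.load(handle))
```

With a guard like this, a missing or empty training file fails at load time instead of deep inside the scikit-learn cross-validation split.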

Member

@alee alee left a comment

Good work, @hwelsters! I included a few inline comments after a preliminary review of the code. I'll add more comments later after testing the actual functionality.

Inline review comments (all resolved):
django/curator/models.py (1, outdated)
django/curator/tag_deduplication.py (3, 2 outdated)
@alee alee merged commit fd2896b into comses:main Oct 17, 2023
4 checks passed