Moving dependence from custom branch's tour_model to master's trip_model #933

Merged
merged 10 commits into from
Sep 14, 2023

Conversation

Contributor

@humbleOldSage humbleOldSage commented Aug 20, 2023

The following changes reduce e-mission-server-eval-private-data's TRB_label_assist dependence on the custom branch (hlu09's branch).

The clustering.py file (from the commit e-mission/e-mission-eval-private-data@88988d3 ) passes config['clustering_way'] to greedy_similarity_binning.py as a config parameter.

The following changes support e-mission-server-eval-private's  TRB_label_assist, reducing dependence on custom branch.
@humbleOldSage humbleOldSage changed the title Moving Dependence from tour_model to trip_model Moving dependence from custom branch's tour_model to master's trip_model Aug 20, 2023
Contributor

@shankari shankari left a comment

While this works (assuming typo is fixed), I am not sure it is the best generalization.
Note that the current implementation is extremely general - there is nothing really to indicate that this is related to trips, or that the features are related to origin/destination etc

We could theoretically pass in distances and it will still work.

With the current code structure, I think you are expected to deal with the clusteringWay further upstream. When you extract features from the trips to pass into the similar function, you would pass in only the origin, only the destination, or both (e.g. if clustering_way == 'origin', a and b would be of length 1).
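As a rough sketch of handling this upstream (the trip layout and the helper name `extract_features` are hypothetical, not the server's actual API), the feature extraction step could select coordinates by `clustering_way` so that the similarity function never needs to know about it:

```python
from typing import Dict, List


def extract_features(trip: Dict[str, List[float]], clustering_way: str) -> List[float]:
    """Return only the coordinates that the chosen clustering way should compare.

    Hypothetical helper: `trip` is assumed to hold [lon, lat] pairs under
    the keys "origin" and "destination".
    """
    if clustering_way == "origin":
        return list(trip["origin"])
    if clustering_way == "destination":
        return list(trip["destination"])
    if clustering_way == "origin-destination":
        return list(trip["origin"]) + list(trip["destination"])
    raise ValueError(f"unknown clustering_way: {clustering_way!r}")


trip = {"origin": [-122.40, 37.78], "destination": [-122.27, 37.87]}
origin_only = extract_features(trip, "origin")            # one point
both_ends = extract_features(trip, "origin-destination")  # two points
```

With this split, the comparison code only ever sees plain feature lists, which is what keeps it general.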

Reminder: you also need to add new unit tests for the new functionality

@shankari
Contributor

The full set of dependencies in the main e-mission server for the "similar" function
$ grep -rl similar emission/ --exclude=*.pyc
emission//net/api/wsgiserver2.py
emission//core/wrapper/untrackedtime.py
emission//analysis/classification/inference/Classifier.ipynb
emission//analysis/modelling/tour_model_first_only_orig/prior_unused/cluster_pipeline.py
emission//analysis/modelling/tour_model_first_only_orig/data_preprocessing.py
emission//analysis/modelling/tour_model_first_only_orig/get_scores.py
emission//analysis/modelling/tour_model_first_only_orig/similarity.py
emission//analysis/modelling/tour_model_first_only_orig/load_predict.py
emission//analysis/modelling/tour_model_first_only_orig/second_round_of_clustering.py
emission//analysis/modelling/tour_model_first_only_orig/get_request_percentage.py
emission//analysis/modelling/tour_model_first_only_orig/label_processing.py
emission//analysis/modelling/tour_model_first_only_orig/trajectory_matching/Frechet.py
emission//analysis/modelling/tour_model_first_only_orig/cluster_pipeline.py
emission//analysis/modelling/tour_model_first_only_orig/evaluation_pipeline.py
emission//analysis/modelling/tour_model_first_only_orig/cluster_groundtruth.py
emission//analysis/modelling/similarity/similarity_metric.py
emission//analysis/modelling/similarity/od_similarity.py
emission//analysis/modelling/similarity/similarity_metric_type.py
emission//analysis/modelling/tour_model/prior_unused/cluster_pipeline.py
emission//analysis/modelling/tour_model/data_preprocessing.py
emission//analysis/modelling/tour_model/get_scores.py
emission//analysis/modelling/tour_model/similarity.py
emission//analysis/modelling/tour_model/load_predict.py
emission//analysis/modelling/tour_model/second_round_of_clustering.py
emission//analysis/modelling/tour_model/get_request_percentage.py
emission//analysis/modelling/tour_model/label_processing.py
emission//analysis/modelling/tour_model/trajectory_matching/Frechet.py
emission//analysis/modelling/tour_model/cluster_pipeline.py
emission//analysis/modelling/tour_model/evaluation_pipeline.py
emission//analysis/modelling/tour_model/cluster_groundtruth.py
emission//analysis/modelling/tour_model_first_only/data_preprocessing.py
emission//analysis/modelling/tour_model_first_only/load_predict.py
emission//analysis/modelling/tour_model_first_only/evaluation_pipeline.py
emission//analysis/modelling/user_model/user_utility_model.py
emission//analysis/modelling/trip_model/model_type.py
emission//analysis/modelling/trip_model/util.py
emission//analysis/modelling/trip_model/greedy_similarity_binning.py
emission//analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py
emission//tests/analysisTests/resultTests/TestTimeGrouping.py
emission//tests/modellingTests/TestBackwardsCompat.py
emission//tests/modellingTests/TestRunGreedyIncrementalModel.py
emission//tests/modellingTests/TestRunGreedyModel.py
emission//tests/modellingTests/TestSimilarityMetric.py
emission//tests/modellingTests/TestSimilarityAux.py
emission//tests/modellingTests/TestGreedySimilarityBinning.py
emission//incomplete_tests/TestSimilarity.py

Filtering out all the tour model code, we find a lot of tests

$ grep -rl similar emission/ --exclude=*.pyc | grep -v tour_model
emission//net/api/wsgiserver2.py
emission//core/wrapper/untrackedtime.py
emission//analysis/classification/inference/Classifier.ipynb
emission//analysis/modelling/similarity/similarity_metric.py
emission//analysis/modelling/similarity/od_similarity.py
emission//analysis/modelling/similarity/similarity_metric_type.py
emission//analysis/modelling/user_model/user_utility_model.py
emission//analysis/modelling/trip_model/model_type.py
emission//analysis/modelling/trip_model/util.py
emission//analysis/modelling/trip_model/greedy_similarity_binning.py
emission//analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py
emission//tests/analysisTests/resultTests/TestTimeGrouping.py
emission//tests/modellingTests/TestBackwardsCompat.py
emission//tests/modellingTests/TestRunGreedyIncrementalModel.py
emission//tests/modellingTests/TestRunGreedyModel.py
emission//tests/modellingTests/TestSimilarityMetric.py
emission//tests/modellingTests/TestSimilarityAux.py
emission//tests/modellingTests/TestGreedySimilarityBinning.py
emission//incomplete_tests/TestSimilarity.py

Filtering out all the tests, we find a handful of locations

$ grep -rl similar emission/ --exclude=*.pyc | grep -v tour_model | grep -v modellingTests
emission//net/api/wsgiserver2.py
emission//core/wrapper/untrackedtime.py
emission//analysis/classification/inference/Classifier.ipynb
emission//analysis/modelling/similarity/similarity_metric.py
emission//analysis/modelling/similarity/od_similarity.py
emission//analysis/modelling/similarity/similarity_metric_type.py
emission//analysis/modelling/user_model/user_utility_model.py
emission//analysis/modelling/trip_model/model_type.py
emission//analysis/modelling/trip_model/util.py
emission//analysis/modelling/trip_model/greedy_similarity_binning.py
emission//analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py
emission//tests/analysisTests/resultTests/TestTimeGrouping.py
emission//incomplete_tests/TestSimilarity.py

@humbleOldSage
Contributor Author

humbleOldSage commented Aug 24, 2023

The grep statement above is matching 'similarity' along with 'similar'.

The results below are for similar only.

$ grep -rl -w similar Documents/GitHub/e-mission-server/emission/ --exclude=*.pyc | grep -v tour_model | grep -v modellingTests
Documents/GitHub/e-mission-server/emission//net/api/wsgiserver2.py
Documents/GitHub/e-mission-server/emission//core/wrapper/untrackedtime.py
Documents/GitHub/e-mission-server/emission//analysis/classification/inference/Classifier.ipynb
Documents/GitHub/e-mission-server/emission//analysis/modelling/similarity/similarity_metric.py
Documents/GitHub/e-mission-server/emission//analysis/modelling/user_model/user_utility_model.py
Documents/GitHub/e-mission-server/emission//analysis/modelling/trip_model/greedy_similarity_binning.py
Documents/GitHub/e-mission-server/emission//analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py
Documents/GitHub/e-mission-server/emission//tests/analysisTests/resultTests/TestTimeGrouping.py

Here, except for 4 (similarity_metric.py) and 6 (greedy_similarity_binning.py), all other matches are in comments, so there is no need to change them.

4 has the definition of the similar function, and 6 has the calls to similar from _find_matching_bin_id, which we have been improving so far. There is just one other call, from _nearest_bin in 6 itself, that we need to change. So, all in all, there are no other dependencies that we need to worry about while changing similar.
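The difference between the plain and the `-w` grep runs above comes down to substring versus whole-word matching. A minimal illustration in Python, using made-up sample lines rather than the real file contents:

```python
import re

# Invented sample lines; they only mimic the kinds of matches seen above.
lines = [
    "def similar(self, a, b, thresh):",
    "import emission.analysis.modelling.similarity.od_similarity as eamso",
    "# compute the similarity of two trips",
]

# Plain substring search also hits "similarity".
substring_hits = [line for line in lines if "similar" in line]

# A word-boundary search only hits the standalone word "similar".
whole_word_hits = [line for line in lines if re.search(r"\bsimilar\b", line)]
```

Here the substring search matches all three lines, while the word-boundary search matches only the function definition.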

@humbleOldSage humbleOldSage marked this pull request as draft August 24, 2023 04:14
Moved the `clusteringWay`-based decision making while binning further upstream, thus generalising the `similar` (in `similarity_metric.py`) and `similarity` (in `od_similarity.py`) functions. They can now be used across modules without the need for the `clusteringWay` parameter.
Comment fixes for better readability.
@humbleOldSage humbleOldSage requested a review from shankari August 24, 2023 07:06
@humbleOldSage humbleOldSage marked this pull request as ready for review August 24, 2023 07:07
@humbleOldSage
Contributor Author

Reminder: you also need to add new unit tests for the new functionality

I have already worked on a few of them. As soon as we finalize on the flow here, I'll commit them.

Contributor

@shankari shankari left a comment

Except for the minor changes, primarily around commenting, this looks fine.
Note that you also need to add unit tests for the new functionality.

MukuFlash03 pushed a commit to MukuFlash03/e-mission-server that referenced this pull request Aug 26, 2023
Implemented code for issue e-mission#933 in e-mission-docs for adding functionality to count number of documents.
I've determined that 'key' parameter can be passed to retrieve appropriate timeseries db collection.
A query is generated with optional extra_query keys list which returns filtered data set.
Tests created to confirm that the configurations for trip clustering (origin, destination and origin-destination) work as expected inside the GreedySimilarityBinning class in the `greedy_similarity_binning.py` file.

In order to upgrade old tests, `generate_mock_trips` in `modellingTestAssets.py` was also changed. Previously, out of the n trips generated, m had both origin and destination either inside or outside the threshold, thus allowing only 2 configurations. Now, 4 configurations are possible: origin only, destination only, origin-and-destination, or neither-origin-nor-destination. The default is set to 'origin-and-destination' since this was the old default.
@humbleOldSage humbleOldSage marked this pull request as draft August 31, 2023 06:11
@humbleOldSage humbleOldSage requested a review from shankari August 31, 2023 06:11
@humbleOldSage
Contributor Author

humbleOldSage commented Aug 31, 2023

Still need to check other tests dependent on modellingTestAssets.py's generate_mock_trips function. Specifically, these files :

$ grep -rl generate_mock_trips | grep -v __pycache__
./emission/tests/modellingTests/TestBackwardsCompat.py
./emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
./emission/tests/modellingTests/TestRunGreedyModel.py
./emission/tests/modellingTests/TestSimilarityMetric.py
./emission/tests/modellingTests/modellingTestAssets.py
./emission/tests/modellingTests/TestGreedySimilarityBinning.py
./.git/COMMIT_EDITMSG

Checking that `Similarity` behaves as expected when lists of size 2 (for only origin OR only destination) or size 4 (for origin AND destination) are passed.
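A sketch of what accepting both list sizes can look like, under stated assumptions: features are flat lists of [lon, lat] values in degrees, distances are haversine metres, and the function names only mirror the ones discussed in this PR (this is not the server's code):

```python
import math
from typing import List, Sequence

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(p: Sequence[float], q: Sequence[float]) -> float:
    """Great-circle distance in metres between two [lon, lat] points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def similarity(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """One distance per point: size 2 means one point, size 4 means two points."""
    if len(a) != len(b) or len(a) not in (2, 4):
        raise ValueError("expected two feature lists of equal size 2 or 4")
    return [haversine_m(a[i:i + 2], b[i:i + 2]) for i in range(0, len(a), 2)]


def similar(a: Sequence[float], b: Sequence[float], thresh: float) -> bool:
    """Similar only if every compared point is within the threshold."""
    return all(d <= thresh for d in similarity(a, b))


# Origins about 111 m apart: within a 500 m threshold.
same_origin = similar([0.0, 0.0], [0.0, 0.001], 500)
# Same origins, but destinations about 1.1 km apart: not similar.
far_destination = similar([0.0, 0.0, 1.0, 1.0], [0.0, 0.001, 1.0, 1.01], 500)
```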
@humbleOldSage humbleOldSage marked this pull request as ready for review August 31, 2023 19:29
@humbleOldSage
Contributor Author

Still need to check other tests dependent on modellingTestAssets.py's generate_mock_trips function. Specifically, these files :

This was completed with the last commit.

Contributor

@shankari shankari left a comment

Almost there! You just need to clean up and polish the tests a bit.

similarity_threshold = 500 #
# random, but, points are sampled within a circle and should always be < sim threshold
trips = etmm.generate_mock_trips('bob', 2, [0, 0], [1, 1], threshold=generate_points_thresh)
similarity_threshold = 111 #
Contributor

why did you have to change the threshold to 111? I can understand filtering for o and d, but why do you have to change the threshold?

Contributor Author

Same reason as 710d1a5#r1312355176 .

Contributor

given that we have now changed the mock trip creation code, this should still work with 500, correct?

Contributor Author

yes it does.
And it is currently set to 500 in the latest commit.

emission/tests/modellingTests/TestSimilarityMetric.py: 4 outdated review threads (resolved)
self.assertTrue(at_least_one_large_bin, "at least one bin should have at least 5 features in it")

at_least_one_large_bin = any(map(lambda b: len(b['feature_rows']) ==m, model2.bins.values()))
self.assertTrue(at_least_one_large_bin, "no bin should have more than 1 features in it")
Contributor

the messages for the assert seem to be wrong given len(b['feature_rows']) == m
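For illustration, an assertion whose message states the same condition that is being checked might look like the sketch below. The `bins` data and the value of `m` are invented for the example:

```python
m = 3  # hypothetical number of trips expected to land in one bin

# Invented stand-in for model.bins: bin id -> dict with its feature rows.
bins = {
    "0": {"feature_rows": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]},
    "1": {"feature_rows": [[5.0, 5.0]]},
}

# The condition is "some bin holds exactly m rows", so the message says exactly that.
has_bin_of_size_m = any(len(b["feature_rows"]) == m for b in bins.values())
message = f"at least one bin should have exactly {m} features in it"
assert has_bin_of_size_m, message
```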

Contributor Author

True. Will fix.

Contributor Author

This was fixed as well.

@humbleOldSage humbleOldSage marked this pull request as draft September 3, 2023 02:22
1. Improved logic based on this comment: e-mission@710d1a5#r1314065502

2. Created a utilities file for repetitive code required by multiple files.

3. Clustering threshold back to 500.

4. More in-code comments.
@humbleOldSage
Contributor Author

humbleOldSage commented Sep 7, 2023

@humbleOldSage have you tried running these tests? have you seen if they are run by runAllTests?

yes. But I ran each of them individually and they run perfectly.

I figured out that these tests are invisible to the command:

PYTHONPATH=. python -m unittest discover -v emission/tests/modellingTests/

because of the names of the file, which is, Test*.py .
To be discoverable, they need to take the form test_*.py and then the above command would run.

Should I rename them?

@shankari
Contributor

shankari commented Sep 7, 2023

because of the names of the file, which is, Test*.py .
To be discoverable, they need to take the form test_*.py and then the above command would run.

I am not sure why you are focused on python -m unittest discover -v emission/tests/modellingTests/ - that is not what we run in runAllTests.sh, which is the script that launches all automated tests

Also, the files in (say) emission/tests/analysisTests/intakeTests all start with Test and are run successfully

@humbleOldSage
Contributor Author

Got it.

So, modifying the runAllTests.sh file from its initial command:

  PYTHONPATH=. python -m unittest discover -s emission/tests -p Test*;

to

 PYTHONPATH=. python -m unittest discover -s emission/tests/modellingTests -p Test*;

runs all 21 tests without failure on the local machine. (I changed this so that we don't have to run all the other tests and can focus just on the modelling ones.)
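For context, `python -m unittest discover` matches file names against the `-p` pattern, whose documented default is `test*.py`; that is why files named `Test*.py` need the explicit pattern used in the commands above wherever matching is case-sensitive. A small self-contained sketch of the same behaviour through the `unittest` API (the temporary file and class names are invented for illustration):

```python
import pathlib
import tempfile
import unittest

# Source of a throwaway test module, written to a temporary directory below.
TEST_SOURCE = '''
import unittest


class TestExample(unittest.TestCase):
    def test_passes(self):
        self.assertTrue(True)
'''


def count_discovered(start_dir: str, pattern: str) -> int:
    """Count the test cases that discovery finds for a file-name pattern."""
    suite = unittest.TestLoader().discover(start_dir, pattern=pattern)
    return suite.countTestCases()


with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "TestExample.py").write_text(TEST_SOURCE)
    # Default pattern: a capitalised file name is not matched on case-sensitive platforms.
    found_default = count_discovered(tmp, "test*.py")
    # Explicit pattern, equivalent to passing -p 'Test*.py' on the command line.
    found_custom = count_discovered(tmp, "Test*.py")
```

Quoting the pattern on the shell command line (`-p 'Test*'`) also avoids the shell expanding it against files in the current directory.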

@shankari
Contributor

shankari commented Sep 7, 2023

runs all 21 tests without failure on the local machine. (I changed this so that we don't have to run all the other tests and can focus just on the modelling ones.)

That's good, and you should list that in the "testing done". However, we should also make sure that the tests run as part of runAllTests.sh so that they are run as part of the CI. If you run runAllTests.sh - does it run them or not?

@humbleOldSage
Contributor Author

humbleOldSage commented Sep 7, 2023

When I run runAllTests.sh, this is what it shows me just in the beginning:

analysis.trip_model.conf.json not configured, falling back to sample, default configuration
expectations.conf.json not configured, falling back to sample, default configuration
ERROR:root:habitica not configured, game functions not supported
Traceback (most recent call last):
  File "/Users/ssaini/Documents/GitHub/e-mission-server/emission/net/ext_service/habitica/proxy.py", line 22, in <module>
    key_file = open('conf/net/ext_service/habitica.json')
FileNotFoundError: [Errno 2] No such file or directory: 'conf/net/ext_service/habitica.json'
Finished configuring logging for <RootLogger root (WARNING)>
WARNING:root:No user defined overrides for key config/sensor_config and user 4f1fc8f0-4d7d-4d3b-8081-0e4b3ffc9b67, early return
WARNING:root:No user defined overrides for key config/sync_config and user 4f1fc8f0-4d7d-4d3b-8081-0e4b3ffc9b67, early return
WARNING:root:No user defined overrides for key config/consent and user 4f1fc8f0-4d7d-4d3b-8081-0e4b3ffc9b67, early return
.WARNING:root:No user defined overrides for key config/sensor_config and user 2ee2248e-9520-40a6-a54e-7af2d8aa97f3, early return
WARNING:root:No user defined overrides for key config/sync_config and user 2ee2248e-9520-40a6-a54e-7af2d8aa97f3, early return
WARNING:root:No user defined overrides for key config/consent and user 2ee2248e-9520-40a6-a54e-7af2d8aa97f3, early return
.WARNING:root:No user defined overrides for key config/sync_config and user b14e823f-1a9c-460d-815a-2aefd21dd71f, early return
WARNING:root:No user defined overrides for key config/consent and user b14e823f-1a9c-460d-815a-2aefd21dd71f, early return
.WARNING:root:No user defined overrides for key config/sync_config and user a435e378-c92a-4653-8540-b51f44b635ae, early return
WARNING:root:No user defined overrides for key config/consent and user a435e378-c92a-4653-8540-b51f44b635ae, early return
.Setting up real example for 97779ee3-004d-4ffa-b7d3-a2db3ab112d7

and then it gets stuck there, with no further output. The longest I waited was 30 minutes. I will leave this running overnight to see if it progresses.

@shankari
Contributor

shankari commented Sep 7, 2023

Do you still have the full dataset loaded? That is going to slow down the tests. I would suggest shutting down this DB (assuming you have the data stored on a persistent volume) and starting a new blank one for better performance.

@humbleOldSage
Contributor Author

Ok. I'll do this.
Yeah, it's persistent now.

Random trips are now generated like this:

If certain trips are to be binned together (by 'o', 'd', 'od' or '__' (meaning NONE)), they are generated in proximity of the previous in-bin trip. Otherwise, if they are not to be binned together, we keep generating random trips until we find one that would not bin with the previously accepted trips.
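The generation strategy described above can be sketched as follows. This is a simplified illustration, not the code in `modellingTestAssets.py`: points are planar (x, y) tuples in arbitrary units, and the names `sample_near`, `sample_far`, `extent` and `max_tries` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def sample_near(prev: Point, threshold: float, rng: random.Random) -> Point:
    """In-bin case: jitter each coordinate by at most threshold / 2, so the
    result is at most threshold / sqrt(2) away from the previous point."""
    half = threshold / 2
    return (prev[0] + rng.uniform(-half, half), prev[1] + rng.uniform(-half, half))


def sample_far(accepted: List[Point], threshold: float, rng: random.Random,
               extent: float = 10.0, max_tries: int = 1000) -> Point:
    """Out-of-bin case: resample until the candidate is farther than
    `threshold` from every previously accepted point."""
    for _ in range(max_tries):
        candidate = (rng.uniform(-extent, extent), rng.uniform(-extent, extent))
        if all(math.dist(candidate, p) > threshold for p in accepted):
            return candidate
    raise RuntimeError("could not find a point outside every accepted bin")


rng = random.Random(0)
far_points: List[Point] = []
for _ in range(5):
    far_points.append(sample_far(far_points, threshold=1.0, rng=rng))
near_point = sample_near(far_points[0], threshold=1.0, rng=rng)
```

The `max_tries` cap is a design choice of this sketch: it turns an impossible request (for example, a threshold larger than the sampling area) into an error instead of an endless loop.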
@humbleOldSage humbleOldSage marked this pull request as draft September 8, 2023 22:10
@shankari
Contributor

shankari commented Sep 8, 2023

Ah the modellingTests are enabled! But alas, they are failing.
Did you test before sending the change for review?

@humbleOldSage
Contributor Author

humbleOldSage commented Sep 8, 2023

I did, on just one test file, which passed. I think the logic works. But I wanted to run the logic through you first and then handle all the dependencies.

@shankari
Contributor

shankari commented Sep 8, 2023

If this is the case, please indicate testing done and that the tests are expected to fail in the commit and before sending the PR for review

@humbleOldSage
Contributor Author

Yeah. I did move the PR to draft for this reason, but I'll do this as well moving forward.

Contributor

@shankari shankari left a comment

See only one conceptual clarification.
Others are software engineering fixes and requests for comments.

emission/tests/modellingTests/modellingTestAssets.py: 5 outdated review threads (resolved)
`od_similarity.py`
1. Explicitly passing 'origin', 'destination', 'origin-destination' for the similarity check in `similarity`.

`similarity_metric.py`
2. Passing the clustering_way parameter.

`greedy_similarity_binning.py`
3. Since this decision making has moved downstream to `similarity`, removing it from here.

`modellingTestAssets.py`
4. Removing both 2-line wrappers (SetModelConfig, setTripConfig) from this file, since this was parametrised using subTest 2 commits back.

5. Removed CalDistanceTest. This was introduced to keep the calDistance used by the tests separate from the calDistance used by `greedySimilaritybinning`. Unnecessary.

6. Using ref. coordinates whenever provided to generate trip coordinates. If not, using randomly generated coordinates as reference points.

7. Receiving and passing origin and destination ref. points in `generate_mock_trips`.

`TestGreedySimilarityBinning.py`

8. Removed wrappers for trip and model generation.

9. Using just a single threshold for generating trips and for binning. Removed the two thresholds.

`TestSimilarityMetric.py`

10. Removing the implicitness used in binning by passing this as a parameter.
@humbleOldSage humbleOldSage marked this pull request as ready for review September 12, 2023 16:28
@humbleOldSage
Contributor Author

Final test output from runAllTests.sh:

[Screenshot: Screen Shot 2023-09-12 at 11 33 47 AM]

Generating random points from a circle (rather than a square) around ref_points.

Better explanations for random point generation.

Whitespace fixes.
Comment on lines 38 to 39
#This basically gives a way to sample a point from within a square of length thresholdInWGS84
# around the ref. point.
Contributor

nit: (future fix) fix the comment to reflect "the circle" instead of "the square"

Contributor Author

fixed.
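For readers following the square-versus-circle point, a minimal sketch of sampling uniformly from a circle around a reference point is given below. It treats longitude and latitude degrees as planar coordinates, which is a simplification, and the name `sample_in_circle` and the radius value are hypothetical:

```python
import math
import random
from typing import List, Sequence


def sample_in_circle(ref: Sequence[float], radius: float, rng: random.Random) -> List[float]:
    """Sample a point uniformly (by area) from the circle of `radius` around `ref`.

    Taking the square root of the uniform draw keeps points from clustering
    near the centre; a plain uniform radius would oversample the middle.
    """
    r = radius * math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return [ref[0] + r * math.cos(theta), ref[1] + r * math.sin(theta)]


rng = random.Random(42)
ref_point = [-122.40, 37.78]   # [lon, lat], illustrative values
radius_deg = 0.005             # hypothetical threshold expressed in degrees
samples = [sample_in_circle(ref_point, radius_deg, rng) for _ in range(1000)]
```

Unlike sampling each coordinate independently from a square, every point produced this way stays within `radius` of the reference point, including along the diagonals.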

Comments and variable names fixed
@humbleOldSage
Contributor Author

humbleOldSage commented Sep 14, 2023

Didn't test. There isn't anything changed that could cause a failure.

similarOD = metric.similar(trip0_coords, trip1_coords, similarity_threshold, cw)
# Since both origin and destination points lie within threshold limits, they should be similar
# when we check by just origin or just destination or both origin-and-destination
self.assertTrue(similarOD)
Contributor

nit: (future fix) if this fails, there is no message which indicates why. and given that it is an assertTrue, you won't even get the autogenerated message.

IsSimilar = metric.similar(trip0_coord,trip1_coord, similarity_threshold,cw)
# Two trips with neither origin nor destination coordinates within the threshold
# must not be similar by any configuration of similarity testing.
self.assertFalse(IsSimilar)
Contributor

ditto


for cw in parameters:
    with self.subTest(clustering_way=cw):
        IsSimilar = metric.similar(trip0_coord,trip1_coord, similarity_threshold,cw)
Contributor

nit: (future fix) also IsSimilar -> isSimilar
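One way the nits above could be addressed is sketched here: each `subTest` carries a failure message that names the clustering way and the inputs. The `similar` function is a hypothetical stand-in for `metric.similar`, and the coordinates are invented:

```python
import unittest


def similar(a, b, thresh):
    """Hypothetical stand-in for metric.similar: element-wise closeness check."""
    return all(abs(x - y) <= thresh for x, y in zip(a, b))


class TestSimilarMessages(unittest.TestCase):
    def test_similar_reports_context(self):
        # clustering way -> (features of trip 0, features of trip 1)
        cases = {
            "origin": ([0, 0], [0, 1]),
            "destination": ([5, 5], [5, 6]),
            "origin-destination": ([0, 0, 5, 5], [0, 1, 5, 6]),
        }
        for cw, (a, b) in cases.items():
            with self.subTest(clustering_way=cw):
                is_similar = similar(a, b, thresh=2)
                # The message tells the reader which case failed and with what data.
                self.assertTrue(
                    is_similar,
                    f"expected trips to be similar for clustering_way={cw!r}: {a} vs {b}",
                )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSimilarMessages)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```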

@shankari
Contributor

I have some small cleanup fixes that you can address in the next PR. I plan to merge this once all the tests are done.

@shankari
Contributor

@humbleOldSage I expect you to address the cleanup comments in a subsequent PR

@shankari shankari merged commit 55704fc into e-mission:master Sep 14, 2023
5 checks passed
2 participants