-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving dependence from custom branch's tour_model to master's trip_model #933
Moving dependence from custom branch's tour_model to master's trip_model #933
Conversation
The following changes support e-mission-server-eval-private's TRB_label_assist, reducing dependence on custom branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this works (assuming typo is fixed), I am not sure it is the best generalization.
Note that the current implementation is extremely general - there is nothing really to indicate that this is related to trips, or that the features are related to origin/destination etc
We could theoretically pass in distances and it will still work.
With the current code structure, I think you are expected to deal with the clusteringWay
more upstream. When you extract features from the trips to pass into the similar function, you would pass in only the origin or the destination or both (e.g. if clustering_way == 'origin'
, a
and b
would be of length 1.
Reminder: you also need to add new unit tests for the new functionality
The full set of dependencies in the main e-mission server for the "similar" function
Filtering out all the tour model code, we find a lot of tests
Filtering out all the tests, we find a handful of locations
|
The grep statement above is matching 'similarity' along with 'similar'. The results below are for similar only.
Here, except for 4 ( 4 has the definition of |
Moved the `clusteringWay` based decision making while binning further upstream, thus generalising `similar` (in `similarity_metrics.py`) and `similarity` ( in `od_similarity.py`) functions. Can now be used across modules without the need for `clusteringWay` parameter.
Comment fixes for better readability.
emission/analysis/modelling/trip_model/greedy_similarity_binning.py
Outdated
Show resolved
Hide resolved
I have already worked on a few of them. As soon as we finalize on the flow here, I'll commit them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except for the minor changes, primarily around commenting, this looks fine.
Note that you also need to add unit tests for the new functionality.
emission/analysis/modelling/trip_model/greedy_similarity_binning.py
Outdated
Show resolved
Hide resolved
Implemented code for issue e-mission#933 in e-mission-docs for adding functionality to count number of documents. I've determined that 'key' parameter can be passed to retrieve appropriate timeseries db collection. A query is generated with optional extra_query keys list which returns filtered data set.
Tests created to confirm configuration for trip clustering (origin, destination and origin-destination) work as expected inside the GreedySimilarityBinning class in `greedy_similarity_binning.py` file. In order to upgrade old tests, `generate_mock_trips` in `modellingTestAssets.py` was also changed. Previously, out of the n trips generated, m had both origin and destination either inside or outside threshold,thus allowing only 2 configs. Now, 4 configurations are possible, one among origin OR destination OR origin-and-destination or Neither-origin-nor-destination. Default is set to 'origin-and-destination' since this was the old default.
Still need to check other tests dependent on
|
Checking `Similarity` behaves as expected when list of size 2 ( for only origin OR only destination ) or size 4 (for origin AND destination) are passed.
This was completed with the last commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there! You just need to clean up and polish the tests a bit.
similarity_threshold = 500 # | ||
# random, but, points are sampled within a circle and should always be < sim threshold | ||
trips = etmm.generate_mock_trips('bob', 2, [0, 0], [1, 1], threshold=generate_points_thresh) | ||
similarity_threshold = 111 # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you have to change the threshold to 111? I can understand filtering for o and d, but why do you have to change the threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as 710d1a5#r1312355176 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we have now changed the mock trip creation code, this should still work with 500, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it does.
And it is currently set to 500 in the latest commit.
self.assertTrue(at_least_one_large_bin, "at least one bin should have at least 5 features in it") | ||
|
||
at_least_one_large_bin = any(map(lambda b: len(b['feature_rows']) ==m, model2.bins.values())) | ||
self.assertTrue(at_least_one_large_bin, "no bin should have more than 1 features in it") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the messages for the assert seem to be wrong given len(b['feature_rows']) == m
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was fixed as well.
1. improved logic based on this comment . e-mission@710d1a5#r1314065502 2.Created a utilities file for repetitive code required by multiple files. 3. clustering threshold back to 500 4. More in-code comments.
yes. But I ran each of them individually and they run perfectly. I figured that these tests are invisible to the command :
because of the names of the file, which is, Should I rename them ? |
I am not sure why you are focused on Also, the files in (say) |
Got it . So, modifying
to
runs all 21 tests without failure on the local machine.( I changed this so that we don't have to run all other test and focus just on modelling ones ). |
That's good, and you should list that in the "testing done". However, we should also make sure that the tests run as part of |
When I run
and then it gets stuck there . NO output. The maximum I waited for this was 30 mins. Will leave this overnight to see if it progresses. |
Do you still have the full dataset loaded? That is going to slow down the tests. I would suggest shutting down this DB (assuming you have the data stored on a persistent volume) and starting a new blank one for better performance. |
Ok. I'll do this. |
Random trips are now generated like this : if certain trips is are to be binned together ( by 'o','d' or 'od' or '__' (meaning NONE)) they are generated in proximity of the previous in-bin trip. Otherwise, if they are not to be binned together, we keep generating a random trip unless we find one that would not bin with previously accepted trips.
Ah the modellingTests are enabled! But alas, they are failing. |
I did, on just one Testfile which passed. I think the logic works.But I wanted to run the logic though you first and then handle all the dependencies. |
If this is the case, please indicate testing done and that the tests are expected to fail in the commit and before sending the PR for review |
yeah .I did move the PR to draft for this reason, but I'll do this as well moving forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See only one conceptual clarification.
Others are software engineering fixes and requests for comments.
`od_similarity.py` 1. Explicitly passing 'origin', 'destination', 'origin-destination' for similarity check in `similarity` `similarity_metric.py` 2. Passing the clustering_way parameter `greedy_similarity_binning.py` 3. Since this decision making is moved downstream to `similarity`, so removing it from here. `modellingTestAssets.py` 4. Removing both 2 line wrappers (SetModelConfig, setTripConfig ) from this file since this was parametrised using sub-Test 2 commits back. 5. Removed CalDistanceTest. This was introduced to keep calDistance of test separate from the calDistance being used by the one being used by `greedySimilaritybinning`. Unnecesary. 6. Using ref. coordinates whenever provided to generate trip coordinates. If not, use randomly generated coordinates as reference points. 7. receiving and passing origin and destination ref. points. in `generate_mock_trips' `TestGreedySimilarityBinning.py` 8. removed wrappers for trip and model generation. 9. Using just single threshold for generating trips and for binning. Removed two thresholds. `TestSimilarityMetric.py` 10. Removing the implicitness used in binning by passing this as a parameter.
Generating Random points from circle ( rather than Square) around ref_points. Better Explanations for random point generation. Whitespace fixes.
#This basically gives a way to sample a point from within a square of length thresholdInWGS84 | ||
# around the ref. point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (future fix) fix the comment to reflect "the circle" instead of "the square"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Comments and variable names fixed
Didn't test. there isn't anything changed that could cause a failure. |
similarOD = metric.similar(trip0_coords,trip1_coords, similarity_threshold,cw) | ||
# Since both origin and destination poitns lie within threshold limits,they should be similar | ||
# when we check by just origin or just destination or both origin-and-destination | ||
self.assertTrue(similarOD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (future fix) if this fails, there is no message which indicates why. and given that it is an assertTrue
, you won't even get the autogenerated message.
IsSimilar = metric.similar(trip0_coord,trip1_coord, similarity_threshold,cw) | ||
# Two trips with neither origin nor destination coordinates within the threshold | ||
# must not be similar by any configuration of similarity testing. | ||
self.assertFalse(IsSimilar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
|
||
for cw in parameters: | ||
with self.subTest(clustering_way=cw): | ||
IsSimilar = metric.similar(trip0_coord,trip1_coord, similarity_threshold,cw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (future fix) also IsSimilar
-> isSimilar
I have some small cleanup fixes that you can address in the next PR. I plan to merge this once all the tests are done. |
@humbleOldSage I expect you to address the cleanup comments in a subsequent PR |
The following changes reduces
e-mission-server-eval-private-data
'sTRB_label_assist
dependence on custom branch (hlu09's branch).The
clustering.py
file (from the commit e-mission/e-mission-eval-private-data@88988d3 ) passes theconfig['clustering_way']
togreedy_similarity_binning.py
as config parameter.