Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cheap isFill check and add
--skip-filled-tiles
option #234Cheap isFill check and add
--skip-filled-tiles
option #234Changes from 1 commit
eb43aa3
2d924c4
cfc8731
eb4ca2a
15029c2
9236698
de672c2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbilger what do you think of using this here instead of the feature count threshold?
The only issue I'm thinking is there are currently seams where ocean polygons overlap where any tile containing the seam won't be just fills but they will be repeated. Another option there would be to merge these ocean polygons (implemented in #235)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msbarry hard to say w/o a benchmark for encoding but that seems a bit faster. I ran it for Australia twice on this branch and on the main branch. The downside is that the file size increases by almost 100MB since some opportunities to dedupe seem to be missed. So it depends on the goal whether to increase performance a bit or to reduce the file size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, part of this might that I landed #235 but didn't merge it into this branch until just now - the water polygon seams add quite a bit of size and merging them in encode adds a bit of extra time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, after rebasing the results are almost identical, so don't have any concerns then.
(it doesn't matter at all but just to mention: file size is 10MB larger with containsOnlyPolygonFills)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use num features < 5 OR containsOnlyPolygonFills. On Australia, there are ~40k unique fill tiles, but 800k unique tiles with < 5 features. It's probably worth the slight file size increase to lower the risk of a hash collision...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counted in #tileEncoder like so
To start with, I don't fully understand the motivation for replacing
tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD
bycontainsOnlyPolygonFills
. The threshold check seems simple enough. But I assume you want to avoid unnecessary hash code generation & lookups.So on the one hand: with the threshold approach we hash twice, and more than required but I think it should be fast enough, and we can remove a few more dupes (~10MB less).
On the other hand with
containsOnlyPolygonFills
we miss a bunch of dupes, and this check also comes at a cost - but w/o a benchmark it's hard to say which is better/worse - would assume both to be more or less identical overall.So I'd tend to just keep
tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD
. But if avoiding hash generation/lookup is the main motivation and since the few missed dupes are neglectable, I would go withif (compactDb && tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD && en.containsOnlyPolygonFills())
(i.e. first check threshold to avoid uneccessary #containsOnlyPolygonFills checks)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought here was that "is fill" would be a more accurate indicator for whether a tile would be repeated a lot than the feature count heuristic. For example some future tileset where there are >5 polygons overlapping for a large portion of the planet.
For Australia (16,169,927 tiles), it looks like:
So the isFill check catches 99.8% of repeated tiles and there end up being 41,350 distinct repeated tiles, but the <5 heuristic catches nearly 100% of repeated tiles, but there end up being 810,670 (20x more) distinct tiles. So I guess it's a tradeoff of whether we'd rather get the additional 0.2% to lower the mbtiles size slightly or if we want to minimize the chances of a hash collision by deduping 20x fewer tile hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for the additional explanation and crunching more numbers.
So given that the "is fill"-check should be cheap&fast like you said, and
and
and
are all very good arguments to me to go with your suggestion to use "is fill" instead - and keep the #features out of the equation entirely here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks, I was trying to decide whether
isFill() || numFeatures() <= N
would be better to catch more repeated tiles, but even just using the lowest threshold of 1 increases # distinct tiles being hashed to 267k (6-7x more) and only catches 17% of the remaining tiles, so it's probably not worth it.If we wanted to catch more of the repeated tiles, it would probably be better to expand the isFill logic to also identify if the tile contains only edge of a rectangle or segment of a vertical/horizontal line (for example if a tileset includes latitude/longitude graticules).
I double checked, visualvm shows 100ms on my macbook for australia and it's only taking ~100ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the "isEdge" check on lines and rectangle catches almost all of the other repeated tiles and only increases # distinct tiles to track to 53,491 (30% increase), so I'll add that in to be safe and cover future cases with graticules.