Minor optimizations and fixes #130
Conversation
I don't expect a major performance improvement from these little changes, these are small nits. To get a major improvement, I think iterating over documents in a single pass (using `etree.iterparse` instead of lots of `etree.xpath` calls) would be the way to go.

Details:

- add `utils.unifiqy_list()` to factor out `list(OrderedDict.fromkeys())`. It turns out that we don't need `OrderedDict` for this anymore since Python 3.6, and a plain `dict` is much faster.
- use generators in `extract_comments()` instead of building lists
- use sets instead of lists for `MANUALLY_STRIPPED` and `MANUALLY_CLEANED`: calling `.remove()` on a list is slow (`O(n)`), so a set seemed more appropriate. It probably doesn't make a difference given the size of these sequences, though.
- pass `*seq` in `etree.strip_tags(tree, seq)`: it seems to work with a list too, but it shouldn't according to the method signature.

Additional notes:

Note on the uniquify benchmark, I used this simple benchmark:

```python
import random
import time

seq = [random.randrange(0, 100000) for i in range(1000000)]

t = time.time()
dedup = list(dict.fromkeys(seq))
print(time.time() - t)
```

to compare `OrderedDict` and `dict`, and also tried using lists with random strings, inspired by this code: https://gist.github.com/peterbe/67b9e40af60a1d5bcb1cfb4b2937b088
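A minimal sketch of the helper described above (the name `unifiqy_list` comes from the PR; the exact signature and docstring here are assumptions): since Python 3.6, plain dicts preserve insertion order, so `dict.fromkeys` deduplicates while keeping first-seen order, making `OrderedDict` unnecessary.

```python
from collections import OrderedDict

def unifiqy_list(seq):
    """Remove duplicates from a sequence while keeping first-seen order."""
    # Plain dicts keep insertion order since Python 3.6 (guaranteed in 3.7),
    # so OrderedDict is no longer needed and dict.fromkeys() is faster.
    return list(dict.fromkeys(seq))

print(unifiqy_list([3, 1, 3, 2, 1]))                       # [3, 1, 2]
print(list(OrderedDict.fromkeys([3, 1, 3, 2, 1])))         # same result, slower
```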
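To illustrate the set-vs-list point (the element names below are hypothetical; the real `MANUALLY_CLEANED` contents live in the project source): `.remove()` on a set is a hash lookup, roughly O(1), while on a list it scans elements linearly, O(n), and both expose the same call syntax.

```python
# Hypothetical contents, for illustration only
MANUALLY_CLEANED = {"aside", "embed", "footer", "form", "head"}

# set.remove() deletes via an O(1) hash lookup
MANUALLY_CLEANED.remove("footer")
print("footer" in MANUALLY_CLEANED)  # False

# The list equivalent must linearly search for the value first: O(n)
as_list = ["aside", "embed", "form", "head"]
as_list.remove("embed")
print(as_list)  # ['aside', 'form', 'head']
```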
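And a minimal sketch of the generator change (the filtering logic is invented for illustration and is not the real `extract_comments()` code): returning a generator expression yields items lazily instead of materializing an intermediate list, so callers that stop early never pay for the full pass.

```python
def extract_texts(elements):
    # Yield cleaned items lazily instead of accumulating them in a list
    return (el.strip() for el in elements if el and el.strip())

texts = extract_texts(["  hello ", "", "  ", "world"])
print(list(texts))  # ['hello', 'world']
```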
This Sourcery bot is quite invasive! Re-opening a PR with my code 😅
Sourcery Code Quality Report

✅ Merging this PR will increase code quality in the affected files by 0.10%.

Here are some functions in these files that still need a tune-up:

Legend and Explanation

The emojis denote the absolute quality of the code. The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request. Please see our documentation here for details on how these metrics are calculated. We are actively working on this report - lots more documentation and extra metrics to come! Help us improve this quality report!
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #130      +/-   ##
==========================================
+ Coverage   94.51%   94.56%   +0.05%
==========================================
  Files          19       19
  Lines        2660     2667       +7
==========================================
+ Hits         2514     2522       +8
+ Misses        146      145       -1
```

Continue to review full report at Codecov.
Yes, the Sourcery bot was too much; I switched it off and will now close PR #131.
Thanks!