Minor optimizations and fixes #130

Merged
merged 7 commits into from
Nov 1, 2021

vbarbaresi
Contributor

Description

I don't expect a major performance improvement from these little changes; they are small nits.

To get a major improvement, I think iterating over documents in a single pass (using `etree.iterparse` instead of many `etree.xpath` calls) would be the way to go.
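
A minimal sketch of the single-pass idea, not trafilatura's code. It uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for `lxml.etree.iterparse`, and collects several kinds of elements in one traversal instead of running one query per tag. The function name `collect_in_one_pass`, the tag names and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_in_one_pass(source, wanted_tags):
    """Walk the document once and group the text of the wanted tags."""
    found = defaultdict(list)
    # "end" events fire once an element is fully parsed, so its text is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in wanted_tags:
            found[elem.tag].append((elem.text or "").strip())
    return dict(found)


# Hypothetical sample document, small enough to read at a glance.
SAMPLE = b"<doc><head>Title</head><p>First</p><p>Second</p><note>skip</note></doc>"
result = collect_in_one_pass(io.BytesIO(SAMPLE), {"head", "p"})
print(result)
```

Whether this pays off in practice would depend on how many separate queries it replaces, so it would need to be measured on real documents.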

Details

  • add `utils.uniquify_list()` to factor out `list(OrderedDict.fromkeys())`
    It turns out that we no longer need `OrderedDict` for this: a plain `dict` preserves insertion order since Python 3.6 (guaranteed by the language from 3.7) and is much faster

  • use generators in `extract_comments()` instead of building lists

  • use sets instead of lists for `MANUALLY_STRIPPED` and `MANUALLY_CLEANED`
    Using `.remove()` on a list is slow (`O(n)`), so a set seemed more appropriate. It probably doesn't make a difference given the size of these sequences, though.

  • pass `*seq` in `etree.strip_tags(tree, seq)`
    It seems to work with a list too, but it shouldn't according to the method signature
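
The points in this list can be illustrated with a short sketch. It is an assumption-laden example rather than the code from this PR: `uniquify_list` is written here from the description above, `strip_tags_demo` is a hypothetical stand-in that only mimics a `*tag_names` signature, and the tag values are invented.

```python
from collections import OrderedDict


def uniquify_list(items):
    """Remove duplicates while keeping first-seen order (dicts keep insertion order)."""
    return list(dict.fromkeys(items))


def strip_tags_demo(tree, *tag_names):
    """Hypothetical stand-in for a function declared with *tag_names."""
    return tag_names


tags = ["p", "div", "p", "span", "div"]
unique = uniquify_list(tags)
# Same result as the older OrderedDict idiom.
assert unique == list(OrderedDict.fromkeys(tags))

# Removing from a set is a hash lookup; list.remove() scans from the start.
to_strip = set(unique)
to_strip.discard("span")  # discard() raises no error if the item is absent

# Unpacking passes each tag as its own positional argument...
unpacked = strip_tags_demo(None, *unique)
# ...whereas passing the list itself delivers one argument that is a list.
as_one_list = strip_tags_demo(None, unique)
print(unpacked, as_one_list)
```

The last two calls show why `*seq` matches a variadic signature more closely: without the star, the callee receives a single list instead of individual tag names.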

Additional notes

Note on the uniquify benchmark: I used this simple script:

```python
import random
import time

# One million integers drawn from 100,000 possible values, so duplicates are common
seq = [random.randrange(0, 100000) for _ in range(1000000)]

# Time a single order-preserving deduplication pass
t = time.time()
dedup = list(dict.fromkeys(seq))
print(time.time() - t)
```

to compare `OrderedDict` and `dict`. I also tried lists of random strings, inspired by this code: https://gist.github.com/peterbe/67b9e40af60a1d5bcb1cfb4b2937b088
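
For a side-by-side comparison, a variant along these lines could be used. It is only a sketch: the helper names `dedup_dict` and `dedup_ordereddict`, the smaller input size and the repeat counts are assumptions chosen to keep the run short, and no timings are claimed here since they depend on the machine and Python version.

```python
import random
import timeit
from collections import OrderedDict

random.seed(0)  # reproducible input
seq = [random.randrange(0, 100_000) for _ in range(200_000)]


def dedup_dict(items):
    return list(dict.fromkeys(items))


def dedup_ordereddict(items):
    return list(OrderedDict.fromkeys(items))


# Both approaches must agree before their timings are worth comparing.
assert dedup_dict(seq) == dedup_ordereddict(seq)

for func in (dedup_dict, dedup_ordereddict):
    # The minimum of several repeats is less sensitive to background noise.
    best = min(timeit.repeat(lambda: func(seq), number=3, repeat=3))
    print(f"{func.__name__}: {best:.4f}s for 3 runs")
```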
@vbarbaresi
Contributor Author

This Sourcery bot is quite invasive! Re-opening a PR with my code 😅

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 31, 2021

Sourcery Code Quality Report

✅  Merging this PR will increase code quality in the affected files by 0.10%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 17.00 🙂 | 16.94 🙂 | -0.06 👍 |
| Method Length | 94.83 🙂 | 94.10 🙂 | -0.73 👍 |
| Working memory | 11.46 😞 | 11.45 😞 | -0.01 👍 |
| Quality | 50.80% 🙂 | 50.90% 🙂 | 0.10% 👍 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 2272 | 2286 | 14 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| trafilatura/cli_utils.py | 60.34% 🙂 | 60.36% 🙂 | 0.02% 👍 |
| trafilatura/core.py | 33.82% 😞 | 33.84% 😞 | 0.02% 👍 |
| trafilatura/downloads.py | 66.24% 🙂 | 66.31% 🙂 | 0.07% 👍 |
| trafilatura/external.py | 72.91% 🙂 | 72.42% 🙂 | -0.49% 👎 |
| trafilatura/htmlprocessing.py | 50.55% 🙂 | 50.37% 🙂 | -0.18% 👎 |
| trafilatura/settings.py | 78.43% ⭐ | 78.43% ⭐ | 0.00% |
| trafilatura/utils.py | 63.54% 🙂 | 64.09% 🙂 | 0.55% 👍 |

Here are some functions in these files that still need a tune-up:

| File | Function | Complexity | Length | Working Memory | Quality | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| trafilatura/core.py | bare_extraction | 39 ⛔ | 477 ⛔ | 27 ⛔ | 7.88% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| trafilatura/core.py | extract_content | 39 ⛔ | 363 ⛔ | 16 ⛔ | 15.16% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| trafilatura/core.py | handle_table | 44 ⛔ | 228 ⛔ | 19 ⛔ | 16.00% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| trafilatura/core.py | handle_paragraphs | 51 ⛔ | 291 ⛔ | 14 😞 | 16.56% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| trafilatura/core.py | compare_extraction | 23 😞 | 291 ⛔ | 22 ⛔ | 20.23% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


@codecov-commenter

codecov-commenter commented Oct 31, 2021

Codecov Report

Merging #130 (e6172b0) into master (113d73f) will increase coverage by 0.05%.
The diff coverage is 93.10%.

@@            Coverage Diff             @@
##           master     #130      +/-   ##
==========================================
+ Coverage   94.51%   94.56%   +0.05%     
==========================================
  Files          19       19              
  Lines        2660     2667       +7     
==========================================
+ Hits         2514     2522       +8     
+ Misses        146      145       -1     
| Impacted Files | Coverage Δ |
| --- | --- |
| trafilatura/utils.py | 95.69% <80.00%> (-0.44%) ⬇️ |
| trafilatura/core.py | 96.78% <83.33%> (ø) |
| trafilatura/cli_utils.py | 92.78% <100.00%> (+1.00%) ⬆️ |
| trafilatura/downloads.py | 93.02% <100.00%> (ø) |
| trafilatura/external.py | 93.10% <100.00%> (+0.08%) ⬆️ |
| trafilatura/htmlprocessing.py | 95.95% <100.00%> (ø) |
| trafilatura/settings.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@adbar adbar mentioned this pull request Nov 1, 2021
@adbar
Owner

adbar commented Nov 1, 2021

Yes, the Sourcery bot was too much, I switched it off and will now close PR #131.

@adbar
Owner

adbar commented Nov 1, 2021

Thanks!
No changes in speed or accuracy on my data, although the deletion order of HTML elements could matter.
I'm editing the PR to accept some changes and roll back others.
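
The point about deletion order can be sketched as follows. This is only an illustration under assumptions: the tag names are hypothetical and are not the actual contents of `MANUALLY_STRIPPED` or `MANUALLY_CLEANED`.

```python
# Hypothetical tag names; the real constants are not reproduced here.
stripped_in_order = ["abbr", "acronym", "span", "abbr", "em"]

# A set removes duplicates but makes no promise about iteration order.
as_set = set(stripped_in_order)

# dict.fromkeys() removes duplicates and keeps the first-seen order.
ordered_unique = list(dict.fromkeys(stripped_in_order))

print(ordered_unique)
print(sorted(as_set) == sorted(ordered_unique))  # same members either way
```

If elements have to be processed in a fixed sequence, an ordered container keeps that sequence explicit, while a set only fits when order is irrelevant.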

@adbar adbar merged commit 4e0099f into adbar:master Nov 1, 2021