Minor optimizations and fixes #130

vbarbaresi · 2021-10-31T18:48:15Z

Description

I don't expect a major performance improvement from these little changes; they are small nits.

To get a major improvement, I think iterating over documents in a single pass (using etree.iterparse instead of lots of etree.xpath calls) would be the way to go.
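
To make the single-pass idea concrete, here is a minimal sketch. It is not trafilatura code: the document, the tag names and both function names are made up, and the standard-library xml.etree.ElementTree is used as a stand-in for lxml.etree so the snippet runs without extra dependencies.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical, well-formed sample document
DOC = (b"<html><body><h1>Title</h1><p>First</p>"
       b"<div><p>Second</p></div><h1>Other</h1></body></html>")


def collect_multi_pass(data):
    """One tree traversal per tag of interest."""
    root = ET.fromstring(data)
    return {tag: [el.text for el in root.iter(tag)] for tag in ("h1", "p")}


def collect_single_pass(data):
    """Visit each element once and dispatch on its tag."""
    found = {"h1": [], "p": []}
    for _event, el in ET.iterparse(io.BytesIO(data), events=("end",)):
        if el.tag in found:
            found[el.tag].append(el.text)
    return found


print(collect_single_pass(DOC))
```

Both functions return the same mapping here; whether a single pass actually pays off on real pages would need to be measured.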

Details

- add utils.uniqify_list() to factor out list(OrderedDict.fromkeys())
  It turns out that we don't need OrderedDict for this anymore since Python 3.6, and a simple dict is much faster
- use generators in extract_comments() instead of building lists
- use sets instead of lists for MANUALLY_STRIPPED and MANUALLY_CLEANED
  Using .remove() on a list is slow (O(n)), so a set seemed more appropriate. It probably doesn't make a difference given the size of these sequences, though.
- pass *seq to etree.strip_tags(), i.e. etree.strip_tags(tree, *seq) instead of etree.strip_tags(tree, seq)
  It seems to work with a list too, but it shouldn't according to the method signature
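
For readers who want to see the first two items in isolation, here is a rough sketch. The function bodies are assumptions rather than the code in this PR, and iter_nonempty is a purely hypothetical example of the generator pattern. Note that plain dicts keep insertion order in CPython 3.6 and that the language guarantees it from Python 3.7 on.

```python
from collections import OrderedDict


def uniquify_list(items):
    """Drop duplicates while keeping first-seen order (assumed helper body)."""
    # Plain dicts preserve insertion order, so OrderedDict is not needed.
    return list(dict.fromkeys(items))


def uniquify_list_legacy(items):
    """The older OrderedDict-based spelling of the same operation."""
    return list(OrderedDict.fromkeys(items))


def iter_nonempty(lines):
    """Hypothetical generator: yields stripped lines lazily instead of building a list."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


tags = ["p", "div", "p", "span", "div"]
print(uniquify_list(tags))
print(list(iter_nonempty([" a ", "", "b\n"])))
```

The generator only helps when the caller consumes the items once; if the result is indexed or reused, a list is still the simpler choice.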

Additional notes

Note on the uniqify benchmark: I used this simple script:

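import random
import time

# One million integers drawn from 100,000 possible values, so duplicates are frequent
seq = [random.randrange(0, 100000) for _ in range(1000000)]

# Time an order-preserving deduplication with a plain dict
t = time.time()
dedup = list(dict.fromkeys(seq))
print(time.time() - t)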

to compare OrderedDict and dict,
and also tried lists of random strings, inspired by this code: https://gist.github.com/peterbe/67b9e40af60a1d5bcb1cfb4b2937b088
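
A side-by-side version of that comparison could look like the sketch below. The function names, the fixed seed and the smaller input size are my own choices to keep the run short; they are not taken from the PR, and the timings will vary by machine and Python version.

```python
import random
import timeit
from collections import OrderedDict

random.seed(0)  # fixed seed so both variants see identical data
seq = [random.randrange(0, 1000) for _ in range(100_000)]


def with_dict():
    """Deduplicate with a plain dict."""
    return list(dict.fromkeys(seq))


def with_ordereddict():
    """Deduplicate with OrderedDict."""
    return list(OrderedDict.fromkeys(seq))


for fn in (with_dict, with_ordereddict):
    # Best of three repetitions, five calls each
    best = min(timeit.repeat(fn, number=5, repeat=3))
    print(f"{fn.__name__}: {best:.4f}s")
```

Taking the minimum over several repetitions reduces noise from other processes compared with a single time.time() measurement.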


vbarbaresi · 2021-10-31T18:51:20Z

This Sourcery bot is quite invasive! Re-opening a PR with my code 😅

sourcery-ai · 2021-10-31T19:11:15Z

Sourcery Code Quality Report

✅ Merging this PR will increase code quality in the affected files by 0.10%.

Quality metrics	Before	After	Change
Complexity	17.00 🙂	16.94 🙂	-0.06 👍
Method Length	94.83 🙂	94.10 🙂	-0.73 👍
Working memory	11.46 😞	11.45 😞	-0.01 👍
Quality	50.80% 🙂	50.90% 🙂	0.10% 👍

Other metrics	Before	After	Change
Lines	2272	2286	14

Changed files	Quality Before	Quality After	Quality Change
trafilatura/cli_utils.py	60.34% 🙂	60.36% 🙂	0.02% 👍
trafilatura/core.py	33.82% 😞	33.84% 😞	0.02% 👍
trafilatura/downloads.py	66.24% 🙂	66.31% 🙂	0.07% 👍
trafilatura/external.py	72.91% 🙂	72.42% 🙂	-0.49% 👎
trafilatura/htmlprocessing.py	50.55% 🙂	50.37% 🙂	-0.18% 👎
trafilatura/settings.py	78.43% ⭐	78.43% ⭐	0.00%
trafilatura/utils.py	63.54% 🙂	64.09% 🙂	0.55% 👍

Here are some functions in these files that still need a tune-up:

File	Function	Complexity	Length	Working Memory	Quality	Recommendation
trafilatura/core.py	bare_extraction	39 ⛔	477 ⛔	27 ⛔	7.88% ⛔	Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions
trafilatura/core.py	extract_content	39 ⛔	363 ⛔	16 ⛔	15.16% ⛔	Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions
trafilatura/core.py	handle_table	44 ⛔	228 ⛔	19 ⛔	16.00% ⛔	Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions
trafilatura/core.py	handle_paragraphs	51 ⛔	291 ⛔	14 😞	16.56% ⛔	Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions
trafilatura/core.py	compare_extraction	23 😞	291 ⛔	22 ⛔	20.23% ⛔	Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions

Legend and Explanation

The emojis denote the absolute quality of the code:

⭐ excellent
🙂 good
😞 poor
⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.

Please see our documentation here for details on how these metrics are calculated.


codecov-commenter · 2021-10-31T19:13:51Z

Codecov Report

Merging #130 (e6172b0) into master (113d73f) will increase coverage by 0.05%.
The diff coverage is 93.10%.

@@            Coverage Diff             @@
##           master     #130      +/-   ##
==========================================
+ Coverage   94.51%   94.56%   +0.05%     
==========================================
  Files          19       19              
  Lines        2660     2667       +7     
==========================================
+ Hits         2514     2522       +8     
+ Misses        146      145       -1

Impacted Files	Coverage Δ
trafilatura/utils.py	`95.69% <80.00%> (-0.44%)`	⬇️
trafilatura/core.py	`96.78% <83.33%> (ø)`
trafilatura/cli_utils.py	`92.78% <100.00%> (+1.00%)`	⬆️
trafilatura/downloads.py	`93.02% <100.00%> (ø)`
trafilatura/external.py	`93.10% <100.00%> (+0.08%)`	⬆️
trafilatura/htmlprocessing.py	`95.95% <100.00%> (ø)`
trafilatura/settings.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 113d73f...e6172b0. Read the comment docs.

adbar · 2021-11-01T12:30:29Z

Yes, the Sourcery bot was too much, I switched it off and will now close PR #131.

adbar · 2021-11-01T12:43:23Z

Thanks!
No changes in speed or accuracy on my data, although the deletion order of HTML elements could matter.
I am editing the PR to accept some changes and roll back others.
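
One way to read the ordering concern, as a small sketch with made-up tag names: a list is always traversed in the order it was written, whereas a set of strings has no guaranteed iteration order and can differ between interpreter runs because of hash randomization.

```python
# Hypothetical tag names, not the actual trafilatura settings
tags_to_delete = ["script", "style", "nav", "footer"]


def deletion_order(container):
    """Return the order in which a simple loop would process the entries."""
    return [tag for tag in container]


print(deletion_order(tags_to_delete))        # always the written order
print(deletion_order(set(tags_to_delete)))   # same members, order not guaranteed
```

If removing one element can affect what a later rule matches, only the list variant gives reproducible results.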

sourcery-ai bot mentioned this pull request Oct 31, 2021

Minor optimizations and fixes (Sourcery refactored) #131

Closed

fix set update: use an iterable otherwise it iterates on string chars

e6172b0
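
The pitfall named in this commit message can be reproduced in a few lines. The variable names and values are illustrative only, not the code that was changed.

```python
broken = set()
broken.update("div")      # a string is an iterable of characters: adds 'd', 'i', 'v'

fixed = set()
fixed.update(["div"])     # wrap the string in a list so it is added whole
fixed.add("span")         # add() is the direct way to insert a single item

print(sorted(broken))
print(sorted(fixed))
```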

adbar mentioned this pull request Nov 1, 2021

Drop support for Python 3.5 #132

Closed

adbar and others added 5 commits November 1, 2021 13:45

utils.py: imports order

4233dbe

core.py: rollback deletions to list

13ddfa0

rollback: sets to lists

021e6f3

typo: uniqify → uniquify

330640d

+ comment

c2ccb63

adbar merged commit 4e0099f into adbar:master Nov 1, 2021
