Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dev] Optimize batch fetch method to boost throughput #269

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

NiuBlibing
Copy link

@NiuBlibing NiuBlibing commented Feb 8, 2023

Description

The previous start url fetching method only working when spider is idle, which is not full concurrency.This patch optimizes it by using request_left_downloader signal.

There maybe need a lock for calculating the need_size.

Fixes #119

How Has This Been Tested?

  • tox -e py38 -- tests/

Test Configuration:

  • OS version: debian 11 and Windows10
  • Necessary Libraries (optional):

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

NiuBlibing and others added 2 commits February 8, 2023 16:23
The previous start url fetching method only working when spider
is idle, which is not full concurrency.This patch optimizes it by
using request_left_downloader signal.

Signed-off-by: Tianyue Ren <[email protected]>
@LuckyPigeon
Copy link
Collaborator

LuckyPigeon commented Jun 10, 2023

@NiuBlibing Please resolve the assertion error. And add unit test for fill_requests_queue, thanks!

@LuckyPigeon LuckyPigeon changed the title [patch v1] optimize batch fetch method to boost throughput [dev] Optimize batch fetch method to boost throughput Jun 21, 2023
@LuckyPigeon LuckyPigeon self-assigned this Jun 21, 2023
@LuckyPigeon LuckyPigeon enabled auto-merge (squash) June 21, 2023 02:27
@LuckyPigeon LuckyPigeon disabled auto-merge June 21, 2023 02:28
@LuckyPigeon LuckyPigeon enabled auto-merge (squash) June 21, 2023 02:28
@LuckyPigeon
Copy link
Collaborator

LuckyPigeon commented Jun 21, 2023

@rmax How do you think about this implementation, it disabled spider_idel usage. I wonder if we need a switch between spider_idle and fill_requests_queue.

@LuckyPigeon LuckyPigeon requested a review from rmax June 21, 2023 02:30
@rmax
Copy link
Owner

rmax commented Jun 21, 2023

@rmax How do you think about this implementation, it disabled spider_idel usage. I wonder if we need a switch between spider_idle and fill_requests_queue.

Interesting the use of the other signal. What scrapy version is required for the new signal?

What happens with existing users that override the spider_idle method?

Does it make sense to bump the major version? Or somewhat related, shall we migrate to calendar versioning?

@rmax rmax force-pushed the master branch 2 times, most recently from daccc92 to 3245d28 Compare July 6, 2024 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggesting new way to schedule requests
3 participants