WordPress-based Crawler Implementation #131

Draft · wants to merge 4 commits into base: docs/extend
Conversation

@ashfame (Member) commented Nov 27, 2024

WordPress-based Crawler Implementation

Still a Work in Progress!

Overview

Implements a web crawler using WordPress as the queue backend for resilient, resumable crawling operations.

Architecture

  • Decoupled Crawling Logic: Core crawler delegates queue management to WordPress backend via API endpoints for URL fetching and discovery storage
  • Stateless Operation: All crawl state persists in WordPress, enabling automatic resume after interruptions
  • Browser-Native Parsing: Leverages browser's HTML parser for maximum compatibility
  • Controlled Performance: Built-in rate limiting (1 req/sec default) with dynamic adjustment based on 429 responses (sketched after this list)
  • Message Bus Integration: Crawling requests routed through existing message bus infrastructure
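To make the rate-limiting behavior concrete, here is a minimal sketch of how it could work. The RateLimiter class and its method names are hypothetical, not the PR's actual API, though the nextProcessTime/delayMs bookkeeping mirrors the diff excerpt further down:

class RateLimiter {
	private delayMs: number;
	private nextProcessTime = 0;

	constructor( requestsPerSecond = 1 ) {
		this.delayMs = 1000 / requestsPerSecond;
	}

	// Resolves once the next request is allowed to go out (1 req/sec by default).
	async acquire(): Promise< void > {
		const now = Date.now();
		const waitMs = Math.max( 0, this.nextProcessTime - now );
		this.nextProcessTime = Math.max( now, this.nextProcessTime ) + this.delayMs;
		if ( waitMs > 0 ) {
			await new Promise( ( resolve ) => setTimeout( resolve, waitMs ) );
		}
	}

	// Called on a 429 response: back off by doubling the delay, capped at one minute.
	reportRateLimited(): void {
		this.delayMs = Math.min( this.delayMs * 2, 60_000 );
	}
}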

Usage Example

async function initializeCrawler(): Promise<void> {
    const crawler = new Crawler();
    // Register the callback invoked with each crawled page's HTML.
    crawler.setProcessFunction(async (html: string) => {
        console.log('Processing page HTML:', html.length);
    });
    await crawler.start();
}
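
Because all crawl state persists in WordPress rather than in the crawler process, calling start() again after an interruption picks up from the persisted queue instead of restarting from scratch.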

TODOs

  • 429 Handling
  • Have HTML for the page returned to us via our Bus
  • Implement backend endpoints

@ashfame changed the base branch from trunk to docs/extend on November 27, 2024, 20:56
this.state.nextProcessTime = now + delayMs;
}

private extractLinks( htmlString: string ): string[] {

We have similar plumbing in PHP:

	$p = new WP_Block_Markup_Url_Processor( $options['block_markup'], $options['base_url'] );
	while ( $p->next_url() ) {
		$parsed_url = $p->get_parsed_url();
		foreach ( $url_mapping as $mapping ) {
			if ( url_matches( $parsed_url, $mapping['from_url'] ) ) {
				$p->replace_base_url( $mapping['to_url'] );
				break;
			}
		}
	}

See how it also matches the domains and paths to stay within the same site. It might be handy to delegate that work to PHP.
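For comparison, here is a minimal sketch of what same-site filtering could look like on the TypeScript side, assuming extractLinks parses with the browser's DOMParser (per the Browser-Native Parsing note above); the function name and filtering rule are illustrative, not taken from the PR:

function extractSameSiteLinks( htmlString: string, baseUrl: string ): string[] {
	const doc = new DOMParser().parseFromString( htmlString, 'text/html' );
	const base = new URL( baseUrl );
	const links: string[] = [];
	for ( const anchor of doc.querySelectorAll( 'a[href]' ) ) {
		let url: URL;
		try {
			// Resolve relative hrefs against the page URL.
			url = new URL( anchor.getAttribute( 'href' )!, base );
		} catch {
			continue; // Skip malformed hrefs.
		}
		// Stay within the same site: same origin, and under the base path.
		if ( url.origin === base.origin && url.pathname.startsWith( base.pathname ) ) {
			links.push( url.href );
		}
	}
	return links;
}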

);
}

private async queueUrls(
@adamziel commented Nov 29, 2024

We're building just that in the PHP plugin! :-) There are concurrent requests, I'm exploring resource limits, and if we run it in Playground, we'll still do the downloads via fetch(), which means we benefit from authorized cookies.
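
To illustrate the kind of bounded concurrency being discussed (a sketch only; the real logic lives in the PHP plugin, and fetchAll is a made-up name):

async function fetchAll( urls: string[], maxConcurrent = 4 ): Promise< string[] > {
	const results: string[] = new Array( urls.length );
	let nextIndex = 0;

	// Each worker claims the next unfetched URL until the list is drained.
	async function worker(): Promise< void > {
		while ( nextIndex < urls.length ) {
			const i = nextIndex++;
			// credentials: 'include' sends cookies, as with Playground's fetch() downloads.
			const response = await fetch( urls[ i ], { credentials: 'include' } );
			results[ i ] = await response.text();
		}
	}

	const workerCount = Math.min( maxConcurrent, urls.length );
	await Promise.all( Array.from( { length: workerCount }, worker ) );
	return results;
}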

'supports' => $this->custom_post_types_supports,
'labels' => $this->get_post_type_registration_labels( $name, $name_plural ),
'rest_base' => $this->post_type,
register_post_status(
@adamziel commented Nov 29, 2024

It feels similar to WP_Stream_Importer, which fetches the assets and incrementally inserts WordPress entities such as posts, pages, and tags. It seems like we're duplicating efforts; let's find a way to converge and build a more general solution that supports both regular imports and what this extension needs to do.
