WordPress-based Crawler Implementation #131

Draft · wants to merge 4 commits into base: docs/extend
Conversation

@ashfame (Member) commented Nov 27, 2024

WordPress-based Crawler Implementation

Still a Work in Progress!

Overview

Implements a web crawler using WordPress as the queue backend for resilient, resumable crawling operations.

Architecture

  • Decoupled Crawling Logic: Core crawler delegates queue management to WordPress backend via API endpoints for URL fetching and discovery storage
  • Stateless Operation: All crawl state persists in WordPress, enabling automatic resume after interruptions
  • Browser-Native Parsing: Leverages browser's HTML parser for maximum compatibility
  • Controlled Performance: Built-in rate limiting (1 req/sec default) with dynamic adjustment based on 429 responses (sketched after this list)
  • Message Bus Integration: Crawling requests routed through existing message bus infrastructure
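To make the rate-limiting behavior concrete, here is a minimal sketch of how it could work. The RateLimiter class and its method names are hypothetical, not the PR's actual API, though the nextProcessTime/delayMs bookkeeping mirrors the diff excerpt further down:

class RateLimiter {
	private delayMs: number;
	private nextProcessTime = 0;

	constructor( requestsPerSecond = 1 ) {
		this.delayMs = 1000 / requestsPerSecond;
	}

	// Resolves once the next request is allowed to go out (1 req/sec by default).
	async acquire(): Promise< void > {
		const now = Date.now();
		const waitMs = Math.max( 0, this.nextProcessTime - now );
		this.nextProcessTime = Math.max( now, this.nextProcessTime ) + this.delayMs;
		if ( waitMs > 0 ) {
			await new Promise( ( resolve ) => setTimeout( resolve, waitMs ) );
		}
	}

	// Called on a 429 response: back off by doubling the delay, capped at one minute.
	reportRateLimited(): void {
		this.delayMs = Math.min( this.delayMs * 2, 60_000 );
	}
}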

Usage Example

async function initializeCrawler(): Promise<void> {
    const crawler = new Crawler();
    // Register the callback invoked with each crawled page's HTML.
    crawler.setProcessFunction(async (html: string) => {
        console.log('Processing page HTML:', html.length);
    });
    await crawler.start();
}
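
Because all crawl state persists in WordPress rather than in the crawler process, calling start() again after an interruption picks up from the persisted queue instead of restarting from scratch.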

TODOs

  • 429 Handling
  • Have HTML for the page returned to us via our Bus
  • Implement backend endpoints

@ashfame changed the base branch from trunk to docs/extend on November 27, 2024, 20:56
this.state.nextProcessTime = now + delayMs;
}

private extractLinks( htmlString: string ): string[] {

We have similar plumbing in PHP:

	$p = new WP_Block_Markup_Url_Processor( $options['block_markup'], $options['base_url'] );
	while ( $p->next_url() ) {
		$parsed_url = $p->get_parsed_url();
		foreach ( $url_mapping as $mapping ) {
			if ( url_matches( $parsed_url, $mapping['from_url'] ) ) {
				$p->replace_base_url( $mapping['to_url'] );
				break;
			}
		}
	}

See how it also matches the domains and paths to stay within the same site. It might be handy to delegate that work to PHP.
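For comparison, here is a minimal sketch of what same-site filtering could look like on the TypeScript side, assuming extractLinks parses with the browser's DOMParser (per the Browser-Native Parsing note above); the function name and filtering rule are illustrative, not taken from the PR:

function extractSameSiteLinks( htmlString: string, baseUrl: string ): string[] {
	const doc = new DOMParser().parseFromString( htmlString, 'text/html' );
	const base = new URL( baseUrl );
	const links: string[] = [];
	for ( const anchor of doc.querySelectorAll( 'a[href]' ) ) {
		let url: URL;
		try {
			// Resolve relative hrefs against the page URL.
			url = new URL( anchor.getAttribute( 'href' )!, base );
		} catch {
			continue; // Skip malformed hrefs.
		}
		// Stay within the same site: same origin, and under the base path.
		if ( url.origin === base.origin && url.pathname.startsWith( base.pathname ) ) {
			links.push( url.href );
		}
	}
	return links;
}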

);
}

private async queueUrls(
@adamziel commented Nov 29, 2024

We're building just that in the PHP plugin! :-) There are concurrent requests, I'm exploring resource limits, and if we run it in Playground, we'll still do the downloads via fetch(), which means we benefit from authorized cookies.
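
To illustrate the kind of bounded concurrency being discussed (a sketch only; the real logic lives in the PHP plugin, and fetchAll is a made-up name):

async function fetchAll( urls: string[], maxConcurrent = 4 ): Promise< string[] > {
	const results: string[] = new Array( urls.length );
	let nextIndex = 0;

	// Each worker claims the next unfetched URL until the list is drained.
	async function worker(): Promise< void > {
		while ( nextIndex < urls.length ) {
			const i = nextIndex++;
			// credentials: 'include' sends cookies, as with Playground's fetch() downloads.
			const response = await fetch( urls[ i ], { credentials: 'include' } );
			results[ i ] = await response.text();
		}
	}

	const workerCount = Math.min( maxConcurrent, urls.length );
	await Promise.all( Array.from( { length: workerCount }, worker ) );
	return results;
}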

'supports' => $this->custom_post_types_supports,
'labels' => $this->get_post_type_registration_labels( $name, $name_plural ),
'rest_base' => $this->post_type,
register_post_status(
@adamziel commented Nov 29, 2024

It feels similar to WP_Stream_Importer, which fetches the assets and incrementally inserts WordPress entities such as posts, pages, and tags. It seems like we're duplicating efforts; let's find a way to converge and build a more general solution that supports both regular imports and what this extension needs to do.
