WordPress-based Crawler Implementation #131
base: docs/extend
this.state.nextProcessTime = now + delayMs;
}

private extractLinks( htmlString: string ): string[] {
We have similar plumbing in PHP:
$p = new WP_Block_Markup_Url_Processor( $options['block_markup'], $options['base_url'] );
while ( $p->next_url() ) {
    $parsed_url = $p->get_parsed_url();
    foreach ( $url_mapping as $mapping ) {
        if ( url_matches( $parsed_url, $mapping['from_url'] ) ) {
            $p->replace_base_url( $mapping['to_url'] );
            break;
        }
    }
}
See how it also matches the domains and paths to stay within the same site. It might be handy to delegate that work to PHP.
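If the filtering stays on the TypeScript side for now, the same-site check could be sketched roughly as below. This is only an illustration of the domain-and-path matching idea, not the PHP plugin's logic: `urlMatches` and `filterSameSite` are hypothetical names, and the rule "same host, path at or below the base path" is an assumption about what staying within the same site should mean.

```typescript
// Hypothetical helper: true when `candidate` is on the same host as `base`
// and its path is the base path itself or sits below it.
function urlMatches(candidate: URL, base: URL): boolean {
  if (candidate.hostname !== base.hostname) {
    return false;
  }
  // Ignore trailing slashes so "/blog" and "/blog/" compare equal.
  const basePath = base.pathname.replace(/\/+$/, "");
  const path = candidate.pathname.replace(/\/+$/, "");
  return path === basePath || path.indexOf(basePath + "/") === 0;
}

// Hypothetical helper: resolve raw href values against the base URL,
// drop fragments and non-HTTP schemes, and keep unique same-site URLs.
function filterSameSite(hrefs: string[], baseUrl: string): string[] {
  const base = new URL(baseUrl);
  const kept: string[] = [];
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, base); // resolves relative links
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      continue; // skip mailto:, javascript:, etc.
    }
    url.hash = ""; // the fragment does not identify a separate page
    if (urlMatches(url, base) && kept.indexOf(url.href) === -1) {
      kept.push(url.href);
    }
  }
  return kept;
}
```

Something like `filterSameSite( this.extractLinks( html ), baseUrl )` would then be what feeds the queue, assuming `extractLinks` returns raw `href` values.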
);
}

private async queueUrls(
We're building just that in the PHP plugin! :-) It makes concurrent requests, I'm exploring resource limits, and if we run it in Playground, the downloads will still go through fetch(), which means we benefit from authorized cookies.
'supports' => $this->custom_post_types_supports,
'labels' => $this->get_post_type_registration_labels( $name, $name_plural ),
'rest_base' => $this->post_type, | ||
register_post_status( |
This feels similar to WP_Stream_Importer, which fetches the assets and incrementally inserts WordPress entities such as posts, pages, and tags. It seems like we're duplicating efforts – let's find a way to converge and build a more general solution that would support both regular imports and what this extension needs to do.
WordPress-based Crawler Implementation
Still a Work in Progress!
Overview
Implements a web crawler using WordPress as the queue backend for resilient, resumable crawling operations.
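To make "resumable" concrete, here is a minimal in-memory sketch of the queue semantics, under stated assumptions. It does not talk to WordPress: `ResumableQueue`, the `QueueItem` shape, and the three status names are hypothetical stand-ins for whatever posts and custom post statuses the PR actually registers. The point it illustrates is that an item left in a "processing" state by an interrupted run returns to "pending" when the queue is restored.

```typescript
// Hypothetical status names; the real ones would come from the PR's
// register_post_status() calls.
type CrawlStatus = "pending" | "processing" | "done";

interface QueueItem {
  url: string;
  status: CrawlStatus;
}

class ResumableQueue {
  private items: QueueItem[] = [];

  // Restore from a persisted snapshot. Anything still "processing" was
  // interrupted mid-crawl, so it is put back to "pending".
  constructor(snapshot: QueueItem[] = []) {
    for (const item of snapshot) {
      const status: CrawlStatus =
        item.status === "processing" ? "pending" : item.status;
      this.items.push({ url: item.url, status });
    }
  }

  private find(url: string): QueueItem | undefined {
    for (const item of this.items) {
      if (item.url === url) return item;
    }
    return undefined;
  }

  // Returns false when the URL is already known, so no page is queued twice.
  enqueue(url: string): boolean {
    if (this.find(url)) return false;
    this.items.push({ url, status: "pending" });
    return true;
  }

  // Marks the oldest pending item as "processing" and returns its URL.
  claimNext(): string | undefined {
    for (const item of this.items) {
      if (item.status === "pending") {
        item.status = "processing";
        return item.url;
      }
    }
    return undefined;
  }

  complete(url: string): void {
    const item = this.find(url);
    if (item) item.status = "done";
  }

  // Copy of the current state; a real backend would persist each change.
  snapshot(): QueueItem[] {
    return this.items.map((item) => ({ ...item }));
  }
}
```

In the WordPress-backed version each state change would presumably be a post update rather than an in-memory mutation, which is what would let a crashed or paused crawl pick up where it stopped.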
Architecture
Usage Example
TODOs