The present and how we got where we are
We currently have three implementations of Spouts in the Elasticsearch module (in chronological order):

- AggregationSpout
- CollapsingSpout
- HybridSpout
There is also the ScrollSpout, but its purpose is a bit different: it helps with reindexing the content of a status index and is not meant to be used for crawling.
All three implementations use Elasticsearch's sharding. When using more than one spout task, each one communicates exclusively with one shard of the status index. This gives the crawler a good diversity of sources, as the URLs for a given hostname (or domain, or IP) are all routed to the same shard. From a politeness point of view, it also guarantees that the crawler will not get too many URLs from the same source, so that they won't queue up in the Fetcher bolt unnecessarily. On the downside, routing the URLs to shards based on their hostnames (or worse, by pay-level domain) can lead to shards of uneven sizes, with some of them reaching gigantic proportions. This can cause problems with disk space on the Elastic nodes and trigger shard reallocations. Not fun.
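The routing described above can be sketched as follows. This is an illustrative Python sketch, not StormCrawler's actual code: the function derives a routing value from a URL so that all documents from the same source land on the same shard of the status index. The mode names (`byHost`, `byDomain`) and the naive last-two-labels domain extraction are assumptions for the example; a real implementation would use the public suffix list for pay-level domains.

```python
from urllib.parse import urlparse

def routing_key(url: str, partition_mode: str = "byHost") -> str:
    """Derive an Elasticsearch routing value for a URL so that all URLs
    from the same source are indexed into the same shard.
    partition_mode names are illustrative, not actual config keys."""
    host = urlparse(url).hostname or ""
    if partition_mode == "byDomain":
        # naive pay-level-domain extraction: keep the last two labels
        # (a real implementation would consult the public suffix list)
        parts = host.split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else host
    return host  # byHost: all URLs of a hostname share one routing value

# URLs from the same host get the same routing value, hence the same shard
assert routing_key("https://a.example.com/x") == routing_key("https://a.example.com/y")
```

Passing the routing value on every index and search request is what makes each spout task able to talk to a single shard, and it is also exactly what makes one very large host inflate one shard.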
The AggregationSpout works by running aggregation queries. These simply return the top N URLs per bucket (e.g. hostname), so we get both a list of buckets and some URLs due to be fetched for each of them. This is great! We get a good diversity of URLs and don't get too many of them. The trouble with the aggregations is that they operate on the whole set of documents matching a query, in our case "get me all the documents due for fetching before date D". As the crawl frontier expands and the status index gets very large, these queries can match a massive number of documents; as a result, the aggregations can take a long time, during which the crawler is starved of input URLs.
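The shape of such a query can be sketched as a terms aggregation with a top_hits sub-aggregation, filtered on the due date. This is a hedged illustration of the kind of request the AggregationSpout sends, written in Python for concision; the field names (`nextFetchDate`, `metadata.hostname`) follow the status index conventions described here but the exact shape is an assumption, not the spout's actual source.

```python
def aggregation_query(num_buckets: int, urls_per_bucket: int, now: str) -> dict:
    """Build an Elasticsearch request body of the kind described above:
    filter on documents due for fetching, then a terms aggregation per
    bucket key with a top_hits sub-aggregation returning the earliest
    URLs of each bucket. Field names are illustrative."""
    return {
        "size": 0,  # we only want the aggregation buckets, not raw hits
        "query": {"range": {"nextFetchDate": {"lte": now}}},
        "aggs": {
            "partition": {
                "terms": {"field": "metadata.hostname", "size": num_buckets},
                "aggs": {
                    "docs": {
                        "top_hits": {
                            "size": urls_per_bucket,
                            "sort": [{"nextFetchDate": {"order": "asc"}}],
                        }
                    }
                },
            }
        },
    }
```

The expensive part is the outer `range` filter: every matching document in the shard has to be considered before the buckets can be built, which is why the latency grows with the frontier.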
As a way of speeding things up, we can use sampler aggregations (es.status.sample in the config). They do make things a bit quicker, but we then lose the guarantee of fetching the earliest URLs and/or buckets. Other strategies have involved reusing the nextFetchDates from previous queries so that the search caches in Elastic can be reused. These features make the code and configuration of the Spout quite complex.
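The sampling trade-off can be made concrete with a sketch: wrapping the bucket aggregation in Elasticsearch's sampler aggregation means the terms aggregation only sees the first `shard_size` matching documents per shard, which is faster but no longer guarantees the most overdue URLs are returned. Field names are illustrative, as before.

```python
def sampled_aggregation_query(now: str, sample_size: int, num_buckets: int) -> dict:
    """Variant of the bucket query using a sampler aggregation: the terms
    aggregation below it operates on a sample of sample_size documents
    per shard instead of the whole matching set. Faster, but the earliest
    URLs / buckets may fall outside the sample."""
    return {
        "size": 0,
        "query": {"range": {"nextFetchDate": {"lte": now}}},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "partition": {
                        "terms": {"field": "metadata.hostname",
                                  "size": num_buckets}
                    }
                },
            }
        },
    }
```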
As an alternative, we wrote the CollapsingSpout, which, as its name suggests, leverages the field collapsing feature of Elasticsearch. This was promising, as it meant less configuration and the possibility of navigating through pages of results, but in practice the performance was even worse than the AggregationSpout's. See this discussion on the Elastic forum. If I understood correctly (this needs checking against the source), it is because this feature is inherited from Lucene and requires running the queries twice.
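For comparison, a field-collapsing request of the kind the CollapsingSpout relies on looks roughly like this: a plain paged search over the due documents, collapsed on the bucket key so that each hostname contributes at most a handful of results per page. Again a hedged sketch with illustrative field names, not the spout's actual query.

```python
def collapsing_query(now: str, max_urls: int, start: int = 0) -> dict:
    """Sketch of a field-collapsing search: documents due for fetching,
    sorted by urgency, collapsed on the bucket key. inner_hits pulls a
    couple of extra URLs per collapsed bucket. Field names illustrative."""
    return {
        "from": start,   # navigate through pages of results
        "size": max_urls,
        "query": {"range": {"nextFetchDate": {"lte": now}}},
        "sort": [{"nextFetchDate": {"order": "asc"}}],
        "collapse": {
            "field": "metadata.hostname",
            "inner_hits": {
                "name": "most_urgent",
                "size": 2,
                "sort": [{"nextFetchDate": {"order": "asc"}}],
            },
        },
    }
```

Compared with the aggregation variant, there is less to configure, but the collapsing itself is what turned out to be expensive in practice.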
Finally, we have the HybridSpout. It extends the AggregationSpout by running aggregations less frequently, in order to get a list of buckets and some URLs for each of them, then running simple term queries for each bucket whenever the URLs we had cached for it have all been sent down the crawl topology. This means that we send lighter queries to Elastic and constantly retrieve content from it, instead of running massive aggregations. This works fine, but it means even more configuration and tinkering with the values until we get the results we expect. The code is also quite complex, and there is a greater risk of something going wrong with its multi-threading aspects.
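The cheap per-bucket refill described above amounts to a bool filter combining a term query on the bucket key with the usual due-date range. A sketch, with the same caveat that the field names are illustrative and this is not the HybridSpout's actual code:

```python
def bucket_refill_query(bucket_key: str, now: str, max_urls: int) -> dict:
    """The light query run for a single bucket once its cached URLs have
    all been emitted into the topology: term filter on the bucket key plus
    the due-date range, most overdue URLs first. Much cheaper than an
    aggregation over the whole shard. Field names illustrative."""
    return {
        "size": max_urls,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"metadata.hostname": bucket_key}},
                    {"range": {"nextFetchDate": {"lte": now}}},
                ]
            }
        },
        "sort": [{"nextFetchDate": {"order": "asc"}}],
    }
```

The complexity is not in the query but around it: per-bucket caches, deciding when a cache is exhausted, and doing all that safely across threads.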
An additional problem is that any URLs added for a brand new bucket in the status index won't be known until the aggregations are next run, which can take some time.
A new approach?
While URLFrontier is getting better and better and will, hopefully, get used with other web crawlers, we can expect a large number of existing StormCrawler users to continue using the Elasticsearch module: mainly because that's what they already use and have invested quite some time in, but also because it offers great visibility into the data thanks to Grafana / Kibana. Despite the issues and limitations highlighted above, the performance of the spouts for Elastic is satisfactory and has enabled several StormCrawler users to crawl in the order of billions of pages!
There is, however, some inspiration we can take from URLFrontier, in particular the OpenSearch-based implementation developed by DigitalPebble Ltd for Presearch. Without going into details, it uses two different indices: one for the URLs, and another listing the queues or bucket names (e.g. the hostnames).
If a similar approach were used in StormCrawler, a Spout could periodically check the list of queues in a given shard and, separately, issue light term queries for each queue to feed the crawler. Retrieving the list of queues should be pretty fast compared to running aggregations, especially if we keep track of when a given queue was first seen. The code should also be a bit simpler and require less configuration.
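The proposed queue-listing step could be as simple as a paged search over the queues index, with no aggregation involved. This is a hypothetical sketch: the queues index, its `firstSeen` field, and the use of `search_after` pagination are all assumptions about a design that does not exist in StormCrawler yet.

```python
def queues_listing_query(max_queues: int, search_after=None) -> dict:
    """Hypothetical request against a separate queues index: page through
    queue documents, oldest first (so a queue's age can be tracked), with
    search_after to resume from the previous page. The schema is an
    assumption, not an existing StormCrawler index layout."""
    body = {
        "size": max_queues,
        "query": {"match_all": {}},
        # deterministic sort: firstSeen plus _id as a tie-breaker
        "sort": [{"firstSeen": {"order": "asc"}}, {"_id": {"order": "asc"}}],
    }
    if search_after is not None:
        body["search_after"] = search_after  # sort values of the last hit
    return body
```

Each queue returned would then be fed with the same kind of light per-queue term query already used by the HybridSpout, but against the URLs index.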
In addition, since the queues and URLs would be kept in separate indices, it would no longer be necessary to route the URLs to specific shards; only the queue documents would get routed. No more fat shards (although querying URLs for a given queue across the Elastic nodes would not be without impact).
Porting an existing status index to the new format would also be straightforward: a script could create the queues index from the content of the status index. I expect that there will be an implementation of URLFrontier using Elasticsearch at some point; if there is, it will certainly use this dual-index approach. By adopting it in StormCrawler now, users would be able to future-proof their crawls and expose the content via the URLFrontier API later on, thus making it available to crawlers in other programming languages at no extra cost.
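The one-off migration could be driven by a composite aggregation over the status index, which pages through every distinct bucket key; each page of keys would then be bulk-indexed as documents into the new queues index. A sketch of the request body, with illustrative field names and no claim that this is how an eventual script would actually be written:

```python
def migration_step(after_key=None, page_size: int = 1000) -> dict:
    """Sketch of one step of the migration: a composite aggregation over
    the status index yields a page of distinct bucket keys (plus their
    URL counts via doc_count); the caller would bulk-index each page into
    the queues index and pass the returned after_key back in to resume."""
    composite = {
        "size": page_size,
        "sources": [{"queue": {"terms": {"field": "metadata.hostname"}}}],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume pagination where we left off
    return {"size": 0, "aggs": {"queues": {"composite": composite}}}
```

Unlike the terms aggregations discussed earlier, a composite aggregation is designed to be paged exhaustively, which is exactly what a migration needs.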