Frontera (web crawling)
|Original author(s)||Alexander Sibiryakov, Javier Casas|
|Developer(s)||Scrapinghub Ltd., GitHub community|
|Initial release||November 1, 2014|
|Stable release||v0.8.1 / April 6, 2019|
|Operating system||OS X, Linux|
|License||BSD 3-clause license|
The content and structure of the World Wide Web change rapidly, and Frontera is designed to adapt quickly to these changes. Most large-scale web crawlers operate in batch mode, with sequential phases of injection, fetching, parsing, deduplication, and scheduling, which delays updating the crawl when the web changes. That batch design is mostly motivated by the relatively low random-access performance of hard disks compared to sequential access. Frontera instead relies on modern key-value storage systems, using efficient data structures and powerful hardware to crawl, parse, and schedule the indexing of new links concurrently. It is an open-source project designed to fit various use cases, with high flexibility and configurability.
Large-scale web crawls are Frontera's main purpose. Its flexibility also allows crawls of moderate size on a single machine with a few cores, using the single process and distributed spiders run modes.
- Online operation: small request batches, with parsing done right after fetching.
- Pluggable backend architecture: low-level storage logic is separated from crawling policy.
- Three run modes: single process, distributed spiders, distributed backend and spiders.
- Transparent data flow, making it easy to integrate custom components.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
- SQLAlchemy and HBase storage backends.
- Revisiting logic (only with RDBMS backend).
- Optional use of Scrapy for fetching and parsing.
- BSD 3-clause license, allowing use in any commercial product.
- Python 3 support.
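The first two features in the list above can be illustrated with a minimal sketch. The `Backend` interface, `MemoryFifoBackend`, `crawl`, and the `FAKE_WEB` data below are hypothetical and simplified for illustration; they are not Frontera's actual API. The point is only that the crawl loop talks to an abstract storage interface and that links parsed from one small batch are scheduled before the next batch is requested.

```python
from abc import ABC, abstractmethod
from collections import deque


class Backend(ABC):
    """Hypothetical storage interface: stores URLs and hands out batches."""

    @abstractmethod
    def add(self, urls):
        ...

    @abstractmethod
    def next_batch(self, max_n):
        ...


class MemoryFifoBackend(Backend):
    """Toy in-memory backend; a real one would sit on an RDBMS or HBase."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, urls):
        for url in urls:
            if url not in self._seen:  # deduplicate on insert
                self._seen.add(url)
                self._queue.append(url)

    def next_batch(self, max_n):
        n = min(max_n, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]


def crawl(backend, seeds, fetch_and_parse, batch_size=2, max_pages=10):
    """Online loop: take a small batch, parse right after fetching,
    and schedule the extracted links before asking for the next batch.
    The page limit is checked between batches only."""
    backend.add(seeds)
    crawled = []
    while len(crawled) < max_pages:
        batch = backend.next_batch(batch_size)
        if not batch:
            break
        for url in batch:
            links = fetch_and_parse(url)
            crawled.append(url)
            backend.add(links)
    return crawled


# A fake three-page "web" standing in for real HTTP fetching.
FAKE_WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
order = crawl(MemoryFifoBackend(), ["a"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

Swapping `MemoryFifoBackend` for another class that implements the same two methods would change how URLs are stored and ordered without touching the loop, which is the separation the pluggable backend bullet describes.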
Comparison to other web crawlers
Although Frontera isn't a web crawler itself, it requires a streaming crawling architecture rather than a batch crawling approach.
StormCrawler is another stream-oriented crawler built on top of Apache Storm whilst using some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by ISTResearch with precise monitoring and management of the queue in mind. These systems provide fetching and/or queueing mechanisms, but no link database or content processing.
At Scrapinghub Ltd., a crawler built primarily on Frontera, with Kafka as the message bus and HBase as storage for link states and the link database, processes 1600 requests per second at peak. The crawler operates in cycles; each cycle takes 1.5 months and results in 1.7B downloaded pages.
A crawl of the Spanish internet resulted in 46.5M pages in 1.5 months on an AWS cluster with 2 spider machines.
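As a rough sanity check, the average download rates implied by these figures can be worked out. The sketch below assumes a 30-day month, which the source does not specify, so the results are approximate.

```python
SECONDS_PER_DAY = 86_400
CYCLE_DAYS = 1.5 * 30  # assumption: 1.5 months taken as 45 days


def average_rate(pages, days):
    """Average pages downloaded per second over the whole crawl."""
    return pages / (days * SECONDS_PER_DAY)


broad_crawl_rate = average_rate(1.7e9, CYCLE_DAYS)     # 1.7B pages per cycle
spanish_crawl_rate = average_rate(46.5e6, CYCLE_DAYS)  # 46.5M pages

print(round(broad_crawl_rate), round(spanish_crawl_rate))
```

Under that assumption the averages come to roughly 437 and 12 pages per second, so the first crawl's average sits well below its stated peak of 1600 requests per second.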
The first version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and the queue. It was able to crawl for days. After reaching a noticeable volume of links, it started to spend more and more time on SELECT queries, making the crawl inefficient. At that time Frontera was developed under DARPA's Memex program and was included in its catalog of open source projects.
In 2015, subsequent versions of Frontera used HBase to store the link database and queue. The application was split into two parts: backend and fetcher. The backend was responsible for communicating with HBase and was connected to the fetcher by means of Kafka: the fetcher only read a Kafka topic with URLs to crawl and produced crawl results to another topic consumed by the backend, thus creating a closed cycle. The first priority queue prototype suitable for web-scale crawling was implemented during that time. The queue produced batches with limits on the number of hosts and the number of requests per host.
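The batch limits described above can be sketched as follows. This is a minimal in-memory illustration, not the HBase-backed queue itself; the function `next_batch`, its parameters, and the example URLs are all hypothetical.

```python
import heapq
from collections import Counter
from urllib.parse import urlsplit


def next_batch(queue, max_requests, max_hosts, max_per_host):
    """Pop the highest-priority URLs subject to per-batch host limits.

    `queue` is a heap of (negated_priority, url) tuples, so the smallest
    tuple is the highest-priority entry. Entries that would break a limit
    are pushed back so that a later batch can pick them up.
    """
    per_host = Counter()
    batch, skipped = [], []
    while queue and len(batch) < max_requests:
        item = heapq.heappop(queue)
        host = urlsplit(item[1]).hostname
        too_many_hosts = host not in per_host and len(per_host) >= max_hosts
        if too_many_hosts or per_host[host] >= max_per_host:
            skipped.append(item)
            continue
        per_host[host] += 1
        batch.append(item[1])
    for item in skipped:
        heapq.heappush(queue, item)
    return batch


queue = []
for priority, url in [
    (0.9, "http://a.example/1"),
    (0.8, "http://a.example/2"),
    (0.7, "http://a.example/3"),
    (0.6, "http://b.example/1"),
    (0.5, "http://c.example/1"),
]:
    heapq.heappush(queue, (-priority, url))

batch = next_batch(queue, max_requests=10, max_hosts=2, max_per_host=2)
print(batch)
```

With two hosts and two requests per host allowed, the third `a.example` URL and the `c.example` URL stay in the queue for a later batch. Note that this toy version may scan the whole queue once the limits are hit, which a web-scale implementation would need to avoid.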
The next significant milestone in Frontera's development was the introduction of the crawling strategy and the strategy worker, along with the abstraction of the message bus. It became possible to code a custom crawling strategy without dealing with the low-level backend code that operates the queue. An easy way to specify which links should be scheduled, when, and with what priority made Frontera a true crawl frontier framework. Kafka was quite a heavy requirement for small crawlers, and the message bus abstraction made it possible to integrate almost any messaging system with Frontera.
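The idea of a crawling strategy kept separate from queue handling might look like the sketch below. The `Link` type, the `BreadthFirstStrategy` class, and the `schedule` function are hypothetical names chosen for illustration and do not reproduce Frontera's strategy API. The sketch covers which links to schedule and with what priority; the question of when to revisit is left out.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Link:
    url: str
    depth: int  # number of hops from a seed URL


class BreadthFirstStrategy:
    """Hypothetical policy: stay on allowed hosts, prefer shallow links."""

    def __init__(self, max_depth, allowed_hosts):
        self.max_depth = max_depth
        self.allowed_hosts = set(allowed_hosts)

    def priority(self, link: Link) -> Optional[float]:
        """Return a score in (0, 1], or None if the link is not scheduled."""
        if urlsplit(link.url).hostname not in self.allowed_hosts:
            return None
        if link.depth > self.max_depth:
            return None
        return 1.0 / (1 + link.depth)


def schedule(strategy, links):
    """Stand-in for a strategy worker: ask the policy for a score and
    return (score, url) pairs, highest priority first."""
    scored = []
    for link in links:
        score = strategy.priority(link)
        if score is not None:
            scored.append((score, link.url))
    return sorted(scored, reverse=True)


links = [
    Link("http://example.com/", 0),
    Link("http://example.com/a/b", 2),
    Link("http://other.test/", 1),
    Link("http://example.com/deep", 5),
]
strategy = BreadthFirstStrategy(max_depth=3, allowed_hosts={"example.com"})
plan = schedule(strategy, links)
print(plan)
```

Only the policy class decides what is worth crawling; the code that stores and serves the queue never needs to change when the policy does.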
- Frontera documentation at ReadTheDocs.
- Sibiryakov, Alexander (29 Mar 2017). "Frontera: архитектура фреймворка для обхода веба и текущие проблемы". Habrahabr.
- Sibiryakov, Alexander (15 Oct 2015). "frontera-open-source-large-scale-web-crawling-framework". Speakerdeck.
- "Open Catalog, Memex (Domain-Specific Search)".