Proceedings of ISP RAS, 2021 Volume 33, Issue 3, Pages 87–100 (Mi tisp601)

High performance distributed web-scraper

D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov

Peter the Great St. Petersburg Polytechnic University

Abstract: Over the past decade, the Internet has become a vast and rich source of data. This data is used to extract knowledge through machine learning analysis. To mine web information, the data must first be extracted from its source and loaded into analytical storage; this is an ETL (extract, transform, load) process. Web sources expose their data in different ways: either through an API over HTTP or only as HTML source code that must be parsed. This article presents an approach to high-performance data extraction from sources that do not provide an API. The distinctive features of the proposed approach are load balancing, two levels of data storage, and the separation of file downloading from scraping. The approach is implemented using Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are also described.

Keywords: web-scraping, web-crawling, distributed data collection, distributed data analysis.

Language: English

DOI: 10.15514/ISPRAS-2021-33(3)-7


