Automated web store product scraping using Node.js
Kallio, Aleksi (2015)
Degree Programme in Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2015-06-03
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201505191312
Abstract
Various fields of electronic commerce have grown substantially in the last decade, mainly owing to the increased accessibility of the internet and improvements in other network technologies. The abundance of mobile devices has also made electronic commerce easily accessible to everyone, from anywhere, at any time. The largest form of electronic commerce is online shopping, which is a huge and steadily growing worldwide business.
The growth of online shopping brings new possibilities for market research and behavioural research. Data from online shopping could, for example, be used to study price changes and commodity consumption across the globe. Studying these global phenomena requires large quantities of online shopping data. The product catalogues of online stores are especially well suited to many kinds of research. To gather large quantities of information from these catalogues, it should be possible to acquire them from multiple stores automatically and reliably, repeatedly over a significant timespan.
In this thesis, a web store product scraper capable of collecting product catalogue information from several web stores was implemented. The software was implemented using the JavaScript programming language, the Node.js runtime, the MongoDB NoSQL database and multiple well-proven software development architectures. The scraper was configured and tested with several different settings on three web stores of different sizes. The results were promising. A significant number of products was scraped from each store, and the numbers were in line with the sizes of the stores. The stores were scraped concurrently and simultaneously, without supervision and with low impact on system resources.
Collecting product information from online stores proved feasible, although collecting it from large web stores takes time. The information can be scraped concurrently and simultaneously from multiple web stores. Future work should concentrate on building a framework around the web store product scrapers rather than on optimising system resource consumption. The framework should simplify the configuration and monitoring of multiple simultaneous web store product scrapers.