Effortlessly scrape the web using just a few keywords!
Save scraped PDFs in a separate folder, and convert the data stored in the MongoDB database into PDF or JSON files.
This guide walks you through setting up a web scraping environment using Scrapy and MongoDB on Ubuntu/Debian Linux systems. With just a few keywords, you'll be able to scrape the web and store the results in a MongoDB database.
Set up a Python virtual environment and install the dependencies:

```shell
python3 -m venv scraper
source scraper/bin/activate
pip install -r requirements.txt
```

Add the MongoDB 6.0 package repository and install MongoDB:

```shell
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
```

Create a data directory and start the MongoDB server:

```shell
mkdir -p ~/path/to/your/project/data/db
mongod --dbpath ~/path/to/your/project/data/db
```

If you are using WSL, you may want to specify a different port or use configuration files:
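As a sketch of the configuration-file route, the same options can live in `/etc/mongod.conf` (the default file read by the `mongod` systemd service); paths and port here are placeholders to adjust for your setup:

```yaml
# /etc/mongod.conf -- sketch; adjust dbPath and port for your environment
storage:
  dbPath: /home/<user>/data/db
net:
  bindIp: 0.0.0.0
  port: 27017
```

You can then start the server with `mongod --config /etc/mongod.conf` instead of passing flags.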
```shell
mongod --dbpath ~/data/db --bind_ip 0.0.0.0 --port 27017
```

Open a new terminal window and connect to the MongoDB server:
```shell
mongo
```

If you encounter an error (note that MongoDB 6.0 ships the `mongosh` shell rather than the legacy `mongo` shell, so `mongosh` may work where `mongo` does not), run:
```shell
sudo apt update
sudo apt install mongodb-clients
```

Navigate to the crawler directory:

```shell
cd webcrawler
```
Run the spider with your search keywords:

```shell
scrapy crawl spider -a keywords="climate change"  # replace with whatever keywords you want to scrape for
```
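Scrapy passes each `-a name=value` pair to the spider's `__init__` as a string attribute. A minimal sketch of how the spider might turn that string into search terms (the class name, comma-splitting convention, and search URL are illustrative assumptions; Scrapy boilerplate is omitted so the parsing can be shown on its own):

```python
from urllib.parse import quote_plus


class KeywordArgsSketch:
    """Stand-in for a Scrapy Spider subclass; in a real spider this
    logic would live in your Spider's __init__."""

    def __init__(self, keywords=""):
        # -a values arrive as plain strings; split on commas, drop blanks.
        self.keywords = [k.strip() for k in keywords.split(",") if k.strip()]

    def search_urls(self, base="https://example.com/search?q="):
        # Build one URL-encoded search URL per keyword (base URL is hypothetical).
        return [base + quote_plus(k) for k in self.keywords]
```

For example, `scrapy crawl spider -a keywords="climate change, solar power"` would yield two keywords and two start URLs under this scheme.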
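For the JSON export mentioned at the top, note that documents fetched from MongoDB carry values (such as `ObjectId` and `datetime`) that the standard-library `json` module cannot serialize directly. One simple workaround is `default=str`, sketched below; in a real project the documents would come from `collection.find()` via pymongo, with sample dicts standing in here:

```python
import json
from datetime import datetime


def docs_to_json(docs, path):
    """Write MongoDB-style documents to a JSON file.

    Values json can't handle natively (ObjectId, datetime, ...)
    are stringified via default=str.
    """
    with open(path, "w") as f:
        json.dump(list(docs), f, indent=2, default=str)


# Stand-in documents; real ones would come from collection.find().
docs = [{"_id": "64f0c0ffee", "keyword": "climate change",
         "scraped_at": datetime(2023, 9, 1)}]
docs_to_json(docs, "export.json")
```

`default=str` is lossy (types become strings), but it keeps the export dependency-free; pymongo's `bson.json_util` is the richer alternative if round-tripping matters.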