This repository contains two Python scripts for scraping data from Fragrantica.com: parser_links.py, which collects links to perfume pages, and parser_data.py, which extracts detailed information about perfumes from those links.
The project is organized as follows:
├── config/
│   └── config.yaml            # Configuration file for the scripts
├── data/
│   ├── fragrance_links.csv    # CSV file storing the collected perfume links
│   └── fragrance_data.csv     # CSV file storing the scraped data
├── logs/
│   ├── error_log_data.log     # Log file for errors during data parsing
│   └── error_log_links.log    # Log file for errors during link parsing
├── src/                       # Currently empty; can be used to add other scripts
├── .gitignore                 # Files ignored by git
├── environment.yml            # Environment file for setting up the virtual env
├── parser_data.py             # Script to collect detailed perfume data
├── parser_links.py            # Script to collect perfume page links
└── README.md                  # This file
parser_links.py gathers links to perfume pages from Fragrantica.com by gender and year of release and saves them to data/fragrance_links.csv. It iterates through the gender categories (male, female, unisex) and, for each one, collects links for perfumes released in every year from 2024 down to 1920. Along the way it handles "Show more results" buttons, closes popup banners, applies a configurable timeout, picks a random user agent to mimic real users, removes duplicate links, and logs errors to logs/error_log_links.log. Its configuration (website links, the number of elements to parse from a page, the timeout, the path for saving data, and the list of user agents) is loaded from config/config.yaml.
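The iteration order and the housekeeping steps described above can be sketched without touching the network. This is a minimal illustration, not the code in parser_links.py: the function names (build_jobs, pick_user_agent, dedupe_links) and the exact gender labels are assumptions made for the example.

```python
import random
from itertools import product

# Assumed labels; the real script may name the categories differently.
GENDERS = ("male", "female", "unisex")


def build_jobs(first_year=2024, last_year=1920):
    """Return (gender, year) pairs, newest year first within each gender."""
    years = range(first_year, last_year - 1, -1)
    return list(product(GENDERS, years))


def pick_user_agent(user_agents, rng=random):
    """Choose one user agent at random from the configured list."""
    return rng.choice(user_agents)


def dedupe_links(links):
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))


if __name__ == "__main__":
    jobs = build_jobs()
    print(len(jobs), jobs[0], jobs[-1])
```

Keeping these steps as small pure functions makes them easy to check separately from the browser automation, which is the part most likely to break when the site changes.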
parser_data.py reads the perfume page links from data/fragrance_links.csv, extracts detailed information from each page on Fragrantica.com, and saves the results to data/fragrance_data.csv. The extracted fields include the perfume title, main accords, votes, ratings, seasonality, fragrance notes, longevity, sillage, gender, and price-to-value ratings. The script closes popup banners and logs errors to logs/error_log_data.log. Its configuration, such as the path to the links file and the list of user agents, is loaded from config/config.yaml.
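The read-links, write-records flow can be sketched with the standard csv module. This is an illustration under assumptions rather than the script's actual code: the column names in FIELDS, the helper names, and the guess that links sit in the first column of the links file are all hypothetical.

```python
import csv
import io

# Hypothetical column names; the real output file has more fields
# and may label them differently.
FIELDS = ["url", "title", "rating", "votes"]


def read_links(handle):
    """Return links from the first CSV column, skipping blanks and any header row."""
    links = []
    for row in csv.reader(handle):
        if row and row[0].strip().startswith("http"):
            links.append(row[0].strip())
    return links


def write_records(handle, records):
    """Write one row per perfume; missing fields stay blank, unknown keys are ignored."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    source = io.StringIO("link\nhttps://example.com/perfume-a\n")
    output = io.StringIO()
    rows = [{"url": url, "title": "placeholder"} for url in read_links(source)]
    write_records(output, rows)
    print(output.getvalue())
```

Leaving missing fields blank instead of raising means one page with an absent rating does not stop the whole run; whether the real script behaves this way is not stated here.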
- Clone the repository:
    git clone https://github.com/your_username/your_repository.git
    cd your_repository
- Create and activate a virtual environment using conda:
    conda env create --file environment.yml
    conda activate web_scraper
  or, with mamba:
    mamba env create -f environment.yml
  Either command creates a new conda environment named web_scraper using the dependencies specified in environment.yml. This file ensures that you have all the necessary Python packages at the correct versions to run the project. The conda activate web_scraper command then activates this environment so that your current shell uses the installed packages.
- Install Playwright browsers:
    playwright install
  This command downloads the browser binaries that Playwright needs to operate correctly.
- Configure the config/config.yaml file: adjust parameters such as website links, timeouts, and output paths to your needs.
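As a rough sketch of how the settings in config/config.yaml might be checked after loading, the example below validates a plain dictionary. Every key name here (base_url, timeout_ms, links_path, data_path, user_agents), the default timeout, and its unit of milliseconds are assumptions for illustration; consult the actual file for the real keys. In practice the dictionary would come from a YAML parser, for example PyYAML's yaml.safe_load.

```python
from dataclasses import dataclass, field


@dataclass
class ScraperConfig:
    """Hypothetical view of the settings; real key names may differ."""
    base_url: str
    timeout_ms: int = 30000  # assumed unit: milliseconds
    links_path: str = "data/fragrance_links.csv"
    data_path: str = "data/fragrance_data.csv"
    user_agents: list = field(default_factory=list)


def load_config(raw):
    """Build a ScraperConfig from a dict and reject unusable values.

    An unexpected key in raw raises TypeError from the dataclass constructor.
    """
    cfg = ScraperConfig(**raw)
    if cfg.timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    if not cfg.user_agents:
        raise ValueError("user_agents must list at least one entry")
    return cfg


if __name__ == "__main__":
    example = {
        "base_url": "https://www.fragrantica.com",
        "user_agents": ["example-agent/1.0"],
    }
    print(load_config(example))
```

Validating once at startup surfaces a mistyped path or an empty user-agent list before any pages are requested.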
To run the scripts:
- Ensure you have activated the conda environment (conda activate web_scraper).
- Run parser_links.py first:
    python parser_links.py
  This will create the data/fragrance_links.csv file containing perfume links.
- Then, run parser_data.py:
    python parser_data.py
  This script will use the generated fragrance_links.csv to collect data and save it to data/fragrance_data.csv.
Both scripts utilize Python's logging module to record errors during execution. Error messages are saved in:
- logs/error_log_links.log for parser_links.py
- logs/error_log_data.log for parser_data.py
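A file-based error logger of this kind can be set up with the standard logging module roughly as follows. This is a sketch, not the scripts' actual configuration: the helper name make_error_logger and the message format are assumptions.

```python
import logging
from pathlib import Path


def make_error_logger(name, log_path):
    """Return a logger that appends ERROR-level records to log_path."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A script could call make_error_logger("parser_links", "logs/error_log_links.log") once at startup and then use logger.exception(...) inside an except block, which records the traceback along with the message.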