A robust tool for extracting detailed product reviews from Guardian Malaysia product pages. It transforms scattered customer feedback into clean, structured datasets for analysis, monitoring, and reporting. Ideal for teams needing reliable Guardian Malaysia review data at scale.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
This project extracts product review information from Guardian Malaysia product pages and converts it into structured, analysis-ready data. It solves the problem of manually collecting and organizing customer reviews across multiple products. It is built for analysts, e-commerce teams, researchers, and product managers who rely on accurate review intelligence.
- Collects all available reviews directly from product pages
- Normalizes ratings, titles, content, and metadata into consistent fields
- Supports batch processing of multiple product URLs
- Enables longitudinal analysis using crawl and publication dates
- Produces clean exports for analytics and BI workflows
| Feature | Description |
|---|---|
| Review Extraction | Scrapes complete review text, titles, and ratings from product pages. |
| Product Metadata | Captures product identifiers, names, segments, and categories. |
| Batch URL Support | Processes multiple Guardian Malaysia product URLs in one run. |
| Structured Output | Returns normalized, analysis-ready review objects. |
| Temporal Tagging | Includes review dates, crawl dates, and quarter labeling. |
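As an illustration of how the identifier fields could be derived, here is a minimal sketch based only on the pattern visible in the sample output, where the product URL slug ends in the numeric product ID and review IDs look like `<Product_Id>-rev-<n>`. The function names `product_id_from_url` and `make_review_id` are hypothetical and are not taken from the project's source.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: product page paths end in "-<digits>.html", as in the sample URL.
_SLUG_ID = re.compile(r"-(\d+)\.html$")


def product_id_from_url(url: str) -> Optional[str]:
    """Return the trailing numeric ID from a product URL, or None if absent."""
    path = urlparse(url).path  # drops the query string such as "?page=1"
    match = _SLUG_ID.search(path)
    return match.group(1) if match else None


def make_review_id(product_id: str, position: int) -> str:
    """Build a review ID in the "<Product_Id>-rev-<n>" shape seen in the sample."""
    return f"{product_id}-rev-{position}"
```

For the sample URL `https://www.guardian.com.my/dove-shower-1l-beauty-nour-121068601.html?page=1`, this sketch yields the product ID `121068601`, and the first review would be labeled `121068601-rev-1`.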
| Field Name | Field Description |
|---|---|
| Product_Id | Unique identifier of the product being reviewed. |
| Review_Id | Unique identifier for each individual review. |
| Rating | Numerical rating score associated with the review. |
| Title | Review title or short summary text. |
| Body | Main textual content of the review. |
| Full_Review | Combined review text used for analysis. |
| Product_Name | Name of the reviewed product. |
| Product_Segment | High-level product category. |
| Product_Segment2 | Secondary product classification. |
| Gender | Intended gender segment of the product. |
| Country | Country associated with the product listing. |
| Date | Original publication date of the review. |
| Year_Quarter | Derived year and quarter label for trend analysis. |
| URL | Source product page URL. |
| Crawled_Date | Date when the review data was collected. |
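The `Year_Quarter` field is described as a label derived from the review date. A minimal sketch of one way to compute it follows. It assumes dates are formatted as `DD-MM-YYYY`, which the sample values such as `06-04-2024` suggest but do not confirm. The function name `year_quarter` is hypothetical.

```python
from datetime import datetime


def year_quarter(date_text: str, fmt: str = "%d-%m-%Y") -> str:
    """Turn a date string into a "YYYY-Qn" label.

    Assumption: the default format is day-month-year. Pass a different
    `fmt` if the real data uses another layout.
    """
    parsed = datetime.strptime(date_text, fmt)
    quarter = (parsed.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return f"{parsed.year}-Q{quarter}"
```

Under the day-month-year assumption, `year_quarter("06-04-2024")` returns `"2024-Q2"`, which matches the sample record.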
```json
[
  {
    "Product_Id": "121068601",
    "Review_Id": "121068601-rev-1",
    "Rating": 100,
    "Title": "dove-shower-1l-beauty-nour-121068601",
    "Body": "Best!",
    "Sentiment": null,
    "Section": "",
    "Higher_Topic": null,
    "Granular_Topic": null,
    "Source": "Guardian",
    "Full_Review": "Best!",
    "Review_Type": "Product Review",
    "Title_Trans": "",
    "Body_Trans": "Best!",
    "Full_Review_Trans": "Best!",
    "Product_Name_Trans": "dove-shower-1l-beauty-nour-121068601",
    "Product_Segment": "Supplements",
    "Gender": "Unisex",
    "Product_Segment2": "Wellness",
    "Year_Quarter": "2024-Q2",
    "Country": "Malaysia",
    "Date": "06-04-2024",
    "Product_Name": "dove-shower-1l-beauty-nour-121068601",
    "Brand": null,
    "URL": "https://www.guardian.com.my/dove-shower-1l-beauty-nour-121068601.html?page=1",
    "Crawled_Date": "10-06-2025"
  }
]
```
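Because the output is a flat JSON array, it can be moved into spreadsheet or BI tools with very little code. The sketch below uses only the Python standard library and a shortened excerpt of the sample record. The helper `reviews_to_csv` is hypothetical, and writing `null` values as empty cells is a choice made for this example rather than documented behavior of the scraper.

```python
import csv
import io
import json

# Shortened excerpt of the sample record; the real output has more fields.
SAMPLE_JSON = (
    '[{"Product_Id": "121068601", "Rating": 100, '
    '"Body": "Best!", "Brand": null}]'
)


def reviews_to_csv(records, columns):
    """Serialize selected fields of each review record as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the requested columns; null or missing values become "".
        row = {}
        for col in columns:
            value = record.get(col)
            row[col] = "" if value is None else value
        writer.writerow(row)
    return buffer.getvalue()


records = json.loads(SAMPLE_JSON)
csv_text = reviews_to_csv(records, ["Product_Id", "Rating", "Body", "Brand"])
```

To run this against a full export, replace `SAMPLE_JSON` with the contents of a file such as `data/sample_output.json` and extend the column list as needed.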
```
Guardianmy Reviews Spider/
├── src/
│   ├── main.py
│   ├── review_parser.py
│   ├── product_parser.py
│   ├── validators.py
│   └── utils.py
├── data/
│   ├── sample_input.json
│   └── sample_output.json
├── config/
│   └── settings.example.json
├── requirements.txt
└── README.md
```
- E-commerce teams use it to monitor customer feedback, so they can improve product positioning and listings.
- Market analysts use it to study review trends, so they can identify shifts in consumer sentiment.
- Brand managers use it to track product perception, so they can respond to recurring issues faster.
- Data scientists use it to build sentiment or rating models, so they can predict product performance.
How do I provide input URLs? You supply an array of Guardian Malaysia product page URLs, and the scraper processes each page sequentially.
Does it support multiple products at once? Yes, multiple product URLs can be processed in a single run for batch review extraction.
Are translations required for usage? No, translation fields are optional and can be ignored if not needed for your workflow.
Is the output suitable for analytics tools? Yes, the structured format is designed to integrate easily with databases, dashboards, and data pipelines.
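Before handing a batch of URLs to the scraper, it can help to filter out entries that are clearly not Guardian Malaysia product pages. The following is a rough sketch of such a pre-check, not the project's own validation logic. The host name is taken from the sample output URL, the `.html` suffix rule is an assumption based on that same URL, and `clean_input_urls` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Host taken from the URL in the sample output.
GUARDIAN_MY_HOST = "www.guardian.com.my"


def clean_input_urls(urls):
    """Return de-duplicated URLs that look like Guardian Malaysia product pages.

    Order is preserved so pages can still be processed sequentially.
    """
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        if parsed.scheme != "https" or parsed.hostname != GUARDIAN_MY_HOST:
            continue  # wrong scheme or wrong site
        if not parsed.path.endswith(".html"):
            continue  # assumption: product pages end in ".html"
        if url in seen:
            continue  # skip exact duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

The cleaned list can then be written to the input file as a plain JSON array of URL strings.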
Primary Metric: Processes dozens of reviews per minute for each product page on average.
Reliability Metric: Maintains a high success rate across paginated review pages with consistent field coverage.
Efficiency Metric: Optimized parsing minimizes redundant page processing and memory usage.
Quality Metric: Delivers high data completeness with consistent review-to-product mapping and timestamp accuracy.
