ClimateCrawlPredict
A data mining and machine learning project that automatically collects weather data from China Weather Network (www.weather.com.cn) and applies the K-Means clustering algorithm to analyze and identify patterns or types of weather conditions.
📖 Overview
This project is an end-to-end pipeline for weather data acquisition and analysis. It consists of two main components:
Scrapy Spider: A robust web crawler built with Scrapy to efficiently extract historical and forecasted weather data from China Weather Network.
K-Means Clustering: A machine learning module that processes the scraped data, performs preprocessing, and uses the K-Means algorithm to cluster the weather data points into distinct groups. This helps in identifying common weather patterns (e.g., hot & dry, cold & humid, mild & rainy) without prior labeling.
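The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual module: the feature layout (maximum temperature, relative humidity, rainfall), the sample values, and the choice of three clusters are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical daily records: [max temperature (°C), relative humidity (%), rainfall (mm)].
# These values are invented for illustration; they are not scraped data.
records = np.array([
    [35.0, 20.0, 0.0],   # hot and dry
    [36.5, 25.0, 0.0],
    [34.0, 18.0, 0.2],
    [2.0, 85.0, 1.0],    # cold and humid
    [0.5, 90.0, 0.5],
    [3.0, 80.0, 2.0],
    [18.0, 70.0, 12.0],  # mild and rainy
    [20.0, 75.0, 13.0],
    [19.0, 72.0, 11.0],
])

# Standardize each feature so that no single unit dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(records)

# n_clusters=3 is an assumption that mirrors the three example patterns above.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(scaled)

# Cluster ids are arbitrary integers; only the grouping itself is meaningful.
for row, label in zip(records, labels):
    print(f"cluster {label}: {row}")
```

Because K-Means is unsupervised, the cluster ids carry no names. Descriptions such as "hot & dry" have to be assigned afterwards by inspecting each cluster's members or centroid.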
✨ Features
Modular Scrapy Spider: Configurable to scrape data for specific cities and date ranges.
Structured Data Storage: Outputs cleaned data into CSV or JSON formats for easy analysis.
Data Preprocessing: Handles missing values, normalizes features, and prepares data for machine learning.
Unsupervised Learning: Utilizes K-Means to find inherent groupings in weather data.
Visualization: Includes scripts to generate plots (e.g., Elbow Method, PCA scatter plots) to visualize clusters and results.
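As a rough sketch of how the preprocessing, Elbow Method, and PCA features listed above might fit together: the column names, the sample values, and median imputation are assumptions chosen for this example and may differ from what the project's scripts do.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical scraped table; column names and values are invented for illustration.
df = pd.DataFrame({
    "max_temp_c": [35.0, 36.5, np.nan, 2.0, 0.5, 3.0, 18.0, 20.0, 19.0],
    "humidity_pct": [20.0, 25.0, 18.0, 85.0, np.nan, 80.0, 70.0, 75.0, 72.0],
    "rain_mm": [0.0, 0.0, 0.2, 1.0, 0.5, 2.0, 12.0, 13.0, 11.0],
})

# Missing values: fill each column with its median (one simple strategy among several).
filled = df.fillna(df.median())

# Normalization: zero mean and unit variance per feature.
scaled = StandardScaler().fit_transform(filled)

# Elbow Method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = km.inertia_

# PCA: project the scaled features onto two components for a 2-D scatter plot.
coords = PCA(n_components=2).fit_transform(scaled)

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.3f}")
print(coords.shape)
```

Plotting `inertias` against k with Matplotlib and looking for the point where the curve flattens gives the elbow. `coords` can be passed to a scatter plot colored by cluster label.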
🛠️ Tech Stack
Web Scraping: Scrapy
Data Processing & Analysis: Pandas, NumPy
Machine Learning: Scikit-learn
Data Visualization: Matplotlib, Seaborn