The Home Market Harvester is an end-to-end data pipeline that gathers, cleans, analyzes, models, and displays information about the real estate market. It focuses on specific areas, comparing selected properties against the broader market.
The system features an interactive dashboard that shows an overview of local market trends, compares selected properties with the market, and includes a map showing where the properties are located.
It gathers data from olx.pl and otodom.pl, which are websites listing properties in Poland.
The program runs on a personal computer and uses free, open-source tools along with two additional services for improving the data. These services provide location details through Nominatim and calculate travel times via openrouteservice. The dashboard is built with the streamlit framework, allowing it to be accessed via a local web address and shared with others.
scraping:
- Selenium: interacts with dynamically generated content and handles JavaScript-driven interactions.
- Beautiful Soup: extracts data from the HTML page source.
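The division of labor can be sketched as follows: Selenium drives the browser and hands over the rendered page source, and Beautiful Soup parses it. The markup and field names below are hypothetical, not the actual structure used by olx.pl or otodom.pl.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; the real olx.pl / otodom.pl structure differs.
page_source = """
<div class="offer">
  <h2 class="title">Cozy flat near the park</h2>
  <span class="price">450 000 zł</span>
  <span class="area">52 m²</span>
</div>
"""

def parse_offer(html: str) -> dict:
    """Extract basic fields from a single listing's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one(".title").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
        "area": soup.select_one(".area").get_text(strip=True),
    }

offer = parse_offer(page_source)
```

In the real pipeline, `page_source` would come from Selenium's `driver.page_source` after the page has finished rendering.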
cleaning:
- NumPy: support for large, multi-dimensional arrays and matrices.
- pandas: tools for reading, writing, and manipulating tabular data.
- Jupyter: facilitates incremental code development, letting users write and run code in manageable chunks for step-by-step data visualization, review, and iterative adjustment.
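A typical cleaning step with pandas might look like the sketch below. The column names and price format are hypothetical, chosen only to illustrate deduplication, missing-value handling, and type conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scrape: prices as strings, missing values, duplicates.
raw = pd.DataFrame({
    "price": ["450 000 zł", "450 000 zł", "619 500 zł", None],
    "area_m2": [52.0, 52.0, np.nan, 63.0],
})

def clean_offers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and rows without a price, then cast price to float."""
    df = df.drop_duplicates().dropna(subset=["price"]).copy()
    # Strip the currency suffix and thousands separators before casting.
    df["price"] = (
        df["price"].str.replace(" zł", "", regex=False)
                   .str.replace(" ", "", regex=False)
                   .astype(float)
    )
    return df.reset_index(drop=True)

cleaned = clean_offers(raw)
```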
data enrichment:
- Nominatim: transforms addresses into geographic coordinates using OpenStreetMap data, enhancing the mapping and visualization of property listings.
- openrouteservice: an API that calculates routes and travel times from OpenStreetMap data, improving the accuracy of travel information displayed on maps.
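At the HTTP level, a Nominatim forward-geocoding call is a plain GET request. The sketch below only builds the request without sending it; the parameters follow Nominatim's public search API, and the `User-Agent` value is a placeholder (Nominatim's usage policy requires identifying your application).

```python
from urllib.parse import urlencode
from urllib.request import Request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def build_geocode_request(address: str) -> Request:
    """Build (but do not send) a Nominatim forward-geocoding request."""
    params = urlencode({"q": address, "format": "json", "limit": 1})
    return Request(
        f"{NOMINATIM_URL}?{params}",
        # Placeholder identifier; replace with your own app name.
        headers={"User-Agent": "home-market-harvester-demo"},
    )

req = build_geocode_request("Mierzęcice, Będziński, Śląskie")
# urllib.request.urlopen(req) would return a JSON array with lat/lon fields.
```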
model developing:
- scikit-learn: a machine learning library offering tools for data analysis and pattern detection, including efficient regression models that train quickly and accurately even on small data sets.
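A regression model on a small data set trains in milliseconds with scikit-learn. The sketch below is illustrative only: the single feature (area in m²) and the prices are made up, not the project's actual feature set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: area in m² vs. price in zł.
X = np.array([[35.0], [48.0], [52.0], [63.0], [71.0]])
y = np.array([280_000, 390_000, 415_000, 505_000, 560_000])

# Fit an ordinary least-squares model and predict a new offer's price.
model = LinearRegression().fit(X, y)
predicted = model.predict(np.array([[55.0]]))[0]
```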
data visualizing:
- Streamlit: a framework that simplifies building web apps for data analysis and machine learning, turning data scripts into interactive, shareable web applications with minimal code.
- matplotlib: creates static, interactive, and animated visualizations in Python; used here for the bar charts and the map.
- seaborn: a visualization library built on matplotlib that provides a high-level interface for drawing attractive, informative statistical graphics.
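Of these, matplotlib is the easiest to demonstrate in isolation. Below is a minimal bar-chart sketch in the spirit of the dashboard's market-vs-selection comparison; the numbers are invented, and the headless `Agg` backend is used so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display server
import matplotlib.pyplot as plt

# Hypothetical summary: median price per m² for the market vs. selected offers.
labels = ["market", "selected"]
values = [9_800, 9_100]

fig, ax = plt.subplots()
bars = ax.bar(labels, values)
ax.set_ylabel("median price [zł/m²]")
ax.set_title("Market vs. selected offers")
fig.savefig("price_comparison.png")
```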
- data: houses both raw and processed datasets.
- logs: archives logs from pipeline operations, such as scraping and system activity.
- model: stores machine learning models developed from the housing data.
- notebooks: contains Jupyter Notebooks for data analysis, cleaning, and model creation.
- pipeline: the backbone of the project, encompassing scripts for scraping, cleaning, model creation, and visualization.
- .env: a key file for setting the variables the pipeline needs to run properly.
Each stage of the pipeline (`a_scraping`, `b_cleaning`, `c_model_developing`, `d_data_visualizing`) is executed sequentially:
- Scraping (`a_scraping`): collects initial data from specific sources.
- Cleaning (`b_cleaning`): enhances data quality by removing errors and preparing it for analysis.
- Model Developing (`c_model_developing`): builds and improves machine learning models.
- Data Visualizing (`d_data_visualizing`): displays data and insights through interactive dashboards.
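Consistent with the project's own lesson about preferring in-process function calls over subprocesses (see the development insights later in this README), sequential execution can be sketched as plain functions called in order. The stage functions here are placeholders, not the project's actual APIs.

```python
# Minimal orchestration sketch: each stage is a function that receives and
# returns a shared context dict. Placeholder logic, for illustration only.
def a_scraping(ctx):
    ctx["offers"] = [{"price": "450 000 zł"}]  # pretend scrape result
    return ctx

def b_cleaning(ctx):
    ctx["offers"] = [
        {**o, "price": float(o["price"].replace(" zł", "").replace(" ", ""))}
        for o in ctx["offers"]
    ]
    return ctx

def c_model_developing(ctx):
    ctx["model"] = "trained"  # stand-in for a fitted model
    return ctx

def run_pipeline():
    ctx = {}
    for stage in (a_scraping, b_cleaning, c_model_developing):
        ctx = stage(ctx)  # in-process call: easy to test with unittest
    return ctx

result = run_pipeline()
```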
Subdirectories such as orchestration and config help these processes by offering tools, helper functions, and configuration management for smooth pipeline operation.
See the Pipfile for the project's dependencies.
To set up the project environment:
pip install pipenv
pipenv install
pipenv shell

🚨 Note: the pipeline relies on external data sources, which may be subject to A/B tests, frontend changes, anti-bot measures, and server failures.
Found in the pipeline/config directory, this setup makes it easier to manage API keys, file paths, and server settings:
- Dynamic naming with `run_pipeline.conf`: the `MARKET_OFFERS_TIMEPLACE` variable automatically names data storage directories using timestamps and locations, such as `2024_02_20_16_37_54_Mierzęcice__Będziński__Śląskie`. This keeps data organized and easy to find.
- Security with the `.env` file: sensitive details like API keys, `USER_OFFERS_PATH`, `CHROME_DRIVER_PATH`, and `CHROME_BROWSER_PATH` are stored here for better security.

It is essential to obtain and configure the required API key for openrouteservice, and to set paths such as `CHROME_DRIVER_PATH`, `CHROME_BROWSER_PATH`, and `USER_OFFERS_PATH` in the `.env` file.
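The timestamped directory name can be reproduced with stdlib tools. The exact format below is inferred from the example in this README (assumption: `YYYY_MM_DD_HH_MM_SS` followed by location parts joined with double underscores); the helper name is hypothetical.

```python
from datetime import datetime

def make_timeplace(location_parts, now=None):
    """Build a '<timestamp>_<location>' directory name, e.g.
    2024_02_20_16_37_54_Mierzęcice__Będziński__Śląskie.
    Format inferred from the example in this README."""
    now = now or datetime.now()
    stamp = now.strftime("%Y_%m_%d_%H_%M_%S")
    return f"{stamp}_{'__'.join(location_parts)}"

name = make_timeplace(
    ["Mierzęcice", "Będziński", "Śląskie"],
    now=datetime(2024, 2, 20, 16, 37, 54),
)
```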
The app can be executed by running the `run_pipeline.py` script found in the `pipeline` directory.
python pipeline/run_pipeline.py --location_query "Location Name" --area_radius <radius in kilometers> --scraped_offers_cap <maximum number of offers> --destination_coords <latitude,longitude> --user_data_path <path to your data.csv>

For example, to collect up to 100 housing offers within 25 km of Warsaw, with travel times calculated to the destination coordinates (52.203531, 21.047047), and optionally compare them with your own data stored at D:\path\user_data.csv, use the following command:

python pipeline/run_pipeline.py --location_query "Warszawa" --area_radius 25 --scraped_offers_cap 100 --destination_coords "52.203531, 21.047047" --user_data_path "D:\path\user_data.csv"

The notebooks directory includes Jupyter Notebooks that provide an interactive environment for developing and handling data. These notebooks are meant for development only, not for production.
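The command-line interface above maps naturally onto Python's stdlib `argparse`. This is a sketch mirroring the documented flags, not the project's actual parser.

```python
import argparse

def build_parser():
    """Parser mirroring the documented run_pipeline.py flags (a sketch)."""
    p = argparse.ArgumentParser(prog="run_pipeline.py")
    p.add_argument("--location_query", required=True)
    p.add_argument("--area_radius", type=int, help="radius in kilometers")
    p.add_argument("--scraped_offers_cap", type=int,
                   help="maximum number of offers")
    p.add_argument("--destination_coords", help='"latitude, longitude"')
    p.add_argument("--user_data_path", default=None,
                   help="optional path to your data.csv")
    return p

args = build_parser().parse_args([
    "--location_query", "Warszawa",
    "--area_radius", "25",
    "--scraped_offers_cap", "100",
    "--destination_coords", "52.203531, 21.047047",
])
```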
The pipeline supports running each stage independently as a Python script, except for the `d_data_visualizing` stage, which uses the Streamlit framework to produce interactive visualizations. For more details on this component, see the streamlit_README.
The tests directory contains scripts that check the functionality and reliability of different parts of the pipeline. Right now, only the scraping phase has automated tests.
To execute the tests, use the following commands:

pipenv shell  # at the root of the project
python -m unittest discover -s tests -p 'test_*.py'

During development, four significant insights were gained:
- Preserving HTML source code for data integrity: due to the instability of web scraping sources, we save the HTML source code of each listing. This prevents data loss during processing and makes it easier to re-extract fields if listings change. HTML files are small, so they take up little disk space and do not affect performance, making this approach efficient and practical.
- Executing Python scripts: running Python scripts directly from `.py` files is more effective than converting Jupyter Notebooks to `.py` files and then running them; the latter often causes library-compatibility issues. Direct execution avoids these problems and ensures smoother development.
- Codebase structure simplification: the project initially adopted a modular approach, with each step executed as a separate subprocess. This complexity hindered effective testing, because the subprocesses behaved differently in the `unittest` environment. Integrating the codebase and using function calls within a single process proved easier to test and maintain.
- Updating environment variables during runtime: to prevent issues with environment variables not updating correctly, it is better to modify the relevant system files within the project directly.
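The first insight, preserving raw HTML, can be sketched with stdlib tools. The helper below is illustrative, not the project's actual code; naming files by content hash is an added assumption that makes re-saving the same page idempotent.

```python
import hashlib
from pathlib import Path

def save_listing_html(html: str, out_dir: str = "data/raw_html") -> Path:
    """Persist a listing's raw HTML so fields can be re-extracted later,
    even if the live page changes or extraction logic is revised.
    (Illustrative helper; the hash-based naming is an assumption.)"""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    path = out / f"{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path

saved = save_listing_html("<html><body>offer 123</body></html>")
```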
This project is licensed under the terms of the LICENSE file located in the project root.
Note: This README covers the overall project. For detailed information on specific components or stages, please see the README files in the respective stages directory.

