Commit 0155ce2

Merge pull request #10 from gperdrizet/dev
Updated MVP solution
2 parents edec35e + 7df376c

File tree

9 files changed: +326 -341 lines


.devcontainer/devcontainer.json

Lines changed: 11 additions & 3 deletions
```diff
@@ -1,16 +1,22 @@
 // For format details, see https://aka.ms/devcontainer.json. For config options, see the
 // README at: https://github.com/devcontainers/templates/tree/main/src/python
 {
+    // Container definition for a Python 3.11 development environment
     "name": "Python 3.11",
     "image": "mcr.microsoft.com/devcontainers/python:0-3.11",
-    "onCreateCommand": "sudo apt update && sudo apt upgrade -y && pip3 install --upgrade pip && pip3 install --user -r requirements.txt",
+
+    // Custom configuration options
     "customizations": {
         "vscode": {
+
+            // Use 'settings' to set default VS code values on container create
             "settings": {
                 "jupyter.kernels.excludePythonEnvironments": ["/usr/bin/python3"],
                 "remote.autoForwardPorts": false,
                 "remote.restoreForwardedPorts": false
             },
+
+            // Add the IDs of VS code extensions you want to install here
             "extensions": [
                 "-dbaeumer.vscode-eslint",
                 "ms-python.python",
@@ -20,7 +26,9 @@
         }
     },
 
-    // Use 'postCreateCommand' to run commands after the container is created.
-    "postCreateCommand": "mkdir -p data",
+    // Use 'onCreateCommand' to run commands once when the container is created
+    "onCreateCommand": "sudo apt update && sudo apt upgrade -y && pip3 install --upgrade pip && pip3 install --user -r requirements.txt",
+
+    // Use 'postAttachCommand' to run commands each time a user connects to the container
     "postAttachCommand": "htop"
 }
```

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,4 +1,5 @@
 __pycache__
-.venv
+.ipynb_checkpoints
 .vscode
+.venv
 data
```

README.md

Lines changed: 82 additions & 110 deletions
````diff
@@ -1,138 +1,123 @@
 # AirBnB Data Analysis & Preprocessing Tutorial
 
-A comprehensive tutorial on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This repository contains both assignment materials for bootcamp students and complete solution notebooks demonstrating professional data science workflows.
+[![Codespaces Prebuilds](https://github.com/gperdrizet/gperdrizet-data-preprocessing-project-tutorial/actions/workflows/codespaces/create_codespaces_prebuilds/badge.svg)](https://github.com/gperdrizet/gperdrizet-data-preprocessing-project-tutorial/actions/workflows/codespaces/create_codespaces_prebuilds)
 
-## Overview
+A comprehensive data science project focused on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This project demonstrates essential data cleaning, exploration, and feature engineering techniques through practical exercises with real-world data.
 
-This project guides students through the complete data preprocessing pipeline using real-world AirBnB listing data from New York City. Students will learn essential data science skills including:
+## Project Overview
 
-- **Exploratory Data Analysis (EDA)**: Understanding data distributions, identifying patterns, and spotting anomalies
-- **Feature Relationships**: Analyzing correlations and interactions between different types of variables
-- **Data Cleaning**: Handling missing values, outliers, and data quality issues
-- **Feature Engineering**: Encoding categorical variables, scaling features, and creating synthetic features
-- **Statistical Testing**: Applying appropriate statistical methods for different data types
+This project analyzes **48,895 AirBnB listings** from New York City (2019) and provides hands-on experience with:
 
-## Repository Structure
+- Data loading and exploration
+- Statistical analysis and distribution analysis
+- Feature relationship investigation using appropriate statistical tests
+- Data cleaning and null value handling
+- Feature engineering and preprocessing
+- Advanced visualization techniques
 
-```
-├── notebooks/
-│   ├── MVP.ipynb                     # Assignment notebook for students
-│   ├── MVP_solution.ipynb            # Competed MVP
-│   └── instructions.md               # Detailed assignment instructions
-├── solution/
-│   ├── 01_distributions.ipynb        # EDA and feature distributions
-│   ├── 02_correlations.ipynb         # Feature relationships analysis
-│   ├── 03_data_cleaning.ipynb        # Data cleaning strategies
-│   ├── 04_feature_engineering.ipynb  # Advanced preprocessing
-│   ├── functions.py                  # Helper functions
-├── data/
-│   └── processed/                    # Cleaned datasets
-└── requirements.txt                  # Python dependencies
-```
-
-## Learning Objectives
-
-By completing this tutorial, students will learn to:
-
-1. **Analyze Data Distributions**
-   - Generate descriptive statistics for numerical and categorical features
-   - Create appropriate visualizations (histograms, bar plots, scatter plots)
-   - Identify data quality issues and extreme values
-
-2. **Investigate Feature Relationships**
-   - Apply Chi-squared tests for categorical-categorical relationships
-   - Use Kruskal-Wallis H-tests for categorical-numerical relationships
-   - Calculate Spearman/Kendall correlations for numerical-numerical relationships
-
-3. **Clean and Preprocess Data**
-   - Select relevant features for modeling
-   - Handle missing values using various imputation strategies
-   - Address extreme values and outliers appropriately
-
-4. **Engineer Features**
-   - Apply one-hot encoding to categorical variables
-   - Transform skewed distributions using Box-Cox transformation
-   - Create polynomial features to capture non-linear relationships
 
 ## Getting Started
 
 ### Option 1: GitHub Codespaces (Recommended)
 
-1. **Fork this repository** to your GitHub account:
-   - Click the "Fork" button in the top-right corner of this repository
-   - Select your GitHub account as the destination
+1. **Fork the Repository**
+   - Click the "Fork" button on the top right of the GitHub repository page
+   - Give the fork a descriptive name including your GitHub username
+   - Click "Create fork"
+   - Bookmark or save the link to your fork
 
-2. **Start a GitHub Codespace**:
-   - Go to your forked repository
-   - Click the green "Code" button
-   - Select the "Codespaces" tab
-   - Click "Create codespace on main"
-   - Wait for the environment to set up (2-3 minutes)
-
-3. **Install dependencies**:
-   ```bash
-   pip install -r requirements.txt
-   ```
+2. **Create a GitHub Codespace**
+   - On your forked repository, click the "Code" button
+   - Select "Create codespace on main"
+   - Wait for the environment to load (dependencies are pre-installed)
 
-4. **Start working**:
+3. **Start Working**
    - Open `notebooks/MVP.ipynb` to begin the assignment
   - Refer to `notebooks/instructions.md` for detailed requirements
-   - Check the `solution/` folder for complete examples
+   - Check the `full_solution/` folder for complete examples
 
-### Option 2: Local Setup
+### Option 2: Local Development
 
-1. **Clone your forked repository**:
+1. **Prerequisites**
+   - Git
+   - Python >= 3.10
+
+2. **Clone the repository**
   ```bash
   git clone https://github.com/YOUR_USERNAME/gperdrizet-data-preprocessing-project-tutorial.git
   cd gperdrizet-data-preprocessing-project-tutorial
   ```
 
-2. **Create a virtual environment** (recommended):
+3. **Set Up Environment**
   ```bash
   python -m venv venv
-   source venv/bin/activate  # On Windows: venv\Scripts\activate
-   ```
-
-3. **Install dependencies**:
-   ```bash
+   source venv/bin/activate
   pip install -r requirements.txt
   ```
 
-4. **Launch Jupyter Notebook**:
+4. **Launch Jupyter & start the notebook**
   ```bash
-   jupyter notebook
+   jupyter notebook notebooks/MVP.ipynb
   ```
 
+## Project Structure
+
+```
+├── .devcontainer/                    # Development container configuration
+├── notebooks/                        # Jupyter notebook directory
+│   ├── MVP.ipynb                     # Assignment notebook
+│   ├── MVP_solution.ipynb            # Solution notebook
+│   ├── instructions.md               # Detailed assignment instructions
+│   └── full_solution/                # Detailed solution notebooks
+│       ├── 01_distributions.ipynb
+│       ├── 02_correlations.ipynb
+│       ├── 03_data_cleaning.ipynb
+│       ├── 04_feature_engineering.ipynb
+│       └── functions.py
+
+├── .gitignore                        # Files/directories not tracked by git
+├── requirements.txt                  # Python dependencies
+└── README.md                         # Project documentation
+```
+
+
 ## Dataset
 
-The project uses the **AirBnB NYC 2019** dataset containing 48,895 listings with the following features:
+The dataset contains **48,895 AirBnB listings** from New York City (2019) with the following key features:
+- **Price**: Property prices in USD
+- **Location**: Hierarchical location data (latitude, longitude, neighbourhood_group, neighbourhood)
+- **Listing Details**: room_type, minimum_nights, availability_365
+- **Host Information**: host_name, calculated_host_listings_count
+- **Review Data**: number_of_reviews, last_review, reviews_per_month
+- **Identifiers**: id, name
+
+**Note**: The dataset is automatically loaded from the web in the notebooks, so no manual download is required.
 
-- **Location**: `latitude`, `longitude`, `neighbourhood_group`, `neighbourhood`
-- **Listing Details**: `room_type`, `price`, `minimum_nights`, `availability_365`
-- **Host Information**: `host_name`, `calculated_host_listings_count`
-- **Review Data**: `number_of_reviews`, `last_review`, `reviews_per_month`
-- **Identifiers**: `id`, `name`
 
-The dataset is automatically loaded from the web in the notebooks, so no manual download is required.
+## Learning Objectives
+
+By completing this tutorial, students will learn to:
 
-## Assignment Instructions
+1. **Analyze Data Distributions**
+   - Generate descriptive statistics for numerical and categorical features
+   - Create appropriate visualizations (histograms, bar plots, scatter plots)
+   - Identify data quality issues and extreme values
 
-### For Students:
+2. **Investigate Feature Relationships**
+   - Apply Chi-squared tests for categorical-categorical relationships
+   - Use Kruskal-Wallis H-tests for categorical-numerical relationships
+   - Calculate Spearman/Kendall correlations for numerical-numerical relationships
 
-1. **Start with `notebooks/MVP.ipynb`** - This is your main assignment notebook
-2. **Read `notebooks/instructions.md`** - Contains detailed requirements and hints
-3. **Complete each section systematically**:
-   - Section 1: Analyze individual feature distributions
-   - Section 2: Investigate relationships between features
-   - Section 3: Clean and preprocess the data
-   - Section 4: Engineer new features
+3. **Clean and Preprocess Data**
+   - Select relevant features for modeling
+   - Handle missing values using various imputation strategies
+   - Address extreme values and outliers appropriately
 
-### Key Requirements:
+4. **Engineer Features**
+   - Apply one-hot encoding to categorical variables
+   - Transform skewed distributions using Box-Cox transformation
+   - Create polynomial features to capture non-linear relationships
 
-- **Appropriate statistical methods**: Use the right tests for different data types
-- **Clear visualizations**: Label axes, use appropriate scales, add titles
-- **Justified decisions**: Explain your choices for handling missing values and outliers
-- **Code documentation**: Comment your code and explain your reasoning
 
 ## Solution Reference
 
@@ -149,6 +134,7 @@ Each solution notebook includes:
 - Detailed interpretation of results
 - Best practices for data science workflows
 
+
 ## Key Technologies
 
 - **Python 3.8+**
@@ -159,21 +145,7 @@ Each solution notebook includes:
 - **SciPy**: Statistical testing
 - **Jupyter Notebook**: Interactive development environment
 
-## Tips for Success
-
-1. **Start Early**: This is a comprehensive assignment that requires time and thought
-2. **Read the Instructions**: The `instructions.md` file contains important hints and requirements
-3. **Understand the Data**: Spend time exploring the dataset before making decisions
-4. **Justify Your Choices**: Data cleaning decisions should be based on domain knowledge and analysis goals
-5. **Check the Solutions**: Use the solution notebooks as references, but try to solve problems independently first
-6. **Experiment**: Try different approaches and compare their effectiveness
 
 ## Contributing
 
-This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
-
----
-
-**Happy Learning!**
-
-This tutorial provides hands-on experience with real-world data science challenges. Take your time, experiment with different approaches, and don't be afraid to make mistakes - they're part of the learning process!
+This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
````
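The learning objectives above pair each combination of feature types with a specific statistical test. As a minimal sketch of what those three tests look like with `scipy.stats` — the column names mirror the AirBnB schema, but the dataframe and all values here are synthetic stand-ins, not from the repository:

```python
# Sketch: one test per feature-type pairing, on synthetic stand-in data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "neighbourhood_group": rng.choice(["Manhattan", "Brooklyn", "Queens"], n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
    "price": rng.lognormal(mean=4.5, sigma=0.7, size=n),
    "number_of_reviews": rng.poisson(20, n),
})

# Categorical vs. categorical: Chi-squared test of independence
table = pd.crosstab(df["room_type"], df["neighbourhood_group"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Categorical vs. numerical: Kruskal-Wallis H-test across group samples
samples = [group["price"] for _, group in df.groupby("room_type")]
h_stat, p_kw = stats.kruskal(*samples)

# Numerical vs. numerical: Spearman rank correlation
rho, p_rho = stats.spearmanr(df["price"], df["number_of_reviews"])

print(f"chi2 p={p_chi2:.3f} | kruskal p={p_kw:.3f} | spearman rho={rho:.2f}")
```

Each call returns a test statistic and a p-value; the Chi-squared and Kruskal-Wallis tests make no normality assumption, which suits heavily skewed listing data.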

notebooks/MVP.ipynb

Lines changed: 7 additions & 3 deletions
```diff
@@ -400,9 +400,11 @@
    "outputs": [],
    "source": [
     "# Quantify the strength of correlation between pairs of numerical features using Spearman or Kendall\n",
-    "# correlation coefficient. SciPy.stats has pairwise implementations for both: \n",
+    "# correlation coefficient. SciPy.stats has pairwise implementations for both:\n",
+    "#\n",
     "# spearmanr https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html\n",
     "# kendalltau https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html\n",
+    "#\n",
     "# Pandas `df.corr()`is another option to calculate a full cross-correlation matrix for a dataframe\n",
     "# in one call - but be careful with the defaults, they are not appropriate for this data!."
    ]
@@ -416,7 +418,9 @@
    "source": [
     "# Plot relationships between numerical features using a scatter plot with Matplotlib. Be sure to label axes \n",
     "# and/or plot and pick appropriate scales. Adding a best fit line can be nice, but is not super\n",
-    "# important for this data. Question: why not? Related: why am I suggesting non-parametric rank\n",
+    "# important for this data. \n",
+    "# \n",
+    "# Question: why not? Related: why am I suggesting non-parametric rank\n",
     "# based correlation coefficients above?"
    ]
   },
```
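The comment reformatted in the hunk above points at `scipy.stats.spearmanr`, `scipy.stats.kendalltau`, and Pandas `df.corr()`, and warns about the Pandas defaults: `df.corr()` computes Pearson correlations unless told otherwise. A short sketch of the three calls, where `num_df` is a hypothetical frame standing in for the numerical AirBnB columns (values made up):

```python
# Sketch: rank-based correlation between numerical features.
import pandas as pd
from scipy import stats

num_df = pd.DataFrame({
    "price": [49, 80, 120, 200, 650, 95],
    "minimum_nights": [1, 2, 3, 30, 2, 5],
    "number_of_reviews": [74, 9, 45, 0, 3, 118],
})

# Pairwise tests: each returns (coefficient, p-value)
rho, p_rho = stats.spearmanr(num_df["price"], num_df["number_of_reviews"])
tau, p_tau = stats.kendalltau(num_df["price"], num_df["number_of_reviews"])

# Full matrix in one call. df.corr() defaults to method="pearson",
# which assumes linear relationships -- pass the method explicitly.
corr_matrix = num_df.corr(method="spearman")

print(f"spearman rho={rho:.2f} (p={p_rho:.2f}), kendall tau={tau:.2f}")
print(corr_matrix.round(2))
```

Rank-based coefficients assume only a monotonic relationship and are robust to the outliers and skew typical of this data, which is why the notebook steers students toward them.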
```diff
@@ -511,7 +515,7 @@
    "source": [
     "### 4.2. Feature scaling\n",
     "\n",
-    "Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation will improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
+    "Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation may improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
    ]
   },
  {
```
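The hunk above softens the claim ("may improve" rather than "will improve") but still points at the same tool. A minimal sketch of a Box-Cox transform via `sklearn.preprocessing.PowerTransformer`, on synthetic right-skewed data; only the API usage is the point:

```python
# Sketch: Box-Cox transform of a right-skewed feature with PowerTransformer.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=4.5, sigma=0.8, size=(1000, 1))  # synthetic prices

# Box-Cox requires strictly positive inputs; the default method,
# "yeo-johnson", also accepts zeros and negative values.
pt = PowerTransformer(method="box-cox", standardize=True)
prices_bc = pt.fit_transform(prices)

print("skew before:", round(float(skew(prices.ravel())), 2))
print("skew after: ", round(float(skew(prices_bc.ravel())), 2))
print("fitted lambda:", pt.lambdas_)
```

For features that can be zero, such as review counts, the Yeo-Johnson method is the usual fallback; in a modeling pipeline the transformer should be fit on training data only to avoid leakage.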

notebooks/full_solution/01_distributions.ipynb

Lines changed: 73 additions & 73 deletions
Large diffs are not rendered by default.
