Commit 0155ce2

Merge pull request #10 from gperdrizet/dev
Updated MVP solution
2 parents edec35e + 7df376c

File tree

9 files changed: +326 -341 lines


.devcontainer/devcontainer.json

Lines changed: 11 additions & 3 deletions
```diff
@@ -1,16 +1,22 @@
 // For format details, see https://aka.ms/devcontainer.json. For config options, see the
 // README at: https://github.com/devcontainers/templates/tree/main/src/python
 {
+    // Container definition for a Python 3.11 development environment
     "name": "Python 3.11",
     "image": "mcr.microsoft.com/devcontainers/python:0-3.11",
-    "onCreateCommand": "sudo apt update && sudo apt upgrade -y && pip3 install --upgrade pip && pip3 install --user -r requirements.txt",
+
+    // Custom configuration options
     "customizations": {
         "vscode": {
+
+            // Use 'settings' to set default VS code values on container create
             "settings": {
                 "jupyter.kernels.excludePythonEnvironments": ["/usr/bin/python3"],
                 "remote.autoForwardPorts": false,
                 "remote.restoreForwardedPorts": false
             },
+
+            // Add the IDs of VS code extensions you want to install here
             "extensions": [
                 "-dbaeumer.vscode-eslint",
                 "ms-python.python",
@@ -20,7 +26,9 @@
         }
     },
 
-    // Use 'postCreateCommand' to run commands after the container is created.
-    "postCreateCommand": "mkdir -p data",
+    // Use 'onCreateCommand' to run commands once when the container is created
+    "onCreateCommand": "sudo apt update && sudo apt upgrade -y && pip3 install --upgrade pip && pip3 install --user -r requirements.txt",
+
+    // Use 'postAttachCommand' to run commands each time a user connects to the container
     "postAttachCommand": "htop"
 }
```

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,4 +1,5 @@
 __pycache__
-.venv
+.ipynb_checkpoints
 .vscode
+.venv
 data
```

README.md

Lines changed: 82 additions & 110 deletions
````diff
@@ -1,138 +1,123 @@
 # AirBnB Data Analysis & Preprocessing Tutorial
 
-A comprehensive tutorial on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This repository contains both assignment materials for bootcamp students and complete solution notebooks demonstrating professional data science workflows.
+[![Codespaces Prebuilds](https://github.com/gperdrizet/gperdrizet-data-preprocessing-project-tutorial/actions/workflows/codespaces/create_codespaces_prebuilds/badge.svg)](https://github.com/gperdrizet/gperdrizet-data-preprocessing-project-tutorial/actions/workflows/codespaces/create_codespaces_prebuilds)
 
-## Overview
+A comprehensive data science project focused on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This project demonstrates essential data cleaning, exploration, and feature engineering techniques through practical exercises with real-world data.
 
-This project guides students through the complete data preprocessing pipeline using real-world AirBnB listing data from New York City. Students will learn essential data science skills including:
+## Project Overview
 
-- **Exploratory Data Analysis (EDA)**: Understanding data distributions, identifying patterns, and spotting anomalies
-- **Feature Relationships**: Analyzing correlations and interactions between different types of variables
-- **Data Cleaning**: Handling missing values, outliers, and data quality issues
-- **Feature Engineering**: Encoding categorical variables, scaling features, and creating synthetic features
-- **Statistical Testing**: Applying appropriate statistical methods for different data types
+This project analyzes **48,895 AirBnB listings** from New York City (2019) and provides hands-on experience with:
 
-## Repository Structure
+- Data loading and exploration
+- Statistical analysis and distribution analysis
+- Feature relationship investigation using appropriate statistical tests
+- Data cleaning and null value handling
+- Feature engineering and preprocessing
+- Advanced visualization techniques
 
-```
-├── notebooks/
-│   ├── MVP.ipynb                     # Assignment notebook for students
-│   ├── MVP_solution.ipynb            # Competed MVP
-│   └── instructions.md               # Detailed assignment instructions
-├── solution/
-│   ├── 01_distributions.ipynb        # EDA and feature distributions
-│   ├── 02_correlations.ipynb         # Feature relationships analysis
-│   ├── 03_data_cleaning.ipynb        # Data cleaning strategies
-│   ├── 04_feature_engineering.ipynb  # Advanced preprocessing
-│   ├── functions.py                  # Helper functions
-├── data/
-│   └── processed/                    # Cleaned datasets
-└── requirements.txt                  # Python dependencies
-```
-
-## Learning Objectives
-
-By completing this tutorial, students will learn to:
-
-1. **Analyze Data Distributions**
-   - Generate descriptive statistics for numerical and categorical features
-   - Create appropriate visualizations (histograms, bar plots, scatter plots)
-   - Identify data quality issues and extreme values
-
-2. **Investigate Feature Relationships**
-   - Apply Chi-squared tests for categorical-categorical relationships
-   - Use Kruskal-Wallis H-tests for categorical-numerical relationships
-   - Calculate Spearman/Kendall correlations for numerical-numerical relationships
-
-3. **Clean and Preprocess Data**
-   - Select relevant features for modeling
-   - Handle missing values using various imputation strategies
-   - Address extreme values and outliers appropriately
-
-4. **Engineer Features**
-   - Apply one-hot encoding to categorical variables
-   - Transform skewed distributions using Box-Cox transformation
-   - Create polynomial features to capture non-linear relationships
 
 ## Getting Started
 
 ### Option 1: GitHub Codespaces (Recommended)
 
-1. **Fork this repository** to your GitHub account:
-   - Click the "Fork" button in the top-right corner of this repository
-   - Select your GitHub account as the destination
+1. **Fork the Repository**
+   - Click the "Fork" button on the top right of the GitHub repository page
+   - Give the fork a descriptive name including your GitHub username
+   - Click "Create fork"
+   - Bookmark or save the link to your fork
 
-2. **Start a GitHub Codespace**:
-   - Go to your forked repository
-   - Click the green "Code" button
-   - Select the "Codespaces" tab
-   - Click "Create codespace on main"
-   - Wait for the environment to set up (2-3 minutes)
-
-3. **Install dependencies**:
-   ```bash
-   pip install -r requirements.txt
-   ```
+2. **Create a GitHub Codespace**
+   - On your forked repository, click the "Code" button
+   - Select "Create codespace on main"
+   - Wait for the environment to load (dependencies are pre-installed)
 
-4. **Start working**:
+3. **Start Working**
    - Open `notebooks/MVP.ipynb` to begin the assignment
   - Refer to `notebooks/instructions.md` for detailed requirements
-   - Check the `solution/` folder for complete examples
+   - Check the `full_solution/` folder for complete examples
 
-### Option 2: Local Setup
+### Option 2: Local Development
 
-1. **Clone your forked repository**:
+1. **Prerequisites**
+   - Git
+   - Python >= 3.10
+
+2. **Clone the repository**
   ```bash
   git clone https://github.com/YOUR_USERNAME/gperdrizet-data-preprocessing-project-tutorial.git
   cd gperdrizet-data-preprocessing-project-tutorial
   ```
 
-2. **Create a virtual environment** (recommended):
+3. **Set Up Environment**
   ```bash
   python -m venv venv
-   source venv/bin/activate  # On Windows: venv\Scripts\activate
-   ```
-
-3. **Install dependencies**:
-   ```bash
+   source venv/bin/activate
   pip install -r requirements.txt
   ```
 
-4. **Launch Jupyter Notebook**:
+4. **Launch Jupyter & start the notebook**
   ```bash
-   jupyter notebook
+   jupyter notebook notebooks/MVP.ipynb
   ```
 
+## Project Structure
+
+```
+├── .devcontainer/                    # Development container configuration
+├── notebooks/                        # Jupyter notebook directory
+│   ├── MVP.ipynb                     # Assignment notebook
+│   ├── MVP_solution.ipynb            # Solution notebook
+│   ├── instructions.md               # Detailed assignment instructions
+│   └── full_solution/                # Detailed solution notebooks
+│       ├── 01_distributions.ipynb
+│       ├── 02_correlations.ipynb
+│       ├── 03_data_cleaning.ipynb
+│       ├── 04_feature_engineering.ipynb
+│       └── functions.py
+
+├── .gitignore                        # Files/directories not tracked by git
+├── requirements.txt                  # Python dependencies
+└── README.md                         # Project documentation
+```
+
+
 ## Dataset
 
-The project uses the **AirBnB NYC 2019** dataset containing 48,895 listings with the following features:
+The dataset contains **48,895 AirBnB listings** from New York City (2019) with the following key features:
+- **Price**: Property prices in USD
+- **Location**: Hierarchical location data (latitude, longitude, neighbourhood_group, neighbourhood)
+- **Listing Details**: room_type, minimum_nights, availability_365
+- **Host Information**: host_name, calculated_host_listings_count
+- **Review Data**: number_of_reviews, last_review, reviews_per_month
+- **Identifiers**: id, name
+
+**Note**: The dataset is automatically loaded from the web in the notebooks, so no manual download is required.
 
-- **Location**: `latitude`, `longitude`, `neighbourhood_group`, `neighbourhood`
-- **Listing Details**: `room_type`, `price`, `minimum_nights`, `availability_365`
-- **Host Information**: `host_name`, `calculated_host_listings_count`
-- **Review Data**: `number_of_reviews`, `last_review`, `reviews_per_month`
-- **Identifiers**: `id`, `name`
 
-The dataset is automatically loaded from the web in the notebooks, so no manual download is required.
+## Learning Objectives
+
+By completing this tutorial, students will learn to:
 
-## Assignment Instructions
+1. **Analyze Data Distributions**
+   - Generate descriptive statistics for numerical and categorical features
+   - Create appropriate visualizations (histograms, bar plots, scatter plots)
+   - Identify data quality issues and extreme values
 
-### For Students:
+2. **Investigate Feature Relationships**
+   - Apply Chi-squared tests for categorical-categorical relationships
+   - Use Kruskal-Wallis H-tests for categorical-numerical relationships
+   - Calculate Spearman/Kendall correlations for numerical-numerical relationships
 
-1. **Start with `notebooks/MVP.ipynb`** - This is your main assignment notebook
-2. **Read `notebooks/instructions.md`** - Contains detailed requirements and hints
-3. **Complete each section systematically**:
-   - Section 1: Analyze individual feature distributions
-   - Section 2: Investigate relationships between features
-   - Section 3: Clean and preprocess the data
-   - Section 4: Engineer new features
+3. **Clean and Preprocess Data**
+   - Select relevant features for modeling
+   - Handle missing values using various imputation strategies
+   - Address extreme values and outliers appropriately
 
-### Key Requirements:
+4. **Engineer Features**
+   - Apply one-hot encoding to categorical variables
+   - Transform skewed distributions using Box-Cox transformation
+   - Create polynomial features to capture non-linear relationships
 
-- **Appropriate statistical methods**: Use the right tests for different data types
-- **Clear visualizations**: Label axes, use appropriate scales, add titles
-- **Justified decisions**: Explain your choices for handling missing values and outliers
-- **Code documentation**: Comment your code and explain your reasoning
 
 ## Solution Reference
 
@@ -149,6 +134,7 @@ Each solution notebook includes:
 - Detailed interpretation of results
 - Best practices for data science workflows
 
+
 ## Key Technologies
 
 - **Python 3.8+**
@@ -159,21 +145,7 @@ Each solution notebook includes:
 - **SciPy**: Statistical testing
 - **Jupyter Notebook**: Interactive development environment
 
-## Tips for Success
-
-1. **Start Early**: This is a comprehensive assignment that requires time and thought
-2. **Read the Instructions**: The `instructions.md` file contains important hints and requirements
-3. **Understand the Data**: Spend time exploring the dataset before making decisions
-4. **Justify Your Choices**: Data cleaning decisions should be based on domain knowledge and analysis goals
-5. **Check the Solutions**: Use the solution notebooks as references, but try to solve problems independently first
-6. **Experiment**: Try different approaches and compare their effectiveness
 
 ## Contributing
 
-This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
-
----
-
-**Happy Learning!**
-
-This tutorial provides hands-on experience with real-world data science challenges. Take your time, experiment with different approaches, and don't be afraid to make mistakes - they're part of the learning process!
+This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
````
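The learning objectives above pair each combination of feature types with a specific statistical test. As a minimal sketch of what those three tests look like with `scipy.stats` — the column names mirror the AirBnB schema, but the dataframe and all values here are synthetic stand-ins, not from the repository:

```python
# Sketch: one test per feature-type pairing, on synthetic stand-in data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "neighbourhood_group": rng.choice(["Manhattan", "Brooklyn", "Queens"], n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
    "price": rng.lognormal(mean=4.5, sigma=0.7, size=n),
    "number_of_reviews": rng.poisson(20, n),
})

# Categorical vs. categorical: Chi-squared test of independence
table = pd.crosstab(df["room_type"], df["neighbourhood_group"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Categorical vs. numerical: Kruskal-Wallis H-test across group samples
samples = [group["price"] for _, group in df.groupby("room_type")]
h_stat, p_kw = stats.kruskal(*samples)

# Numerical vs. numerical: Spearman rank correlation
rho, p_rho = stats.spearmanr(df["price"], df["number_of_reviews"])

print(f"chi2 p={p_chi2:.3f} | kruskal p={p_kw:.3f} | spearman rho={rho:.2f}")
```

Each call returns a test statistic and a p-value; the Chi-squared and Kruskal-Wallis tests make no normality assumption, which suits heavily skewed listing data.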

notebooks/MVP.ipynb

Lines changed: 7 additions & 3 deletions
```diff
@@ -400,9 +400,11 @@
    "outputs": [],
    "source": [
     "# Quantify the strength of correlation between pairs of numerical features using Spearman or Kendall\n",
-    "# correlation coefficient. SciPy.stats has pairwise implementations for both: \n",
+    "# correlation coefficient. SciPy.stats has pairwise implementations for both:\n",
+    "#\n",
     "# spearmanr https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html\n",
     "# kendalltau https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html\n",
+    "#\n",
     "# Pandas `df.corr()`is another option to calculate a full cross-correlation matrix for a dataframe\n",
     "# in one call - but be careful with the defaults, they are not appropriate for this data!."
    ]
@@ -416,7 +418,9 @@
    "source": [
     "# Plot relationships between numerical features using a scatter plot with Matplotlib. Be sure to label axes \n",
     "# and/or plot and pick appropriate scales. Adding a best fit line can be nice, but is not super\n",
-    "# important for this data. Question: why not? Related: why am I suggesting non-parametric rank\n",
+    "# important for this data. \n",
+    "# \n",
+    "# Question: why not? Related: why am I suggesting non-parametric rank\n",
     "# based correlation coefficients above?"
    ]
   },
```
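The comment reformatted in the hunk above points at `scipy.stats.spearmanr`, `scipy.stats.kendalltau`, and Pandas `df.corr()`, and warns about the Pandas defaults: `df.corr()` computes Pearson correlations unless told otherwise. A short sketch of the three calls, where `num_df` is a hypothetical frame standing in for the numerical AirBnB columns (values made up):

```python
# Sketch: rank-based correlation between numerical features.
import pandas as pd
from scipy import stats

num_df = pd.DataFrame({
    "price": [49, 80, 120, 200, 650, 95],
    "minimum_nights": [1, 2, 3, 30, 2, 5],
    "number_of_reviews": [74, 9, 45, 0, 3, 118],
})

# Pairwise tests: each returns (coefficient, p-value)
rho, p_rho = stats.spearmanr(num_df["price"], num_df["number_of_reviews"])
tau, p_tau = stats.kendalltau(num_df["price"], num_df["number_of_reviews"])

# Full matrix in one call. df.corr() defaults to method="pearson",
# which assumes linear relationships -- pass the method explicitly.
corr_matrix = num_df.corr(method="spearman")

print(f"spearman rho={rho:.2f} (p={p_rho:.2f}), kendall tau={tau:.2f}")
print(corr_matrix.round(2))
```

Rank-based coefficients assume only a monotonic relationship and are robust to the outliers and skew typical of this data, which is why the notebook steers students toward them.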
```diff
@@ -511,7 +515,7 @@
    "source": [
     "### 4.2. Feature scaling\n",
     "\n",
-    "Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation will improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
+    "Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation may improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
    ]
   },
  {
```
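The hunk above softens the claim ("may improve" rather than "will improve") but still points at the same tool. A minimal sketch of a Box-Cox transform via `sklearn.preprocessing.PowerTransformer`, on synthetic right-skewed data; only the API usage is the point:

```python
# Sketch: Box-Cox transform of a right-skewed feature with PowerTransformer.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=4.5, sigma=0.8, size=(1000, 1))  # synthetic prices

# Box-Cox requires strictly positive inputs; the default method,
# "yeo-johnson", also accepts zeros and negative values.
pt = PowerTransformer(method="box-cox", standardize=True)
prices_bc = pt.fit_transform(prices)

print("skew before:", round(float(skew(prices.ravel())), 2))
print("skew after: ", round(float(skew(prices_bc.ravel())), 2))
print("fitted lambda:", pt.lambdas_)
```

For features that can be zero, such as review counts, the Yeo-Johnson method is the usual fallback; in a modeling pipeline the transformer should be fit on training data only to avoid leakage.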

notebooks/full_solution/01_distributions.ipynb

Lines changed: 73 additions & 73 deletions
Large diffs are not rendered by default.
