A comprehensive tutorial on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This repository contains both assignment materials for bootcamp students and complete solution notebooks demonstrating professional data science workflows.
A comprehensive data science project focused on exploratory data analysis (EDA) and data preprocessing using real AirBnB NYC 2019 dataset. This project demonstrates essential data cleaning, exploration, and feature engineering techniques through practical exercises with real-world data.
-This project guides students through the complete data preprocessing pipeline using real-world AirBnB listing data from New York City. Students will learn essential data science skills including:
+## Project Overview

-- **Exploratory Data Analysis (EDA)**: Understanding data distributions, identifying patterns, and spotting anomalies
-- **Feature Relationships**: Analyzing correlations and interactions between different types of variables
-- **Data Cleaning**: Handling missing values, outliers, and data quality issues
-- **Feature Engineering**: Encoding categorical variables, scaling features, and creating synthetic features
-- **Statistical Testing**: Applying appropriate statistical methods for different data types
+This project analyzes **48,895 AirBnB listings** from New York City (2019) and provides hands-on experience with:

-## Repository Structure
+- Data loading and exploration
+- Statistical analysis and distribution analysis
+- Feature relationship investigation using appropriate statistical tests
+- Data cleaning and null value handling
+- Feature engineering and preprocessing
+- Advanced visualization techniques

-```
-├── notebooks/
-│ ├── MVP.ipynb # Assignment notebook for students
- Section 2: Investigate relationships between features
-- Section 3: Clean and preprocess the data
-- Section 4: Engineer new features
+3. **Clean and Preprocess Data**
+   - Select relevant features for modeling
+   - Handle missing values using various imputation strategies
+   - Address extreme values and outliers appropriately

-### Key Requirements:
+4. **Engineer Features**
+   - Apply one-hot encoding to categorical variables
+   - Transform skewed distributions using Box-Cox transformation
+   - Create polynomial features to capture non-linear relationships

-- **Appropriate statistical methods**: Use the right tests for different data types
-- **Clear visualizations**: Label axes, use appropriate scales, add titles
-- **Justified decisions**: Explain your choices for handling missing values and outliers
-- **Code documentation**: Comment your code and explain your reasoning

 ## Solution Reference

@@ -149,6 +134,7 @@ Each solution notebook includes:
 - Detailed interpretation of results
 - Best practices for data science workflows

+
 ## Key Technologies

 - **Python 3.8+**
@@ -159,21 +145,7 @@ Each solution notebook includes:
 - **SciPy**: Statistical testing
 - **Jupyter Notebook**: Interactive development environment

-## Tips for Success
-
-1. **Start Early**: This is a comprehensive assignment that requires time and thought
-2. **Read the Instructions**: The `instructions.md` file contains important hints and requirements
-3. **Understand the Data**: Spend time exploring the dataset before making decisions
-4. **Justify Your Choices**: Data cleaning decisions should be based on domain knowledge and analysis goals
-5. **Check the Solutions**: Use the solution notebooks as references, but try to solve problems independently first
-6. **Experiment**: Try different approaches and compare their effectiveness

 ## Contributing

-This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
-
----
-
-**Happy Learning!**
-
-This tutorial provides hands-on experience with real-world data science challenges. Take your time, experiment with different approaches, and don't be afraid to make mistakes - they're part of the learning process!
+This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
"# Pandas `df.corr()`is another option to calculate a full cross-correlation matrix for a dataframe\n",
 "# in one call - but be careful with the defaults, they are not appropriate for this data!."
 ]
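Reviewer note: the notebook comment above warns about the `df.corr()` defaults without naming them. The default `method` is `"pearson"`, which measures linear association; a rank-based alternative is `method="spearman"`. The sketch below uses synthetic data with assumed column names and is only one reading of the hint, not the notebook's solution. It also shows that Spearman coefficients are unchanged by a monotone transform such as a logarithm, while Pearson coefficients generally are not.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({
    "price": rng.lognormal(mean=4.7, sigma=0.8, size=n),
    "minimum_nights": rng.integers(1, 31, size=n),
    "number_of_reviews": rng.poisson(lam=20, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
})

# Keep numeric columns only; text columns cannot be correlated directly.
numeric = frame.select_dtypes("number")

pearson = numeric.corr()                     # default: method="pearson"
spearman = numeric.corr(method="spearman")   # rank-based alternative

# A log transform changes Pearson coefficients but leaves ranks, and
# therefore Spearman coefficients, untouched.
logged = numeric.assign(price=np.log(numeric["price"]))
spearman_logged = logged.corr(method="spearman")

print(spearman.round(3))
```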
@@ -416,7 +418,9 @@
 "source": [
 "# Plot relationships between numerical features using a scatter plot with Matplotlib. Be sure to label axes \n",
 "# and/or plot and pick appropriate scales. Adding a best fit line can be nice, but is not super\n",
-"# important for this data. Question: why not? Related: why am I suggesting non-parametric rank\n",
+"# important for this data. \n",
+"# \n",
+"# Question: why not? Related: why am I suggesting non-parametric rank\n",
 "# based correlation coefficients above?"
 ]
 },
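Reviewer note: a minimal sketch of the kind of plot the cell above asks for, using synthetic data. The feature pairing, the log scales, and the axis labels are assumptions chosen for illustration; the notebook leaves those choices to the student.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in values (made up for illustration).
rng = np.random.default_rng(1)
number_of_reviews = rng.poisson(lam=20, size=300) + 1   # +1 keeps values positive for a log axis
price = rng.lognormal(mean=4.7, sigma=0.8, size=300)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(number_of_reviews, price, s=10, alpha=0.4)

# Log scales spread out heavily right-skewed values so the bulk of the points is visible.
ax.set_xscale("log")
ax.set_yscale("log")

ax.set_xlabel("Number of reviews")
ax.set_ylabel("Price (USD per night)")
ax.set_title("Price vs. number of reviews (synthetic data)")
fig.tight_layout()
```

In a Jupyter notebook the figure renders when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)`.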
@@ -511,7 +515,7 @@
 "source": [
 "### 4.2. Feature scaling\n",
 "\n",
-"Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation will improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
+"Model performance can often be improved by transforming and/or scaling the features and labels, but this depends on the model type. The features and labels in this dataset are not normally distributed, so a Box-Cox transformation may improve the performance of many common model types (including linear and logistic regression). See [`sklearn.preprocessing.PowerTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)."
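Reviewer note: the Box-Cox step referenced in section 4.2 of the notebook can be sketched with `sklearn.preprocessing.PowerTransformer`. The data here is synthetic and the variable names are hypothetical. Box-Cox requires strictly positive inputs, so any zero or negative values in real data would need to be filtered out first, or `method="yeo-johnson"` used instead.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive, right-skewed feature (values are made up).
rng = np.random.default_rng(2)
price = rng.lognormal(mean=4.7, sigma=0.8, size=(1000, 1))

# standardize=True (the default) also rescales the output to zero mean and unit variance.
transformer = PowerTransformer(method="box-cox")
price_bc = transformer.fit_transform(price)

print("skewness before:", skew(price.ravel()))
print("skewness after: ", skew(price_bc.ravel()))
print("fitted lambda:  ", transformer.lambdas_)

# When a label is transformed, predictions can be mapped back to the original units.
restored = transformer.inverse_transform(price_bc)
```

Fitting the transformer on training data only and reusing it on the test set avoids leaking test-set statistics into the fitted lambda.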