Data-science-cheatsheet/data science project.txt at main · Pratiikpy/Data-science-cheatsheet · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137

## Project: Predicting House Prices

### Problem statement:

You have been hired by a real estate company to develop a model that can predict the sale price of houses in a given city. The company has provided a dataset of past home sales, including various features such as the size of the house, the number of bedrooms and bathrooms, and the location of the property. Your task is to build a model that can learn from this data and accurately predict the sale price of a new house based on its features.

### Data exploration:

First, you will need to explore the dataset to understand the characteristics of the data and identify any potential issues or trends. You can use techniques such as summary statistics, visualizations, and correlation analysis to gain insights into the data.

```python
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('housedata.csv')

# Print the first few rows
print(df.head())

# Calculate summary statistics
print(df.describe())

# Visualize the distribution of the target variable (sale price)
plt.hist(df['sale_price'])
plt.show()

# Visualize the relationship between the size of the house and the sale price
plt.scatter(df['size'], df['sale_price'])
plt.xlabel('Size (square feet)')
plt.ylabel('Sale price')
plt.show()
Data manipulation:
Next, you will need to clean and prepare the data for modeling. This may involve handling missing values, scaling features, and encoding categorical variables. You can use techniques such as imputation, normalization, and one-hot encoding to ensure that the data is in a suitable format for modeling.


# Import libraries
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Handle missing values
df = df.fillna(df.mean())

# Scale numerical features
scaler = StandardScaler()
df[['size', 'bedrooms', 'bathrooms']] = scaler.fit_transform(df[['size', 'bedrooms', 'bathrooms']])

# Encode categorical features
encoder = OneHotEncoder()
df = pd.concat([df, pd.DataFrame(encoder.fit_transform(df[['location']]).toarray())], axis=1)
df = df.drop(['location'], axis=1)
Feature engineering:
To improve the predictive power of your model, you may want to create new features or transform existing ones. This can include combining multiple features into a single one, applying mathematical transformations, or extracting useful information from text data.


# Create a new feature that represents the total number of rooms in the house
df['total_rooms'] = df['bedrooms'] + df['bathrooms']

# Transform the size feature using a log transformation
df['size'] = np.log(df['size'])
Feature selection:

### Feature Selection:

In many cases, you may have a large number of features in your dataset, but not all of them will be useful for building a model. By selecting the most relevant features, you can improve the performance of your model and reduce the risk of overfitting.

There are several methods for selecting features, including:

- **Filter methods**: These methods use statistical measures to select the most relevant features. Examples include using Pearson's correlation coefficient or ANOVA F-value.

- **Wrapper methods**: These methods use a model to select the most relevant features. Examples include using recursive feature elimination or forward selection.

- **Embedded methods**: These methods are embedded within the model itself and select features as part of the model training process. Examples include using Lasso regularization or tree-based feature selection.

```python
# Import libraries
from sklearn.feature_selection import SelectKBest, f_regression

# Define the features and target
X = df.drop(['sale_price'], axis=1)
y = df['sale_price']

# Select the best features using f-regression
selector = SelectKBest(f_regression, k=10)
X_selected = selector.fit_transform(X, y)

# Print the selected features
selected_features = X.columns[selector.get_support()]
print(selected_features)

### Modeling:

Now that you have cleaned and prepared the data, you can build and train a model to predict house prices. You can use techniques such as linear regression, decision trees, or support vector machines to build the model. You can also use cross-validation to evaluate the performance of your model and tune its hyperparameters.

```python
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Define the features and target
X = df.drop(['sale_price'], axis=1)
y = df['sale_price']

# Build and train the model
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-validation scores: {scores}')

# Evaluate the model on the test set
model.fit(X, y)
test_score = model.score(X_test, y_test)
print(f'Test score: {test_score}')

### Evaluation:

Finally, you will need to evaluate the performance of your model and communicate your findings to stakeholders. This can involve calculating evaluation metrics such as mean absolute error or root mean squared error, as well as creating visualizations to illustrate the results.

```python
# Import libraries
from sklearn.metrics import mean_absolute_error

# Predict the sale prices of the test set
y_pred = model.predict(X_test)

# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean absolute error: {mae}')

# Visualize the predicted vs. actual sale prices
plt.scatter(y_test, y_pred)
plt.xlabel('Actual sale price')
plt.ylabel('Predicted sale price')
plt.show()
Putting it all together:
In this project, we have walked through the steps of a typical data science project, including data exploration, data manipulation, feature engineering, modeling, and evaluation. By applying the concepts and techniques covered in the Data Science Handbook, we were able