-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathData exploration.txt
More file actions
64 lines (42 loc) · 1.73 KB
/
Data exploration.txt
File metadata and controls
64 lines (42 loc) · 1.73 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
Importing Libraries
To begin with data exploration, you will need to import the necessary libraries. The most commonly used libraries for data exploration are pandas and numpy. You can import these libraries using the following code:
import pandas as pd
import numpy as np
Loading Data
There are several ways to load data into a pandas DataFrame. Some common ways include:
Reading a CSV file:
df = pd.read_csv('data.csv')
Reading an Excel file:
df = pd.read_excel('data.xlsx')
Reading data from a database:
import sqlite3
# Connect to the database
conn = sqlite3.connect('database.db')
# Load the data into a DataFrame
df = pd.read_sql('SELECT * FROM data', conn)
Exploring Data
Once you have loaded the data into a pandas DataFrame, you can begin exploring the data. Here are some common tasks that you might perform during data exploration:
Viewing the first few rows of the DataFrame:
df.head()
Viewing the last few rows of the DataFrame:
df.tail()
Viewing the shape of the DataFrame (number of rows and columns):
df.shape
Viewing the data types of the columns:
# Viewing the summary statistics of the numerical columns
df.describe()
# Counting the number of unique values in a column
df['column'].nunique()
# Counting the number of occurrences of each value in a column
df['column'].value_counts()
# Finding the minimum and maximum values in a column
df['column'].min()
df['column'].max()
# Finding the mean, median, and standard deviation of a column
df['column'].mean()
df['column'].median()
df['column'].std()
# Plotting the distribution of a column
df['column'].plot(kind='hist')
# Plotting the relationship between two columns
df.plot(x='column_1', y='column_2', kind='scatter')