Commit b00b72f

Merge pull request #1103 from steam-bell-92/main
[Docs]: Python Pandas Library Added
2 parents 5defeef + 5b7a3ad commit b00b72f

5 files changed

+252
-0
lines changed


docs/Pandas/pd_data_analysis.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
1+
# Basic Data Analysis
2+
3+
Once your data is loaded into a DataFrame, Pandas offers simple, powerful methods for exploring and summarizing it, which is an early step in most data science workflows.
4+
5+
1. **Inspecting the Data**
6+
7+
Before performing any analysis, you must first understand the structure and quality of your dataset.
8+
This step helps identify data types, missing values, and potential anomalies.
9+
10+
|Method|Description|
|:-----|:----------|
|`df.head()`|Displays the first n rows (default 5) for a quick look at the data.|
|`df.tail()`|Displays the last n rows (default 5).|
|`df.info()`|Shows column data types, non-null counts, and memory usage.|
|`df.describe()`|Generates summary statistics for numeric columns.|
|`df.shape`|Returns a tuple (rows, columns).|
|`df.dtypes`|Displays data types of all columns.|
18+
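As a rough illustration of the inspection methods in the table above, the sketch below runs them on a small DataFrame. The column names and values are invented for the example and are not part of the original docs.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [25, 30, 22, None],   # None becomes NaN, so the column is float64
    "city": ["New York", "London", "Paris", "Berlin"],
})

first_rows = df.head(2)      # first 2 rows instead of the default 5
last_rows = df.tail(1)       # last row only
n_rows, n_cols = df.shape    # tuple of (rows, columns)
column_types = df.dtypes     # one dtype per column
summary = df.describe()      # statistics for the numeric column "age"

df.info()                    # prints dtypes, non-null counts, memory usage
print(first_rows)
print(summary)
```

Note that `df.shape` and `df.dtypes` are attributes, so they are read without parentheses, while the others are method calls.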
19+
20+
2. **Handling Missing Data (NaN)**
21+
22+
Real-world data often has missing or incomplete entries.
23+
Handling them correctly is essential to avoid biased or invalid results.
24+
25+
|Method|Description|
|:-----|:----------|
|`df.isnull().sum()`|Counts missing (NaN) values per column.|
|`df.dropna()`|Removes rows that contain at least one missing value.|
|`df.fillna(value)`|Fills missing values with a specific value.|
|`df.fillna(df.mean(numeric_only=True))`|Fills missing values in numeric columns with each column's mean.|
31+
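A minimal sketch of the missing-data methods listed above, using an invented DataFrame. `numeric_only=True` is passed to `df.mean()` so that the text column is skipped; recent Pandas versions raise an error if the mean of a text column is requested.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing entries
df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, 30.0, 20.0],
    "units": [1.0, 2.0, np.nan, 5.0],
})

missing_per_column = df.isnull().sum()               # NaN count per column
dropped = df.dropna()                                # keeps only complete rows
filled_zero = df.fillna(0)                           # every NaN becomes 0
filled_mean = df.fillna(df.mean(numeric_only=True))  # NaN becomes the column mean

print(missing_per_column)
print(filled_mean)
```

Each call returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned to a name to be kept.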
32+
33+
3. **Data Selection and Filtering**
34+
35+
Once the data is clean, you often need to focus on specific rows or columns to analyze relevant subsets.
36+
37+
|Method|Description|
|:-----|:----------|
|`df['col']`|Selects a single column (returns a Series).|
|`df[['col1','col2']]`|Selects multiple columns (returns a DataFrame).|
|`df.loc[row_labels, col_labels]`|Selects by label (rows and columns).|
|`df.iloc[row_index, col_index]`|Selects by integer position (0-based).|
|`df[df['col'] > value]`|Filters rows based on a condition.|
44+
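The difference between label-based and position-based selection is easiest to see with a non-default index. The sketch below uses invented row labels and data to show each selection style from the table above.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default 0, 1, 2
df = pd.DataFrame(
    {"age": [25, 30, 22], "city": ["New York", "London", "Paris"]},
    index=["alice", "bob", "charlie"],
)

ages = df["age"]                        # single column, a Series
subset = df[["age", "city"]]            # list of columns, a DataFrame
bob_city = df.loc["bob", "city"]        # by row label and column label
first_age = df.iloc[0, 0]               # by row position and column position
over_23 = df[df["age"] > 23]            # boolean filter on a condition

# Label slices include the end label; position slices exclude the end position
by_label = df.loc["alice":"bob"]        # 2 rows
by_position = df.iloc[0:1]              # 1 row

print(over_23)
```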
45+
46+
4. **Grouping and Aggregation**
47+
48+
After filtering, you often need to summarize or compare groups within your data.
49+
50+
|Method|Description|
|:-----|:----------|
|`df.groupby('col').agg(func)`|Groups rows by the values in `col`, then applies an aggregate function (e.g., `'mean'`, `'sum'`, `'count'`).|
|`df.describe()`|Generates descriptive statistics (mean, std, min, max, etc.) for numerical columns.|
|`df['col'].value_counts()`|Counts the frequency of unique values in a column.|
55+
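A small sketch of grouping and aggregation, assuming an invented sales table. It shows a single aggregate, several aggregates at once via `agg`, and `value_counts`.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Paris", "London"],
    "sales": [100, 200, 300, 400, 500],
})

# One aggregate per group
mean_sales = df.groupby("city")["sales"].mean()

# Several aggregates per group; each name becomes a column in the result
summary = df.groupby("city")["sales"].agg(["sum", "count"])

# Frequency of each distinct value in a column
city_counts = df["city"].value_counts()

print(summary)
```

Selecting the column (`["sales"]`) before aggregating keeps the result focused on the values of interest and avoids aggregating columns where the function does not apply.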
56+
57+
5. **Data Transformation & Cleaning**
58+
59+
Data transformation involves reshaping, reformatting, or correcting data to make it more consistent and analysis-ready.
60+
61+
|Method|Description|
|:-----|:----------|
|`df.rename(columns={'old':'new'})`|Renames columns.|
|`df.drop(columns=['col'])`|Removes one or more columns.|
|`df.replace(old, new)`|Replaces specific values.|
|`df['col'].astype('type')`|Returns the column converted to the given data type; `df.astype({'col': 'type'})` does the same within a DataFrame.|
|`df.sort_values(by='col')`|Sorts rows by column values.|
|`df.reset_index(drop=True)`|Resets the index to 0, 1, 2, ... and discards the old index.|
69+
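Because each of the methods above returns a new DataFrame rather than modifying the original, they can be chained. The following sketch applies them in sequence to an invented, deliberately messy table; the column names (`nm`, `tmp`) are hypothetical.

```python
import pandas as pd

# Hypothetical messy data: a cryptic column name, ages stored as text,
# an abbreviation to expand, and a scratch column to drop
df = pd.DataFrame({
    "nm": ["Alice", "Bob", "Charlie"],
    "age": ["25", "30", "22"],
    "city": ["NY", "London", "Paris"],
    "tmp": [0, 0, 0],
})

cleaned = (
    df.rename(columns={"nm": "name"})   # clearer column name
      .drop(columns=["tmp"])            # remove the scratch column
      .replace("NY", "New York")        # expand the abbreviation
      .astype({"age": "int64"})         # text -> integers
      .sort_values(by="age")            # youngest first
      .reset_index(drop=True)           # renumber rows 0, 1, 2
)

print(cleaned)
```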
70+
71+
6. **Quick Statistics**
72+
73+
Once the data is ready, you can compute summary statistics to get insights about its distribution and relationships.
74+
75+
|Method|Description|
|:-----|:----------|
|`df.mean(numeric_only=True)`|Computes the mean (average) of each numeric column.|
|`df.std(numeric_only=True)`|Computes the sample standard deviation of each numeric column.|
|`df.min()`|Returns the minimum value for each column.|
|`df.max()`|Returns the maximum value for each column.|
|`df.median(numeric_only=True)`|Computes the median (50th percentile) of each numeric column.|
|`df.corr(numeric_only=True)`|Computes pairwise correlation between numeric columns.|
83+
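A short sketch of the summary statistics above on an invented table that mixes text and numeric columns. It assumes a Pandas version where `numeric_only` is accepted by `corr` (1.5 or later); without that argument, recent versions raise an error when a text column is present.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * hours
})

means = df.mean(numeric_only=True)       # text column is skipped
medians = df.median(numeric_only=True)
stds = df.std(numeric_only=True)
corr = df.corr(numeric_only=True)        # correlation matrix of numeric columns

print(means)
print(corr)
```

Since `score` is an exact multiple of `hours` here, their correlation is 1 (up to floating-point rounding).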
No newline at end of file

docs/Pandas/pd_dataframes.md

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
# Key Data Structures: Series and DataFrame
2+
3+
Pandas introduces two primary data structures: the Series and the DataFrame. Understanding these is crucial, as they form the basis of nearly all operations in the library.
4+
5+
## The Series (1D)
6+
7+
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). You can think of a Series as a single column in a spreadsheet or a single vector in a dataset.
8+
9+
***Key components***:
10+
11+
**Data**: The actual values stored.
12+
13+
**Index** (Label): The labels used to access the data.
14+
15+
Creating a Series
16+
17+
```Python
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40]
s = pd.Series(data, name='Example_Series')
print(s)
```
25+
26+
Output:
27+
```Python
0    10    <-- Index (Default integer)
1    20
2    30
3    40
Name: Example_Series, dtype: int64
```
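The index does not have to be the default integers. As a sketch of how labels are used to access data, the example below builds a Series with string labels; the day names and temperatures are invented for illustration.

```python
import pandas as pd

# Hypothetical temperature readings labeled by day
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"], name="temp_c")

by_label = temps["tue"]        # access by index label
by_position = temps.iloc[0]    # access by integer position
warm = temps[temps > 20]       # boolean filter keeps the matching labels

print(warm)
```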
34+
35+
## The DataFrame (2D)
36+
37+
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It is the most common object you will work with in Pandas and is analogous to a complete spreadsheet or a table in a database.
38+
39+
***Key components***:
40+
41+
**Data**: The actual values arranged in rows and columns.
42+
43+
**Row Index**: Labels for each row.
44+
45+
**Column Index**: Labels for each column (the column names).
46+
47+
Creating a DataFrame
48+
The most common way to create a DataFrame is from a Python dictionary, where the keys become the column names.
49+
50+
```Python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

print(df)
```
62+
63+
Output:
64+
```Python
      Name  Age      City
0    Alice   25  New York    <-- Row Index
1      Bob   30    London
2  Charlie   22     Paris
^-- Column Names/Index
```
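To connect the two structures, the sketch below reuses the same example data and shows that each column of a DataFrame is a Series, and that the row and column labels are available as `df.index` and `df.columns`. Using `Name` as the row index is an illustrative choice, not something the original example does.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'London', 'Paris']
})

ages = df['Age']                  # a single column is a Series
row_labels = list(df.index)       # default integer row index
col_labels = list(df.columns)     # column names

# Illustrative only: promote a column to be the row index
by_name = df.set_index('Name')
bob_age = by_name.loc['Bob', 'Age']

print(by_name)
```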

docs/Pandas/pd_input_output.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# Data Input/Output
2+
3+
One of the greatest strengths of Pandas is its ability to effortlessly read data into and write data out of a DataFrame from various file formats. This is achieved primarily through the functions prefixed with `pd.read_` and the methods prefixed with `df.to_`.
4+
5+
## Reading Data into a DataFrame
6+
To load data into a Pandas DataFrame, you use the appropriate `pd.read_...()` function. The most common input format is CSV.
7+
8+
|Function|File Type|Example Usage|
|:-------|:--------|:------------|
|`pd.read_csv()`|Comma-Separated Values (Text files)|`df = pd.read_csv('data.csv')`|
|`pd.read_excel()`|Microsoft Excel files|`df = pd.read_excel('data.xlsx')`|
|`pd.read_json()`|JavaScript Object Notation|`df = pd.read_json('data.json')`|
|`pd.read_sql()`|SQL database tables|`df = pd.read_sql(query, connection)`|
14+
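The `query` and `connection` arguments to `pd.read_sql()` are easier to picture with a concrete database. The sketch below uses an in-memory SQLite database from Python's standard library; the table name and rows are invented for the example, and other databases would typically need a different connection object.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database with a small "sales" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.5), ("south", 80.0)],
)
conn.commit()

# The query result becomes a DataFrame, one column per selected field
query = "SELECT region, amount FROM sales ORDER BY amount DESC"
df = pd.read_sql(query, conn)
conn.close()

print(df)
```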
15+
16+
**Example**: Reading a CSV File
17+
18+
The `read_csv()` function is highly flexible, supporting parameters to handle delimiters, missing values, and specific column selection.
19+
20+
```Python
import pandas as pd

# Load data from a CSV file into a DataFrame
df_sales = pd.read_csv('sales_data.csv')
```
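As a sketch of the flexibility mentioned above, the example below passes `sep`, `na_values`, and `usecols` to `read_csv()`. To stay self-contained it reads from an in-memory string instead of a file; the column names, the `;` delimiter, and the `missing` placeholder are assumptions made up for the illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data standing in for a file on disk
raw = io.StringIO(
    "order_id;region;amount;notes\n"
    "1;north;100.5;ok\n"
    "2;south;missing;late\n"
    "3;north;75.0;ok\n"
)

df_sales = pd.read_csv(
    raw,
    sep=";",                                    # delimiter other than a comma
    na_values=["missing"],                      # treat this token as NaN
    usecols=["order_id", "region", "amount"],   # load only these columns
)

print(df_sales)
```

With a real file, the first argument would be the path (for example `'sales_data.csv'`) and the keyword arguments would stay the same.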
24+
25+
## Writing Data from a DataFrame
26+
27+
After you've cleaned, transformed, or analyzed your data, you'll use a `.to_...()` method on the DataFrame object to save the results.
28+
29+
|Method|File Type|Example Usage|
|:-----|:--------|:------------|
|`df.to_csv()`|Comma-Separated Values|`df.to_csv('cleaned_data.csv', index=False)`|
|`df.to_excel()`|Microsoft Excel files|`df.to_excel('analysis.xlsx', sheet_name='Summary')`|
|`df.to_json()`|JavaScript Object Notation|`df.to_json('data_output.json')`|
34+
35+
36+
**Example**: Writing to a CSV File
37+
38+
When writing to a CSV, it is usually best to pass `index=False` so that the DataFrame's row index (the 0, 1, 2, ... numbers) is not saved as an unnecessary extra column in the file. Keep the index only when it holds meaningful labels.
39+
40+
```Python
# index=False ensures the row index is NOT included in the file
df_sales.to_csv('processed_sales.csv', index=False)
```
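The effect of `index=False` can be seen without touching the disk: when `to_csv()` is called without a path, it returns the CSV text as a string. The sketch below compares the two outputs for a small invented DataFrame.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({"item": ["pen", "book"], "price": [1.5, 12.0]})

with_index = df.to_csv()                # no path given: returns a string
without_index = df.to_csv(index=False)

# The first line is the header row in both cases
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]

print(with_index)
print(without_index)
```

With the default settings the header starts with an empty field for the unnamed index column, and every data row starts with its index value.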

docs/Pandas/pd_intro.md

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
# Introduction to Pandas
2+
3+
## What is Pandas?
4+
Pandas is a powerful, open-source Python library essential for data analysis and data manipulation. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
5+
6+
At its core, Pandas is designed to make working with labeled and relational data (like data found in spreadsheets or SQL tables) both intuitive and fast. It is built on top of the NumPy library and is the standard tool used by data professionals for critical tasks such as:
7+
8+
- Data Cleaning: Handling missing data, filtering, and correcting errors.
9+
- Data Transformation: Grouping, merging, reshaping, and pivoting datasets.
10+
- Data Exploration: Calculating descriptive statistics and inspecting data structure.
11+
12+
### Installation and Setup 🛠️
13+
Pandas is not included in the standard Python library and must be installed separately.
14+
15+
1. **Installation**
16+
17+
Open your terminal or command prompt and run the following command:
18+
19+
```Bash
pip install pandas
```
22+
23+
If you are using the Anaconda distribution (common for data science), you can use conda instead:
24+
25+
```Bash
conda install pandas
```
28+
29+
2. **Importing & Verifying**
30+
31+
Once installed, you can begin using Pandas by importing it into your Python environment (script, Jupyter Notebook, etc.) using the widely accepted alias `pd`. It's also good practice to check the version you are using.
32+
33+
```Python
import pandas as pd

# Check the version of Pandas installed
print(pd.__version__)
```
39+
40+
### Foundation and Ecosystem
41+
42+
It's helpful for users to know that Pandas is deeply integrated with the wider Python data science ecosystem:
43+
44+
- Built on NumPy: Internally, Pandas relies heavily on the NumPy library for fast array-based computation, which is why operations applied to whole columns at once run quickly.
45+
- Data Visualization: Pandas data structures work seamlessly with popular visualization libraries like Matplotlib and Seaborn.
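As a small illustration of the NumPy connection, the sketch below converts a Series to a NumPy array and applies a NumPy function directly to a Series. The values are invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

arr = s.to_numpy()     # the underlying values as a NumPy array
roots = np.sqrt(s)     # NumPy functions accept a Series and keep its index
doubled = s * 2        # arithmetic applies to every element at once

print(roots)
```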

sidebars.ts

Lines changed: 11 additions & 0 deletions
@@ -116,6 +116,17 @@ const sidebars: SidebarsConfig = {
       },
     ],
   },
+  {
+    type: "category",
+    label: "Pandas",
+    className: "custom-sidebar-pandas",
+    items: [
+      "Pandas/pd_intro",
+      "Pandas/pd_dataframes",
+      "Pandas/pd_input_output",
+      "Pandas/pd_data_analysis",
+    ],
+  },
   {
     type: "category",
     label: "🗄️ SQL",
