Commit 88528e6

Author: Sandesh
Commit message: docs: add comprehensive introduction to Pandas library for Python
1 parent 2e738ad commit 88528e6

1 file changed: +320 -0 lines changed

docs/python/pandas-introduction.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
---
id: pandas-introduction
title: Introduction to Pandas
sidebar_label: Pandas
description: Learn the basics of the Pandas Python library, including Series, DataFrame, data input/output, and basic data analysis, to kickstart your ML/DS workflow.
sidebar_position: 20
tags:
  [
    Python,
    Pandas,
    Data Analysis,
    DataFrame,
    Series,
    Python Library,
    Machine Learning,
    Data Science,
    Python Basics
  ]
slug: /python/pandas-introduction
---

# Introduction to Pandas

Pandas is one of the most widely used libraries in the Python data ecosystem.
It provides high-level data structures and tools for fast, flexible data manipulation and analysis, along with basic plotting.

If you're working in **data science**, **machine learning**, or **analytics**, Pandas is your foundation for cleaning, transforming, and understanding data.
It is built on top of **NumPy** and works well with libraries such as **Matplotlib**, **Seaborn**, and **Scikit-learn**.

---

## 1. Why Pandas?

Working with raw data in plain Python means juggling lists, dictionaries, and loops.
Pandas simplifies this with *two core data structures*: the **Series**, which behaves like a single labeled column, and the **DataFrame**, which behaves like a spreadsheet or SQL table.

Some reasons Pandas is so popular:

- Handles in-memory datasets efficiently through vectorized operations.
- Provides built-in methods for aggregation, cleaning, and reshaping.
- Reads and writes data from multiple sources such as CSV, Excel, JSON, and SQL.
- Integrates with visualization and machine learning libraries.

---

## 2. Installation

If Pandas isn’t already installed, you can add it via pip:

```bash
pip install pandas
```

The Anaconda distribution includes Pandas by default. In other conda environments (for example, Miniconda), install it with:

```bash
conda install pandas
```

---

## 3. Core Data Structures

### Series

A **Series** is a one-dimensional labeled array. You can think of it as a single column in a spreadsheet.

```python
import pandas as pd

# Create a simple Series
s = pd.Series([100, 200, 300, 400])
print(s)
```

**Output:**
```
0    100
1    200
2    300
3    400
dtype: int64
```

Each element has an **index** (on the left) and a **value** (on the right).
You can assign your own custom index too:

```python
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
print(s['B'])  # Accessing by label → 20
```
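
As a minimal sketch of how labels and positions differ, the snippet below reuses the custom-index Series from above. The variable names (`by_label`, `by_position`, `doubled`) are illustrative only. It also shows that arithmetic on a Series applies to every element and keeps the index.

```python
import pandas as pd

# Same Series as above: three values with custom string labels
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

by_label = s.loc['B']      # label-based lookup
by_position = s.iloc[1]    # position-based lookup (second element)

# Arithmetic is applied element by element and the index is preserved
doubled = s * 2

print(by_label, by_position)
print(doubled)
```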

---

### DataFrame

A **DataFrame** is a two-dimensional labeled data structure, essentially a table with rows and columns.

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata']
}

df = pd.DataFrame(data)
print(df)
```

**Output:**
```
      Name  Age     City
0    Alice   25    Delhi
1      Bob   30   Mumbai
2  Charlie   35  Chennai
3    David   40  Kolkata
```

Each column in a DataFrame is actually a Series.
You can access columns and rows individually:

```python
df['Name']           # Access a column (returns a Series)
df[['Name', 'Age']]  # Access multiple columns (returns a DataFrame)
df.loc[2]            # Access a row by index label
df.iloc[0]           # Access a row by integer position
```
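
With the default integer index, labels and positions happen to coincide, so the difference between `.loc` and `.iloc` is easy to miss. The sketch below makes it visible by using `Name` as the row index; that choice, and the variable names, are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

# A single column is a Series; a list of columns gives a DataFrame
name_column = df['Name']
print(type(name_column).__name__)
print(type(df[['Name', 'Age']]).__name__)

# Illustrative assumption: use 'Name' as the row index
by_name = df.set_index('Name')
row_by_label = by_name.loc['Charlie']   # looked up by the label 'Charlie'
row_by_position = by_name.iloc[0]       # first row, whatever its label is

print(row_by_label['Age'], row_by_position['City'])
```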

---

## 4. Reading and Writing Data (Input/Output)

One of Pandas’ greatest strengths is its ability to load and save data in many formats.
Here are some commonly used functions:

| Format | Read | Write |
|:--------|:------|:------|
| CSV | `pd.read_csv()` | `DataFrame.to_csv()` |
| Excel | `pd.read_excel()` | `DataFrame.to_excel()` |
| JSON | `pd.read_json()` | `DataFrame.to_json()` |
| SQL | `pd.read_sql()` | `DataFrame.to_sql()` |

### Example: CSV Files

```python
# Reading from a CSV file
df = pd.read_csv('employees.csv')

# Writing to a CSV file (index=False leaves out the row index column)
df.to_csv('employees_cleaned.csv', index=False)
```

By default, Pandas assumes that the first row of your CSV file contains column names.
If the file has no header row, pass `header=None` to get default integer column names, or `names=[...]` to supply your own.
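
The following sketch illustrates those two parameters. To stay self-contained it reads from an in-memory string through `io.StringIO` rather than a file on disk; the CSV content is made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
raw = "Alice,25,Delhi\nBob,30,Mumbai\n"

# header=None: Pandas assigns integer column names 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: supply your own column names for the same data
named = pd.read_csv(io.StringIO(raw), names=['Name', 'Age', 'City'])

print(list(no_header.columns))
print(list(named.columns))
print(len(named))  # both rows are kept as data
```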

---

## 5. Basic Data Exploration

Once your data is loaded into a DataFrame, Pandas provides a variety of methods for quick exploration.

```python
df.head()      # Displays the first 5 rows (by default)
df.tail()      # Displays the last 5 rows (by default)
df.shape       # Returns (rows, columns)
df.columns     # Lists all column names
df.dtypes      # Shows data types for each column
df.info()      # Summary: column names, types, non-null counts, memory usage
df.describe()  # Statistical summary of numeric columns
```

**Example** (using the sample DataFrame from Section 3):
```python
print(df.describe())
```

**Output:**
```
             Age
count   4.000000
mean   32.500000
std     6.454972
min    25.000000
25%    28.750000
50%    32.500000
75%    36.250000
max    40.000000
```
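
The output above covers only `Age` because `describe()` summarizes numeric columns by default. As a rough sketch, passing `include='all'` also summarizes the text columns, adding rows such as `unique`, `top`, and `freq`. The variable name `summary` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

# include='all' summarizes every column, not only the numeric ones
summary = df.describe(include='all')

print(summary.index.tolist())
print(summary.loc['unique', 'City'])  # number of distinct cities
```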

---

## 6. Data Selection and Filtering

Pandas allows flexible data filtering using both labels and conditions.

```python
# Select a single column
df['Age']

# Select multiple columns
df[['Name', 'City']]

# Conditional filtering
df[df['Age'] > 30]

# Combining multiple conditions (wrap each one in parentheses; & means AND)
df[(df['Age'] > 25) & (df['City'] == 'Delhi')]
```
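
Conditional filtering works because a comparison such as `df['Age'] > 30` produces a boolean Series, and indexing with it keeps only the rows marked `True`. Here is a small sketch on the sample data; it also shows `|` (OR), `~` (NOT), and `isin()`, with variable names chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

# The comparison yields one True/False value per row
mask = df['Age'] > 30
print(mask.tolist())

older = df[mask]                                            # rows where mask is True
either = df[(df['Age'] < 30) | (df['City'] == 'Kolkata')]   # OR
not_delhi = df[~(df['City'] == 'Delhi')]                    # NOT
in_cities = df[df['City'].isin(['Delhi', 'Mumbai'])]        # membership test

print(older['Name'].tolist())
print(either['Name'].tolist())
```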

You can also use `.loc[]` for label-based selection or `.iloc[]` for position-based selection:

```python
df.loc[1:3, ['Name', 'City']]  # Rows labeled 1 through 3 (end label is included)
df.iloc[0:2, 0:2]              # First two rows and columns (end position is excluded)
```
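
One detail that often surprises newcomers is that `.loc` slices include the end label while `.iloc` slices exclude the end position, like ordinary Python slicing. A quick check on the sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

by_label = df.loc[1:3, ['Name', 'City']]   # labels 1, 2 and 3
by_position = df.iloc[0:2, 0:2]            # positions 0 and 1 only

print(len(by_label), len(by_position))
print(list(by_position.columns))
```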

---

## 7. Data Cleaning Basics

Real-world data is messy. Pandas provides tools for the most common cleaning tasks.

### Handling Missing Values

These methods return new objects and leave `df` unchanged, so assign the result back (for example, `df = df.dropna()`) to keep it.

```python
df.isnull()   # Boolean DataFrame: True where a value is missing
df.dropna()   # Copy without the rows that contain missing values
df.fillna(0)  # Copy with missing values replaced by a placeholder
```
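
The sample DataFrame has no gaps, so here is a minimal sketch on a small made-up frame with one missing age. Filling with the column mean is only one possible strategy, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
})

missing_per_column = df.isnull().sum()          # count of missing values per column
dropped = df.dropna()                           # Bob's row is removed
filled = df.fillna({'Age': df['Age'].mean()})   # mean() skips NaN by default

print(missing_per_column['Age'])
print(len(dropped))
print(filled['Age'].tolist())
```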

### Renaming Columns

```python
# rename() returns a new DataFrame; assign it back to keep the change
df = df.rename(columns={'Name': 'Employee_Name'})
```

### Changing Data Types

```python
# Convert the integer Age column to floating point
df['Age'] = df['Age'].astype(float)
```
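
`astype()` expects every value to be convertible and will typically raise an error otherwise. For columns that may contain stray text, one common alternative is `pd.to_numeric()` with `errors='coerce'`, which turns unparseable entries into `NaN`. The sketch below uses made-up values.

```python
import pandas as pd

# Hypothetical column read as text, with one unparseable entry
ages = pd.Series(['25', '30', 'unknown'])

# errors='coerce' replaces values that cannot be parsed with NaN
numeric = pd.to_numeric(ages, errors='coerce')

print(numeric.dtype)
print(numeric.isna().tolist())
```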

---

## 8. Sorting and Grouping

Sorting your data:
```python
df.sort_values(by='Age', ascending=False)
```

Grouping (e.g., aggregating data by a category):
```python
grouped = df.groupby('City')['Age'].mean()
print(grouped)
```

**Output:**
```
City
Chennai    35.0
Delhi      25.0
Kolkata    40.0
Mumbai     30.0
Name: Age, dtype: float64
```
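
In the sample data every city appears once, so each group mean is just that person's age. Grouping becomes more informative when categories repeat. The sketch below uses made-up data with two people per city and computes several statistics at once with `agg()`.

```python
import pandas as pd

# Hypothetical data in which each city appears twice
df = pd.DataFrame({
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai'],
    'Age': [25, 35, 30, 40],
})

# One row per city, one column per aggregation
summary = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```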

---

## 9. Basic Data Analysis

Let’s see some quick examples of what you can do once your data is cleaned:

```python
# Mean age
df['Age'].mean()

# Count how many from each city
df['City'].value_counts()

# Filter and sort together
df[df['Age'] > 30].sort_values(by='Age', ascending=False)
```

---

## 10. Visualizing Data with Pandas

Pandas integrates with **Matplotlib**, allowing quick visualization directly from your DataFrame.

```python
import matplotlib.pyplot as plt  # Matplotlib must be installed separately

# One bar per row, with the row index on the x-axis
df['Age'].plot(kind='bar', title='Age by Row')
plt.xlabel('Index')
plt.ylabel('Age')
plt.show()
```

For more advanced visualizations, you can use libraries like Seaborn or Plotly with your Pandas data.

---

## 11. Summary

Pandas provides a clean, efficient interface for everything from data cleaning to basic analysis.
It’s one of the first libraries a data professional should learn because it underpins many ML and data science workflows in Python.

**Next Steps:**
- Explore advanced Pandas operations (merging, reshaping, pivoting)
- Learn how Pandas integrates with NumPy and visualization libraries
- Try using Pandas in a small data project, such as analyzing a CSV dataset from Kaggle

---

**References:**
- [Official Pandas Documentation](https://pandas.pydata.org/)
- [10 Minutes to Pandas (Official Guide)](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
