| 1 | +--- |
| 2 | +id: pandas-introduction |
| 3 | +title: Introduction to Pandas |
| 4 | +sidebar_label: Pandas |
| 5 | +description: Learn the basics of the Pandas Python library, including Series, DataFrame, data input/output, and basic data analysis, to kickstart your ML/DS workflow. |
| 6 | +sidebar_position: 20 |
| 7 | +tags: |
| 8 | + [ |
| 9 | + Python, |
| 10 | + Pandas, |
| 11 | + Data Analysis, |
| 12 | + DataFrame, |
| 13 | + Series, |
| 14 | + Python Library, |
| 15 | + Machine Learning, |
| 16 | + Data Science, |
| 17 | + Python Basics |
| 18 | + ] |
| 19 | +slug: /python/pandas-introduction |
| 20 | +--- |
| 21 | + |
| 22 | + |
| 23 | +# Introduction to Pandas |
| 24 | + |
| 25 | +Pandas is one of the most essential libraries in the Python data ecosystem. |
| 26 | +It provides rich, high-level data structures and tools designed for fast and flexible data manipulation, analysis, and visualization. |
| 27 | + |
| 28 | +If you're working in **data science**, **machine learning**, or **analytics**, Pandas is your foundation for cleaning, transforming, and understanding data. |
| 29 | +It is built on top of **NumPy** and integrates with other libraries such as **Matplotlib**, **Seaborn**, and **Scikit-learn**. |
| 30 | + |
| 31 | +--- |
| 32 | + |
| 33 | +## 1. Why Pandas? |
| 34 | + |
| 35 | +Working with raw data in Python used to mean juggling lists, dictionaries, and loops. |
| 36 | +Pandas simplifies this with *two core data structures*: the **Series**, which behaves like a single spreadsheet column, and the **DataFrame**, which behaves like a spreadsheet or SQL table. |
| 37 | + |
| 38 | +Some reasons Pandas is so popular: |
| 39 | + |
| 40 | +- Handles datasets that fit in memory efficiently, using vectorized operations built on NumPy. |
| 41 | +- Provides built-in methods for aggregation, cleaning, and reshaping. |
| 42 | +- Easily reads and writes data from multiple sources like CSV, Excel, JSON, and SQL. |
| 43 | +- Integrates tightly with visualization and machine learning libraries. |
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +## 2. Installation |
| 48 | + |
| 49 | +If Pandas isn’t already installed, you can add it via pip: |
| 50 | + |
| 51 | +```bash |
| 52 | +pip install pandas |
| 53 | +``` |
| 54 | + |
| 55 | +You can also install it with Anaconda (which includes Pandas by default): |
| 56 | + |
| 57 | +```bash |
| 58 | +conda install pandas |
| 59 | +``` |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## 3. Core Data Structures |
| 64 | + |
| 65 | +### Series |
| 66 | + |
| 67 | +A **Series** is a one-dimensional labeled array. You can think of it as a single column in a spreadsheet. |
| 68 | + |
| 69 | +```python |
| 70 | +import pandas as pd |
| 71 | + |
| 72 | +# Create a simple Series |
| 73 | +s = pd.Series([100, 200, 300, 400]) |
| 74 | +print(s) |
| 75 | +``` |
| 76 | + |
| 77 | +**Output:** |
| 78 | +``` |
| 79 | +0    100 |
| 80 | +1    200 |
| 81 | +2    300 |
| 82 | +3    400 |
| 83 | +dtype: int64 |
| 84 | +``` |
| 85 | + |
| 86 | +Each element has an **index** (on the left) and a **value** (on the right). |
| 87 | +You can assign your own custom index too: |
| 88 | + |
| 89 | +```python |
| 90 | +s = pd.Series([10, 20, 30], index=['A', 'B', 'C']) |
| 91 | +print(s['B']) # Accessing by label → 20 |
| 92 | +``` |
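To make the index/value pairing concrete, here is a minimal sketch. The `prices` data and its labels are made up for illustration; the point is only that labels travel with the values through lookups and arithmetic.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index, numbers become the values
prices = pd.Series({'apple': 3.0, 'banana': 1.5, 'cherry': 4.0})

print(list(prices.index))    # the labels
print(prices['banana'])      # look up a value by its label

# Arithmetic is applied element by element, and the index is preserved
doubled = prices * 2
print(doubled['cherry'])
```

Because the result keeps the same index, you can keep addressing values by label after a computation.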
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +### DataFrame |
| 97 | + |
| 98 | +A **DataFrame** is a two-dimensional labeled data structure — essentially a table with rows and columns. |
| 99 | + |
| 100 | +```python |
| 101 | +data = { |
| 102 | +    'Name': ['Alice', 'Bob', 'Charlie', 'David'], |
| 103 | +    'Age': [25, 30, 35, 40], |
| 104 | +    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'] |
| 105 | +} |
| 106 | + |
| 107 | +df = pd.DataFrame(data) |
| 108 | +print(df) |
| 109 | +``` |
| 110 | + |
| 111 | +**Output:** |
| 112 | +``` |
| 113 | +      Name  Age     City |
| 114 | +0    Alice   25    Delhi |
| 115 | +1      Bob   30   Mumbai |
| 116 | +2  Charlie   35  Chennai |
| 117 | +3    David   40  Kolkata |
| 118 | +``` |
| 119 | + |
| 120 | +Each column in a DataFrame is actually a Series. |
| 121 | +You can access them individually: |
| 122 | + |
| 123 | +```python |
| 124 | +df['Name'] # Access a column |
| 125 | +df[['Name', 'Age']] # Access multiple columns |
| 126 | +df.loc[2] # Access a row by label |
| 127 | +df.iloc[0] # Access a row by position |
| 128 | +``` |
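With the default integer index, `.loc` and `.iloc` can look interchangeable. The sketch below uses a custom string index (an assumption made purely for illustration) so the difference between label-based and position-based access is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],   # custom row labels, chosen for illustration
)

ages = df['Age']
print(type(ages).__name__)        # a single column is a Series

row_by_label = df.loc['b']        # label-based: the row labeled 'b'
row_by_position = df.iloc[1]      # position-based: the second row
print(row_by_label['Name'], row_by_position['Name'])
```

Here both lookups return the same row, but only because `'b'` happens to be the label of the second row.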
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +## 4. Reading and Writing Data (Input/Output) |
| 133 | + |
| 134 | +One of Pandas’ greatest strengths is its ability to easily load data from many file formats. |
| 135 | +Here are some commonly used functions: |
| 136 | + |
| 137 | +| Format | Read | Write | |
| 138 | +|:--------|:------|:------| |
| 139 | +| CSV | `pd.read_csv()` | `DataFrame.to_csv()` | |
| 140 | +| Excel | `pd.read_excel()` | `DataFrame.to_excel()` | |
| 141 | +| JSON | `pd.read_json()` | `DataFrame.to_json()` | |
| 142 | +| SQL | `pd.read_sql()` | `DataFrame.to_sql()` | |
| 143 | + |
| 144 | +### Example: CSV Files |
| 145 | + |
| 146 | +```python |
| 147 | +# Reading from a CSV file |
| 148 | +df = pd.read_csv('employees.csv') |
| 149 | + |
| 150 | +# Writing to a CSV file |
| 151 | +df.to_csv('employees_cleaned.csv', index=False) |
| 152 | +``` |
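If you do not have a CSV file at hand, the same round trip can be tried in memory. This is a sketch with made-up data; it relies on `to_csv` returning the CSV text when no path is given, and on `io.StringIO` standing in for a file.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back as if it were a file
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.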
| 153 | + |
| 154 | +By default, Pandas assumes that the first row of your CSV file contains column names. |
| 155 | +You can customize this behavior with parameters like `header=None` or `names=[...]`. |
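As a rough illustration of those two parameters, the sketch below reads a headerless CSV. The file contents are made up and supplied through `io.StringIO` so the example runs without a file on disk.

```python
import io
import pandas as pd

# Stand-in for a headerless file on disk; the contents are made up
raw = io.StringIO("Alice,25,Delhi\nBob,30,Mumbai\n")

# header=None tells Pandas the first row is data; names supplies the column labels
df = pd.read_csv(raw, header=None, names=['Name', 'Age', 'City'])
print(df)
```

Without `header=None`, the first data row would be consumed as column names.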
| 156 | + |
| 157 | +--- |
| 158 | + |
| 159 | +## 5. Basic Data Exploration |
| 160 | + |
| 161 | +Once your data is loaded into a DataFrame, Pandas provides a variety of methods for quick exploration. |
| 162 | + |
| 163 | +```python |
| 164 | +df.head() # Displays the first 5 rows |
| 165 | +df.tail() # Displays the last 5 rows |
| 166 | +df.shape # Returns (rows, columns) |
| 167 | +df.columns # Lists all column names |
| 168 | +df.dtypes # Shows data types for each column |
| 169 | +df.info() # Summary: column names, dtypes, non-null counts, memory usage |
| 170 | +df.describe() # Statistical summary of numeric columns |
| 171 | +``` |
| 172 | + |
| 173 | +**Example:** |
| 174 | +```python |
| 175 | +print(df.describe()) |
| 176 | +``` |
| 177 | + |
| 178 | +**Output:** |
| 179 | +``` |
| 180 | +             Age |
| 181 | +count   4.000000 |
| 182 | +mean   32.500000 |
| 183 | +std     6.454972 |
| 184 | +min    25.000000 |
| 185 | +25%    28.750000 |
| 186 | +50%    32.500000 |
| 187 | +75%    36.250000 |
| 188 | +max    40.000000 |
| 189 | +``` |
| 190 | + |
| 191 | +--- |
| 192 | + |
| 193 | +## 6. Data Selection and Filtering |
| 194 | + |
| 195 | +Pandas allows flexible data filtering using both labels and conditions. |
| 196 | + |
| 197 | +```python |
| 198 | +# Select a single column |
| 199 | +df['Age'] |
| 200 | + |
| 201 | +# Select multiple columns |
| 202 | +df[['Name', 'City']] |
| 203 | + |
| 204 | +# Conditional filtering |
| 205 | +df[df['Age'] > 30] |
| 206 | + |
| 207 | +# Combining multiple conditions |
| 208 | +df[(df['Age'] > 25) & (df['City'] == 'Delhi')] |
| 209 | +``` |
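Conditional filtering works by building a boolean Series (a mask) and using it to pick rows. The sketch below rebuilds the sample table from earlier and shows the mask itself, plus the `|` (or), `~` (not), and `isin` variants. Each condition is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparisons in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata'],
})

mask = df['Age'] > 30             # a boolean Series, one entry per row
print(mask.tolist())

older_or_delhi = df[(df['Age'] > 30) | (df['City'] == 'Delhi')]   # OR
not_mumbai = df[~(df['City'] == 'Mumbai')]                         # NOT
in_two_cities = df[df['City'].isin(['Delhi', 'Kolkata'])]          # membership
print(in_two_cities)
```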
| 210 | + |
| 211 | +You can also use `.loc[]` for label-based selection or `.iloc[]` for position-based selection: |
| 212 | + |
| 213 | +```python |
| 214 | +df.loc[1:3, ['Name', 'City']] |
| 215 | +df.iloc[0:2, 0:2] |
| 216 | +``` |
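One detail that often surprises newcomers is that `.loc` slices include the end label, while `.iloc` slices exclude the end position, as ordinary Python slices do. A minimal sketch, using a cut-down version of the sample table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

by_label = df.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))
```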
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## 7. Data Cleaning Basics |
| 221 | + |
| 222 | +Real-world data is often messy. Pandas provides built-in methods for the most common cleaning tasks. |
| 223 | + |
| 224 | +### Handling Missing Values |
| 225 | + |
| 226 | +```python |
| 227 | +df.isnull() # Boolean mask: True where a value is missing |
| 228 | +df.dropna() # Returns a new DataFrame without rows that contain missing values |
| 229 | +df.fillna(0) # Returns a new DataFrame with missing values replaced by 0 |
| 230 | +``` |
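The following sketch runs these three methods on a small made-up table with one missing age. Filling with the column mean is just one possible strategy, chosen here for illustration.

```python
import pandas as pd

# Hypothetical data with one missing age
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, float('nan'), 35],
})

missing_per_column = df.isnull().sum()          # count missing values in each column
dropped = df.dropna()                           # new DataFrame without the incomplete row
filled = df.fillna({'Age': df['Age'].mean()})   # fill Age with the column mean

print(missing_per_column['Age'], len(dropped), filled.loc[1, 'Age'])
```

Note that `df` itself is unchanged: both `dropna` and `fillna` return new objects, so assign the result if you want to keep it.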
| 231 | + |
| 232 | +### Renaming Columns |
| 233 | + |
| 234 | +```python |
| 235 | +df.rename(columns={'Name': 'Employee_Name'}, inplace=True) |
| 236 | +``` |
| 237 | + |
| 238 | +### Changing Data Types |
| 239 | + |
| 240 | +```python |
| 241 | +df['Age'] = df['Age'].astype(float) |
| 242 | +``` |
| 243 | + |
| 244 | +--- |
| 245 | + |
| 246 | +## 8. Sorting and Grouping |
| 247 | + |
| 248 | +Sorting your data: |
| 249 | +```python |
| 250 | +df.sort_values(by='Age', ascending=False) |
| 251 | +``` |
| 252 | + |
| 253 | +Grouping (e.g., aggregating data by a category): |
| 254 | +```python |
| 255 | +grouped = df.groupby('City')['Age'].mean() |
| 256 | +print(grouped) |
| 257 | +``` |
| 258 | + |
| 259 | +**Output:** |
| 260 | +``` |
| 261 | +City |
| 262 | +Chennai    35.0 |
| 263 | +Delhi      25.0 |
| 264 | +Kolkata    40.0 |
| 265 | +Mumbai     30.0 |
| 266 | +Name: Age, dtype: float64 |
| 267 | +``` |
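In the sample table every city appears once, so each group holds a single row. The sketch below uses made-up data with repeated cities and computes several aggregates at once with `agg`, which makes the effect of grouping easier to see.

```python
import pandas as pd

# Hypothetical data with repeated cities so each group has more than one row
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'Age': [25, 30, 35, 40, 30],
})

# One row per city, one column per aggregate
summary = df.groupby('City')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```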
| 268 | + |
| 269 | +--- |
| 270 | + |
| 271 | +## 9. Basic Data Analysis |
| 272 | + |
| 273 | +Let’s see some quick examples of what you can do once your data is cleaned: |
| 274 | + |
| 275 | +```python |
| 276 | +# Mean age |
| 277 | +df['Age'].mean() |
| 278 | + |
| 279 | +# Count how many from each city |
| 280 | +df['City'].value_counts() |
| 281 | + |
| 282 | +# Filter and sort together |
| 283 | +df[df['Age'] > 30].sort_values(by='Age', ascending=False) |
| 284 | +``` |
| 285 | + |
| 286 | +--- |
| 287 | + |
| 288 | +## 10. Visualizing Data with Pandas |
| 289 | + |
| 290 | +Pandas uses **Matplotlib** as its default plotting backend, so you can plot directly from a DataFrame once Matplotlib is installed. |
| 291 | + |
| 292 | +```python |
| 293 | +import matplotlib.pyplot as plt |
| 294 | + |
| 295 | +df['Age'].plot(kind='bar', title='Age by Row') |
| 296 | +plt.xlabel('Index') |
| 297 | +plt.ylabel('Age') |
| 298 | +plt.show() |
| 299 | +``` |
| 300 | + |
| 301 | +For more advanced visualizations, you can use libraries like Seaborn or Plotly with your Pandas data. |
| 302 | + |
| 303 | +--- |
| 304 | + |
| 305 | +## 11. Summary |
| 306 | + |
| 307 | +Pandas provides a clean, efficient interface for everything from data cleaning to basic analysis. |
| 308 | +It is usually one of the first libraries data professionals learn, because many ML and data science workflows in Python start by loading and preparing data with Pandas. |
| 309 | + |
| 310 | +**Next Steps:** |
| 311 | +- Explore advanced Pandas operations (merging, reshaping, pivoting) |
| 312 | +- Learn how Pandas integrates with NumPy and visualization libraries |
| 313 | +- Try using Pandas in a small data project — like analyzing a CSV dataset from Kaggle |
| 314 | + |
| 315 | +--- |
| 316 | + |
| 317 | +**References:** |
| 318 | +- [Official Pandas Documentation](https://pandas.pydata.org/) |
| 319 | +- [10 Minutes to Pandas (Official Guide)](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) |
| 320 | +- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) |