From 003cd0774b7856db64cf0adc8208e2201c2d31e7 Mon Sep 17 00:00:00 2001 From: steam_bell_92 Date: Tue, 28 Oct 2025 14:22:47 +0530 Subject: [PATCH 1/4] Introduced Pandas in docs --- docs/Pandas/pd_data_analysis.md | 90 +++++++++++++++++++++++++++++++++ docs/Pandas/pd_dataframes.md | 70 +++++++++++++++++++++++++ docs/Pandas/pd_input_output.md | 45 +++++++++++++++++ docs/Pandas/pd_intro.md | 48 ++++++++++++++++++ sidebars.ts | 11 ++++ 5 files changed, 264 insertions(+) create mode 100644 docs/Pandas/pd_data_analysis.md create mode 100644 docs/Pandas/pd_dataframes.md create mode 100644 docs/Pandas/pd_input_output.md create mode 100644 docs/Pandas/pd_intro.md diff --git a/docs/Pandas/pd_data_analysis.md b/docs/Pandas/pd_data_analysis.md new file mode 100644 index 00000000..6980ec1d --- /dev/null +++ b/docs/Pandas/pd_data_analysis.md @@ -0,0 +1,90 @@ +# Basic Data Analysis + +Once you have loaded your data into a DataFrame, Pandas offers simple and powerful methods for quickly exploring and summarizing your data, which is the core of any Data Science workflow. + +1. **Inspecting the Data** + +Before performing any analysis, you must first understand the structure and quality of your dataset. +This step helps identify data types, missing values, and potential anomalies. + +```markdown +|Method|Description| +|:-----|:----------| +|`df.head()`|Displays the first n rows (default 5) for a quick look at the data.| +|`df.tail()`|Displays the last n rows (default 5).| +|`df.info()`|Shows column data types, non-null counts, and memory usage.| +|`df.describe()`|Generates summary statistics for numeric columns.| +|`df.shape`|Returns a tuple (rows, columns).| +|`df.dtypes`|Displays data types of all columns.| + +``` + +2. **Handling Missing Data (NaN)** + +Real-world data often has missing or incomplete entries. +Handling them correctly is essential to avoid biased or invalid results. + +```markdown +|Method|Description| +|:-----|:----------| +|`df.isnull().sum()`|Counts missing (NaN) values per column.| +|`df.dropna()`|Removes rows with missing values.| +|`df.fillna(value)`|Fills missing values with a specific value.| +|`df.fillna(df.mean())`|Fills missing values with the mean (for numeric columns).| +``` + +3. **Data Selection and Filtering** + +Once the data is clean, you often need to focus on specific rows or columns to analyze relevant subsets. + +```markdown +|Method|Description| +|:-----|:----------| +|`df['col']`|Selects a single column (returns a Series).| +|`df[['col1','col2']]`|Selects multiple columns.| +|`df.loc[row_labels, col_labels]`|Selects by label (rows and columns).| +|`df.iloc[row_index, col_index]`|Selects by integer index position.| +|`df[df['col'] > value]`|Filters rows based on a condition.| +``` + +4. **Grouping and Aggregation** + +After filtering, you often need to summarize or compare groups within your data. + +```markdown +|Method|Description| +|:-----|:----------| +|`df.groupby('col').agg()`|Groups data by the specified column, then applies an aggregate function (e.g., `mean()`, `sum()`, `count()`).| +|`df.describe()`|Generates descriptive statistics (mean, std, min, max, etc.) for numerical columns.| +|`df['col'].value_counts()`|Counts the frequency of unique values in a column.| +``` + +5. **Data Transformation & Cleaning** + +Data transformation involves reshaping, reformatting, or correcting data to make it more consistent and analysis-ready. 
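+
+For example, here is a minimal, hypothetical sketch that chains a few of these operations on a small DataFrame (the column names are purely illustrative; the table below lists each method):
+
+```Python
+import pandas as pd
+
+# Hypothetical messy data: unhelpful column name, numbers stored as text
+df = pd.DataFrame({'old_name': ['b', 'a', 'c'], 'amount': ['3', '1', '2']})
+
+cleaned = (
+    df.rename(columns={'old_name': 'label'})  # clearer column name
+      .astype({'amount': 'int64'})            # convert text to integers
+      .sort_values(by='amount')               # order rows by value
+      .reset_index(drop=True)                 # rebuild a clean 0..n index
+)
+print(cleaned)
+```
+
+Each of these methods returns a new DataFrame by default, which is why the steps can be chained.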
+
+```markdown
+|Method|Description|
+|:-----|:----------|
+|`df.rename(columns={'old':'new'})`|Renames columns.|
+|`df.drop(columns=['col'])`|Removes one or more columns.|
+|`df.replace(old, new)`|Replaces specific values.|
+|`df.astype('type')`|Changes the data type of a column.|
+|`df.sort_values(by='col')`|Sorts rows by column values.|
+|`df.reset_index(drop=True)`|Resets the DataFrame index.|
+```
+
+***Quick Statistics***
+
+Once the data is ready, you can compute summary statistics to get insights about its distribution and relationships.
+
+```markdown
+|Method|Description|
+|:-----|:----------|
+|`df.rename(columns={'old':'new'})`|Renames columns.|
+|`df.drop(columns=['col'])`|Removes one or more columns.|
+|`df.replace(old, new)`|Replaces specific values.|
+|`df.astype('type')`|Changes the data type of a column.|
+|`df.sort_values(by='col')`|Sorts rows by column values.|
+|`df.reset_index(drop=True)`|Resets the DataFrame index.|
+```
\ No newline at end of file
diff --git a/docs/Pandas/pd_dataframes.md b/docs/Pandas/pd_dataframes.md
new file mode 100644
index 00000000..b99ae10a
--- /dev/null
+++ b/docs/Pandas/pd_dataframes.md
@@ -0,0 +1,70 @@
+## Key Data Structures: Series and DataFrame
+
+Pandas introduces two primary data structures: the Series and the DataFrame. Understanding these is crucial, as they form the basis of nearly all operations in the library.
+
+## The Series (1D)
+
+A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). You can think of a Series as a single column in a spreadsheet or a single vector in a dataset.
+
+***Key components***:
+
+**Data**: The actual values stored.
+
+**Index** (Label): The labels used to access the data.
+
+**Creating a Series**
+
+```Python
+import pandas as pd
+
+# Creating a Series from a list
+data = [10, 20, 30, 40]
+s = pd.Series(data, name='Example_Series')
+print(s)
+```
+
+Output:
+```Python
+0    10   <-- Index (Default integer)
+1    20
+2    30
+3    40
+Name: Example_Series, dtype: int64
+```
+
+## The DataFrame (2D)
+
+A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It is the most common object you will work with in Pandas and is analogous to a complete spreadsheet or a table in a database.
+
+***Key components***:
+
+**Data**: The actual values arranged in rows and columns.
+
+**Row Index**: Labels for each row.
+
+**Column Index**: Labels for each column (the column names).
+
+**Creating a DataFrame**
+
+The most common way to create a DataFrame is from a Python dictionary, where the keys become the column names.
+
+```Python
+# Creating a DataFrame from a dictionary
+data = {
+    'Name': ['Alice', 'Bob', 'Charlie'],
+    'Age': [25, 30, 22],
+    'City': ['New York', 'London', 'Paris']
+}
+
+df = pd.DataFrame(data)
+
+print(df)
+```
+
+Output:
+```Python
+      Name  Age      City
+0    Alice   25  New York   <-- Row Index
+1      Bob   30    London
+2  Charlie   22     Paris
+^-- Column Names/Index
+```
\ No newline at end of file
diff --git a/docs/Pandas/pd_input_output.md b/docs/Pandas/pd_input_output.md
new file mode 100644
index 00000000..2069df16
--- /dev/null
+++ b/docs/Pandas/pd_input_output.md
@@ -0,0 +1,45 @@
+# Data Input/Output (I/O)
+
+One of the greatest strengths of Pandas is its ability to effortlessly read data into and write data out of a DataFrame from various file formats. This is achieved primarily through the functions prefixed with `pd.read_` and the methods prefixed with `df.to_`.
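+
+As a quick, hypothetical sketch of that round trip (the file names here are placeholders), reading and writing typically bracket whatever analysis you do in between:
+
+```Python
+import pandas as pd
+
+# Hypothetical round trip: load, clean, save
+df = pd.read_csv('raw_data.csv')            # read the raw file
+df = df.dropna()                            # example clean-up step
+df.to_csv('clean_data.csv', index=False)    # write the result back out
+```
+
+The sections below look at each direction in more detail.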
+
+## Reading Data into a DataFrame
+To load data into a Pandas DataFrame, you use the appropriate `pd.read_...()` function. The most common input format is CSV.
+
+```
+|Function|File Type|Example Usage|
+|:-------|:--------|:------------|
+|`pd.read_csv()`|Comma-Separated Values (Text files)|`df = pd.read_csv('data.csv')`|
+|`pd.read_excel()`|Microsoft Excel files|`df = pd.read_excel('data.xlsx')`|
+|`pd.read_json()`|JavaScript Object Notation|`df = pd.read_json('data.json')`|
+|`pd.read_sql()`|SQL database tables|`df = pd.read_sql(query, connection)`|
+```
+
+**Example**: Reading a CSV File
+
+The `read_csv()` function is highly flexible, supporting parameters to handle delimiters, missing values, and specific column selection.
+
+```Python
+# Load data from a CSV file into a DataFrame
+df_sales = pd.read_csv('sales_data.csv')
+```
+
+## Writing Data from a DataFrame
+
+After you've cleaned, transformed, or analyzed your data, you'll use a `.to_...()` method on the DataFrame object to save the results.
+
+```
+|Method|File Type|Example Usage|
+|:-----|:--------|:------------|
+|`df.to_csv()`|Comma-Separated Values|`df.to_csv('cleaned_data.csv', index=False)`|
+|`df.to_excel()`|Microsoft Excel files|`df.to_excel('analysis.xlsx', sheet_name='Summary')`|
+|`df.to_json()`|JavaScript Object Notation|`df.to_json('data_output.json')`|
+```
+
+**Example**: Writing to a CSV File
+
+When writing to a CSV, it is best practice to use `index=False` to prevent the DataFrame's row indices (the 0, 1, 2, ... numbers) from being saved as an unnecessary extra column in the file.
+
+```Python
+# index=False ensures the row index is NOT included in the file
+df_sales.to_csv('processed_sales.csv', index=False)
+```
\ No newline at end of file
diff --git a/docs/Pandas/pd_intro.md b/docs/Pandas/pd_intro.md
new file mode 100644
index 00000000..f1f5c565
--- /dev/null
+++ b/docs/Pandas/pd_intro.md
@@ -0,0 +1,48 @@
+# Introduction to Pandas
+
+## What is Pandas?
+Pandas is a powerful, open-source Python library essential for data analysis and data manipulation. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
+
+At its core, Pandas is designed to make working with labeled and relational data (like data found in spreadsheets or SQL tables) both intuitive and fast. It is built on top of the NumPy library and is the standard tool used by data professionals for critical tasks such as:
+
+- Data Cleaning: Handling missing data, filtering, and correcting errors.
+
+- Data Transformation: Grouping, merging, reshaping, and pivoting datasets.
+
+- Data Exploration: Calculating descriptive statistics and inspecting data structure.
+
+### Installation and Setup 🛠️
+Pandas is not included in the standard Python library and must be installed separately.
+
+1. **Installation**
+
+Open your terminal or command prompt and run the following command:
+
+```Bash
+pip install pandas
+```
+
+If you are using the Anaconda distribution (common for data science), you can use the `conda` command instead:
+
+```Bash
+conda install pandas
+```
+
+2. **Importing & Verifying**
+
+Once installed, you can begin using Pandas by importing it into your Python environment (script, Jupyter Notebook, etc.) using the widely accepted alias `pd`. It's also good practice to check the version you are using.
+ +```Python +import pandas as pd + +# Check the version of Pandas installed +print(pd.__version__) +``` + +### Foundation and Ecosystem + +It's helpful for users to know that Pandas is deeply integrated with the wider Python data science ecosystem: + +- Built on NumPy: Internally, Pandas relies heavily on the NumPy library for fast array-based computation, which is why it performs complex operations so quickly. + +- Data Visualization: Pandas data structures work seamlessly with popular visualization libraries like Matplotlib and Seaborn. \ No newline at end of file diff --git a/sidebars.ts b/sidebars.ts index f7fb2ae4..a3b7254d 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -116,6 +116,17 @@ const sidebars: SidebarsConfig = { }, ], }, + { + type: "category", + Lable: "Pandas", + className: "custom-sidebar-pandas", + items: [ + "Pandas/pd_intro", + "Pandas/pd_dataframes", + "Pandas/pd_input_output", + "Pandas/pd_data_analysis", + ] + } { type: "category", label: "🗄️ SQL", From 2d44bba0c55158db8a79471f9293b75efe973549 Mon Sep 17 00:00:00 2001 From: steam_bell_92 Date: Tue, 28 Oct 2025 14:41:20 +0530 Subject: [PATCH 2/4] Corrected path/syntax in sidebars.ts --- sidebars.ts | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sidebars.ts b/sidebars.ts index a3b7254d..1e9b510c 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -118,15 +118,15 @@ const sidebars: SidebarsConfig = { }, { type: "category", - Lable: "Pandas", + label: "Pandas", className: "custom-sidebar-pandas", items: [ "Pandas/pd_intro", "Pandas/pd_dataframes", "Pandas/pd_input_output", "Pandas/pd_data_analysis", - ] - } + ], + }, { type: "category", label: "🗄️ SQL", From 39e7a0f291808ba665a172ee04ca39abd69e6ab2 Mon Sep 17 00:00:00 2001 From: steam_bell_92 Date: Tue, 28 Oct 2025 18:17:05 +0530 Subject: [PATCH 3/4] Corrects tables and indents --- docs/Pandas/pd_data_analysis.md | 18 +++++------------- docs/Pandas/pd_dataframes.md | 2 +- docs/Pandas/pd_input_output.md | 8 +++----- docs/Pandas/pd_intro.md | 3 --- 4 files changed, 9 insertions(+), 22 deletions(-) diff --git a/docs/Pandas/pd_data_analysis.md b/docs/Pandas/pd_data_analysis.md index 6980ec1d..41583e83 100644 --- a/docs/Pandas/pd_data_analysis.md +++ b/docs/Pandas/pd_data_analysis.md @@ -7,7 +7,6 @@ Once you have loaded your data into a DataFrame, Pandas offers simple and powerf Before performing any analysis, you must first understand the structure and quality of your dataset. This step helps identify data types, missing values, and potential anomalies. -```markdown |Method|Description| |:-----|:----------| |`df.head()`|Displays the first n rows (default 5) for a quick look at the data.| @@ -17,27 +16,24 @@ This step helps identify data types, missing values, and potential anomalies. |`df.shape`|Returns a tuple (rows, columns).| |`df.dtypes`|Displays data types of all columns.| -``` 2. **Handling Missing Data (NaN)** Real-world data often has missing or incomplete entries. Handling them correctly is essential to avoid biased or invalid results. -```markdown |Method|Description| |:-----|:----------| |`df.isnull().sum()`|Counts missing (NaN) values per column.| |`df.dropna()`|Removes rows with missing values.| |`df.fillna(value)`|Fills missing values with a specific value.| |`df.fillna(df.mean())`|Fills missing values with the mean (for numeric columns).| -``` + 3. **Data Selection and Filtering** Once the data is clean, you often need to focus on specific rows or columns to analyze relevant subsets. 
-```markdown |Method|Description| |:-----|:----------| |`df['col']`|Selects a single column (returns a Series).| @@ -45,25 +41,23 @@ Once the data is clean, you often need to focus on specific rows or columns to a |`df.loc[row_labels, col_labels]`|Selects by label (rows and columns).| |`df.iloc[row_index, col_index]`|Selects by integer index position.| |`df[df['col'] > value]`|Filters rows based on a condition.| -``` + 4. **Grouping and Aggregation** After filtering, you often need to summarize or compare groups within your data. -```markdown |Method|Description| |:-----|:----------| |`df.groupby('col').agg()`|Groups data by the specified column, then applies an aggregate function (e.g., `mean()`, `sum()`, `count()`).| |`df.describe()`|Generates descriptive statistics (mean, std, min, max, etc.) for numerical columns.| |`df['col'].value_counts()`|Counts the frequency of unique values in a column.| -``` + 5. **Data Transformation & Cleaning** Data transformation involves reshaping, reformatting, or correcting data to make it more consistent and analysis-ready. -```markdown |Method|Description| |:-----|:----------| |`df.rename(columns={'old':'new'})`|Renames columns.| @@ -72,13 +66,12 @@ Data transformation involves reshaping, reformatting, or correcting data to make |`df.astype('type')`|Changes the data type of a column.| |`df.sort_values(by='col')`|Sorts rows by column values.| |`df.reset_index(drop=True)`|Resets the DataFrame index.| -``` + ***Quick Statistics*** Once the data is ready, you can compute summary statistics to get insights about its distribution and relationships. -```markdown |Method|Description| |:-----|:----------| |`df.rename(columns={'old':'new'})`|Renames columns.| @@ -86,5 +79,4 @@ Once the data is ready, you can compute summary statistics to get insights about |`df.replace(old, new)`|Replaces specific values.| |`df.astype('type')`|Changes the data type of a column.| |`df.sort_values(by='col')`|Sorts rows by column values.| -|`df.reset_index(drop=True)`|Resets the DataFrame index.| -``` \ No newline at end of file +|`df.reset_index(drop=True)`|Resets the DataFrame index.| \ No newline at end of file diff --git a/docs/Pandas/pd_dataframes.md b/docs/Pandas/pd_dataframes.md index b99ae10a..78005166 100644 --- a/docs/Pandas/pd_dataframes.md +++ b/docs/Pandas/pd_dataframes.md @@ -1,4 +1,4 @@ -## Key Data Structures: Series and DataFrame +# Key Data Structures: Series and DataFrame Pandas introduces two primary data structures: the Series and the DataFrame. Understanding these is crucial, as they form the basis of nearly all operations in the library. diff --git a/docs/Pandas/pd_input_output.md b/docs/Pandas/pd_input_output.md index 2069df16..ac9ee152 100644 --- a/docs/Pandas/pd_input_output.md +++ b/docs/Pandas/pd_input_output.md @@ -1,18 +1,17 @@ -# Data Input/Output (I/O) +# Data Input/Output One of the greatest strengths of Pandas is its ability to effortlessly read data into and write data out of a DataFrame from various file formats. This is achieved primarily through the functions prefixed with `pd.read_` and the methods prefixed with `df.to_`. ## Reading Data into a DataFrame To load data into a Pandas DataFrame, you use the appropriate `pd.read_...()` function. The most common input format is CSV. 
-``` |Function|File Type|Example Usage| |:-------|:--------|:------------| |`pd.read_csv()`|Comma-Separated Values (Text files)|`df = pd.read_csv('data.csv')`| |`pd.read_excel()`|Microsoft Excel files|`df = pd.read_excel('data.xlsx')`| |`pd.read_json()`|JavaScript Object Notation|`df = pd.read_json('data.json')`| |`pd.read_sql()`|SQL database tables|`df = pd.read_sql(query, connection)`| -``` + **Example**: Reading a CSV File @@ -27,13 +26,12 @@ df_sales = pd.read_csv('sales_data.csv') After you've cleaned, transformed, or analyzed your data, you'll use a `.to_...()` method on the DataFrame object to save the results. -``` |Method|File Type|Example Usage| |:-----|:--------|:------------| |`df.to_csv()`|Comma-Separated Values|`df.to_csv('cleaned_data.csv', index=False)`| |`df.to_excel()`|Microsoft Excel files|`df.to_excel('analysis.xlsx', sheet_name='Summary')`| |`df.to_json()`|JavaScript Object Notation|`df.to_json('data_output.json')`| -``` + **Example**: Writing to a CSV File diff --git a/docs/Pandas/pd_intro.md b/docs/Pandas/pd_intro.md index f1f5c565..04af01fd 100644 --- a/docs/Pandas/pd_intro.md +++ b/docs/Pandas/pd_intro.md @@ -6,9 +6,7 @@ Pandas is a powerful, open-source Python library essential for data analysis and At its core, Pandas is designed to make working with labeled and relational data (like data found in spreadsheets or SQL tables) both intuitive and fast. It is built on top of the NumPy library and is the standard tool used by data professionals for critical tasks such as: - Data Cleaning: Handling missing data, filtering, and correcting errors. - - Data Transformation: Grouping, merging, reshaping, and pivoting datasets. - - Data Exploration: Calculating descriptive statistics and inspecting data structure. ### Installation and Setup 🛠️ @@ -44,5 +42,4 @@ print(pd.__version__) It's helpful for users to know that Pandas is deeply integrated with the wider Python data science ecosystem: - Built on NumPy: Internally, Pandas relies heavily on the NumPy library for fast array-based computation, which is why it performs complex operations so quickly. - - Data Visualization: Pandas data structures work seamlessly with popular visualization libraries like Matplotlib and Seaborn. 
\ No newline at end of file From 5b7a3ad04e4278b0ac369b39e8ce49f1625604ce Mon Sep 17 00:00:00 2001 From: Anuj Kulkarni Date: Tue, 28 Oct 2025 18:44:13 +0530 Subject: [PATCH 4/4] Update docs/Pandas/pd_data_analysis.md Co-authored-by: vercel[bot] <35613825+vercel[bot]@users.noreply.github.com> --- docs/Pandas/pd_data_analysis.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/Pandas/pd_data_analysis.md b/docs/Pandas/pd_data_analysis.md index 41583e83..a3fc0a43 100644 --- a/docs/Pandas/pd_data_analysis.md +++ b/docs/Pandas/pd_data_analysis.md @@ -74,9 +74,10 @@ Once the data is ready, you can compute summary statistics to get insights about |Method|Description| |:-----|:----------| -|`df.rename(columns={'old':'new'})`|Renames columns.| -|`df.drop(columns=['col'])`|Removes one or more columns.| -|`df.replace(old, new)`|Replaces specific values.| -|`df.astype('type')`|Changes the data type of a column.| -|`df.sort_values(by='col')`|Sorts rows by column values.| -|`df.reset_index(drop=True)`|Resets the DataFrame index.| \ No newline at end of file +|`df.mean()`|Computes the mean (average) for numeric columns.| +|`df.std()`|Computes the standard deviation for numeric columns.| +|`df.min()`|Returns the minimum value for each column.| +|`df.max()`|Returns the maximum value for each column.| +|`df.median()`|Computes the median (50th percentile) for numeric columns.| +|`df.corr()`|Computes pairwise correlation between numeric columns.| + No newline at end of file \ No newline at end of file