| 1 | +# Basic Data Analysis |
| 2 | + |
| 3 | +Once your data is loaded into a DataFrame, pandas provides concise methods for exploring and summarizing it, which is a core step in most data science workflows. |
| 4 | + |
| 5 | +1. **Inspecting the Data** |
| 6 | + |
| 7 | +Before performing any analysis, you must first understand the structure and quality of your dataset. |
| 8 | +This step helps identify data types, missing values, and potential anomalies. |
| 9 | + |
| 10 | +|Method|Description| |
| 11 | +|:-----|:----------| |
| 12 | +|`df.head()`|Displays the first n rows (default 5) for a quick look at the data.| |
| 13 | +|`df.tail()`|Displays the last n rows (default 5).| |
| 14 | +|`df.info()`|Shows column data types, non-null counts, and memory usage.| |
| 15 | +|`df.describe()`|Generates summary statistics for numeric columns.| |
| 16 | +|`df.shape`|Returns a tuple (rows, columns).| |
| 17 | +|`df.dtypes`|Displays data types of all columns.| |
| 18 | + |
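As a minimal sketch of the inspection methods listed above, the following builds a small DataFrame and calls each one. The column names and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev", "Eli", "Fay"],
    "age": [34, 28, 45, 23, 39, 31],
    "city": ["Lima", "Oslo", "Lima", "Pune", "Oslo", "Lima"],
})

first_rows = df.head(3)      # first 3 rows instead of the default 5
last_rows = df.tail(2)       # last 2 rows
n_rows, n_cols = df.shape    # (6, 3)
column_types = df.dtypes     # one dtype per column
summary = df.describe()      # count, mean, std, min, quartiles, max for "age"

df.info()                    # prints dtypes, non-null counts, and memory usage
print(first_rows)
print(summary)
```

Note that `df.shape` and `df.dtypes` are attributes, not methods, so they take no parentheses.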
| 19 | + |
| 20 | +2. **Handling Missing Data (NaN)** |
| 21 | + |
| 22 | +Real-world data often has missing or incomplete entries. |
| 23 | +Handling them correctly is essential to avoid biased or invalid results. |
| 24 | + |
| 25 | +|Method|Description| |
| 26 | +|:-----|:----------| |
| 27 | +|`df.isnull().sum()`|Counts missing (NaN) values per column.| |
| 28 | +|`df.dropna()`|Removes rows that contain at least one missing value.| |
| 29 | +|`df.fillna(value)`|Fills missing values with a specific value.| |
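| 30 | +|`df.fillna(df.mean(numeric_only=True))`|Fills missing values in numeric columns with each column's mean; `numeric_only=True` avoids an error when the DataFrame also has non-numeric columns.| |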
| 31 | + |
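The missing-data methods above could be combined roughly as follows. This is a sketch on hypothetical data; the fill values (`0.0`, `"unknown"`) are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "score": [90.0, np.nan, 70.0, 80.0],
    "group": ["a", "b", None, "a"],
})

missing_per_column = df.isnull().sum()    # score: 1, group: 1
complete_rows = df.dropna()               # keeps only rows 0 and 3

# Fill each column with its own constant by passing a dict.
filled_constant = df.fillna({"score": 0.0, "group": "unknown"})

# Fill numeric columns with their mean; "group" is left untouched
# because it has no entry in the Series of means.
filled_mean = df.fillna(df.mean(numeric_only=True))
```

Each call returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned to a name to be kept.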
| 32 | + |
| 33 | +3. **Data Selection and Filtering** |
| 34 | + |
| 35 | +Once the data is clean, you often need to focus on specific rows or columns to analyze relevant subsets. |
| 36 | + |
| 37 | +|Method|Description| |
| 38 | +|:-----|:----------| |
| 39 | +|`df['col']`|Selects a single column (returns a Series).| |
| 40 | +|`df[['col1','col2']]`|Selects multiple columns.| |
| 41 | +|`df.loc[row_labels, col_labels]`|Selects by label (rows and columns).| |
| 42 | +|`df.iloc[row_index, col_index]`|Selects by integer index position.| |
| 43 | +|`df[df['col'] > value]`|Filters rows based on a condition.| |
| 44 | + |
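To make the difference between label-based and position-based selection concrete, here is a small sketch. The index labels and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with string row labels so that .loc and .iloc differ visibly.
df = pd.DataFrame(
    {"age": [34, 28, 45], "city": ["Lima", "Oslo", "Pune"]},
    index=["ana", "ben", "cara"],
)

ages = df["age"]                             # single column -> Series
subset = df[["age", "city"]]                 # list of columns -> DataFrame
by_label = df.loc["ben", "city"]             # "Oslo"
by_position = df.iloc[0, 1]                  # "Lima"

label_slice = df.loc["ana":"ben", ["age"]]   # label slices include both endpoints
position_slice = df.iloc[0:2, 0:1]           # position slices exclude the stop

over_30 = df[df["age"] > 30]                 # boolean mask keeps rows "ana" and "cara"
```

The two slices select the same rows and column, which shows that `.loc["ana":"ben"]` includes `"ben"` while `.iloc[0:2]` stops before position 2.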
| 45 | + |
| 46 | +4. **Grouping and Aggregation** |
| 47 | + |
| 48 | +After filtering, you often need to summarize or compare groups within your data. |
| 49 | + |
| 50 | +|Method|Description| |
| 51 | +|:-----|:----------| |
| 52 | +|`df.groupby('col').agg(func)`|Groups rows by the values in the specified column, then applies an aggregate function (e.g., `'mean'`, `'sum'`, `'count'`). `agg` requires at least one function.| |
| 53 | +|`df.describe()`|Generates descriptive statistics (mean, std, min, max, etc.) for numerical columns.| |
| 54 | +|`df['col'].value_counts()`|Counts the frequency of unique values in a column.| |
| 55 | + |
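A short sketch of grouping and aggregation on hypothetical sales data follows. The output column names `total` and `largest` in the named aggregation are made up for this example.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "city": ["Lima", "Oslo", "Lima", "Oslo", "Lima"],
    "sales": [10, 20, 30, 40, 50],
})

# One aggregate per group.
mean_sales = df.groupby("city")["sales"].mean()    # Lima: 30.0, Oslo: 30.0

# Several aggregates at once by passing a list of function names.
stats = df.groupby("city")["sales"].agg(["sum", "count"])

# Named aggregation: output_column=(input_column, function).
named = df.groupby("city").agg(
    total=("sales", "sum"),
    largest=("sales", "max"),
)

# Frequency of each unique value, most frequent first.
city_counts = df["city"].value_counts()            # Lima: 3, Oslo: 2
```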
| 56 | + |
| 57 | +5. **Data Transformation & Cleaning** |
| 58 | + |
| 59 | +Data transformation involves reshaping, reformatting, or correcting data to make it more consistent and analysis-ready. |
| 60 | + |
| 61 | +|Method|Description| |
| 62 | +|:-----|:----------| |
| 63 | +|`df.rename(columns={'old':'new'})`|Renames columns.| |
| 64 | +|`df.drop(columns=['col'])`|Removes one or more columns.| |
| 65 | +|`df.replace(old, new)`|Replaces specific values.| |
| 66 | +|`df.astype({'col': 'type'})`|Changes the data type of the named column(s); `df.astype('type')` would cast every column.| |
| 67 | +|`df.sort_values(by='col')`|Sorts rows by column values.| |
| 68 | +|`df.reset_index(drop=True)`|Resets the DataFrame index.| |
| 69 | + |
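Because each of these methods returns a new DataFrame, they can be chained. The sketch below applies all six to hypothetical data; treating the string `"N/A"` as a placeholder for unknown values is an assumption made for the example.

```python
import pandas as pd

# Hypothetical messy data: inconsistent column name, numbers stored
# as strings, a scratch column, and a placeholder string.
df = pd.DataFrame({
    "Name": ["ben", "ana", "cara"],
    "age": ["28", "34", "45"],
    "tmp": [0, 0, 0],
    "city": ["N/A", "Lima", "Oslo"],
})

cleaned = (
    df.rename(columns={"Name": "name"})        # consistent column name
      .drop(columns=["tmp"])                   # remove the scratch column
      .replace("N/A", "unknown")               # replace the placeholder value
      .astype({"age": "int64"})                # cast only the "age" column
      .sort_values(by="age", ascending=False)  # oldest first
      .reset_index(drop=True)                  # renumber rows 0..n-1
)
print(cleaned)
```

The original `df` is not modified; only `cleaned` holds the transformed result.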
| 70 | + |
| 71 | +6. **Quick Statistics** |
| 72 | + |
| 73 | +Once the data is ready, you can compute summary statistics to get insights about its distribution and relationships. |
| 74 | + |
| 75 | +|Method|Description| |
| 76 | +|:-----|:----------| |
| 77 | +|`df.mean(numeric_only=True)`|Computes the mean (average) of each numeric column.| |
| 78 | +|`df.std(numeric_only=True)`|Computes the sample standard deviation of each numeric column.| |
| 79 | +|`df.min()`|Returns the minimum value for each column.| |
| 80 | +|`df.max()`|Returns the maximum value for each column.| |
| 81 | +|`df.median(numeric_only=True)`|Computes the median (50th percentile) of each numeric column.| |
| 82 | +|`df.corr(numeric_only=True)`|Computes pairwise correlation (Pearson by default) between numeric columns.| |
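
The statistics above can be sketched on a small hypothetical DataFrame that mixes numeric and text columns. The example assumes a pandas version in which these methods accept `numeric_only`, which restricts them to numeric columns.

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, and "label" is non-numeric.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "label": ["a", "b", "c", "d"],
})

means = df.mean(numeric_only=True)       # x: 2.5, y: 5.0
medians = df.median(numeric_only=True)   # x: 2.5, y: 5.0
spread = df.std(numeric_only=True)       # sample standard deviation (ddof=1)
lowest = df["x"].min()                   # 1
highest = df["y"].max()                  # 8
corr = df.corr(numeric_only=True)        # x and y are perfectly correlated

print(means)
print(corr)
```

Without `numeric_only=True`, recent pandas versions raise an error on the `label` column instead of silently skipping it.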