| name | csv-data-summarizer |
|---|---|
| description | Analyzes CSV files and automatically generates comprehensive summaries with statistical insights, data quality checks, and visualizations using Python and pandas. No questions asked — just upload a CSV and get a full analysis immediately. |

This skill analyzes any CSV file and delivers a complete statistical summary with visualizations in one pass. It inspects the columns and adapts the analysis to the kind of data it finds: sales, customer, financial, operational, survey, or generic tabular data.

Use this skill when the user:

- uploads or references a CSV file
- asks to summarize, analyze, or visualize tabular data
- requests insights from a dataset
- wants to understand a dataset's structure and quality

Do not ask the user what they want. When a CSV is provided, skip questions like "What would you like me to do?" and run the full analysis immediately.

The skill requires pandas, matplotlib, and seaborn:

```bash
pip install pandas matplotlib seaborn
```

The skill inspects the data first, then automatically determines which analyses are relevant:

| Data type | Focus areas |
|---|---|
| Sales / e-commerce | Time-series trends, revenue, product performance |
| Customer data | Distributions, segmentation, geographic patterns |
| Financial | Trend analysis, statistics, correlations |
| Operational | Time-series, performance metrics, distributions |
| Survey | Frequency analysis, cross-tabulations |
| Generic | Adapts based on column types found |
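
The table does not say how a dataset is matched to a row. One way to approximate the step is keyword matching on column names. The sketch below is an illustration under that assumption: `guess_data_type` and the keyword lists in `KEYWORDS` are hypothetical and not part of the skill.

```python
import pandas as pd

# Hypothetical keyword lists, chosen for illustration only.
KEYWORDS = {
    "Sales / e-commerce": ("order", "product", "revenue", "price", "quantity"),
    "Customer data": ("customer", "segment", "country", "city", "age"),
    "Financial": ("balance", "account", "interest", "expense"),
    "Operational": ("duration", "latency", "status", "throughput"),
    "Survey": ("question", "response", "rating", "answer"),
}


def guess_data_type(df: pd.DataFrame) -> str:
    """Return the table row whose keywords match the most column names."""
    names = [str(c).lower() for c in df.columns]
    scores = {
        label: sum(any(word in name for word in words) for name in names)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the generic row when no keyword matches at all.
    return best if scores[best] > 0 else "Generic"


orders = pd.DataFrame(columns=["order_id", "date", "product", "quantity", "price"])
print(guess_data_type(orders))  # Sales / e-commerce
```

Substring matching is deliberately crude here. A real classifier would probably also look at dtypes and value ranges before choosing a focus area.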
Visualizations are only created when they make sense:
- Time-series plots → only if date/timestamp columns exist
- Correlation heatmaps → only if multiple numeric columns exist
- Category distributions → only if categorical columns exist
- Histograms → for numeric distributions when relevant
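
The conditions in this list can be written as a small planning step that runs before any plotting. This is a sketch, not the skill's own code: `plan_charts` is a hypothetical helper, and it assumes date columns are found by name and categorical columns have the `object` dtype.

```python
import pandas as pd


def plan_charts(df: pd.DataFrame) -> list:
    """List the chart types that the rules above would allow for df."""
    numeric = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(include="object").columns
    dated = [c for c in df.columns
             if "date" in str(c).lower() or "time" in str(c).lower()]

    charts = []
    if dated:
        charts.append("time_series")
    if len(numeric) > 1:
        charts.append("correlation_heatmap")
    if len(categorical) > 0:
        charts.append("category_distributions")
    if len(numeric) > 0:
        charts.append("histograms")
    return charts


# Explicit object dtype so the example behaves the same across pandas versions.
df = pd.DataFrame({
    "region": pd.Series(["north", "south"], dtype=object),
    "price": [9.5, 12.0],
})
print(plan_charts(df))  # ['category_distributions', 'histograms']
```

With one numeric column and no date column, neither the heatmap nor the time-series plot is planned.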

The full analysis is a single function:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def summarize_csv(file_path):
    """Return a text summary of a CSV file and save charts as PNGs in the working directory."""
    df = pd.read_csv(file_path)
    summary = []
    charts_created = []

    # --- Overview ---
    summary.append("=" * 60)
    summary.append("DATA OVERVIEW")
    summary.append("=" * 60)
    summary.append(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
    summary.append(f"\nColumns: {', '.join(df.columns.tolist())}")
    summary.append("\nDATA TYPES:")
    for col, dtype in df.dtypes.items():
        summary.append(f"  • {col}: {dtype}")

    # --- Data quality ---
    missing = df.isnull().sum().sum()
    # df.size is rows * columns; the guard avoids dividing by zero on a header-only file
    missing_pct = (missing / df.size) * 100 if df.size else 0.0
    summary.append("\nDATA QUALITY:")
    if missing:
        summary.append(f"Missing values: {missing:,} ({missing_pct:.2f}% of total data)")
        for col in df.columns:
            col_missing = df[col].isnull().sum()
            if col_missing > 0:
                summary.append(f"  • {col}: {col_missing:,} ({(col_missing / len(df)) * 100:.1f}%)")
    else:
        summary.append("No missing values: dataset is complete.")

    # --- Numeric analysis ---
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    if numeric_cols:
        summary.append("\nNUMERICAL ANALYSIS:")
        summary.append(str(df[numeric_cols].describe()))

        # A correlation matrix needs at least two numeric columns
        if len(numeric_cols) > 1:
            corr_matrix = df[numeric_cols].corr()
            summary.append("\nCORRELATIONS:")
            summary.append(str(corr_matrix))

            plt.figure(figsize=(10, 8))
            sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, linewidths=1)
            plt.title('Correlation Heatmap')
            plt.tight_layout()
            plt.savefig('correlation_heatmap.png', dpi=150)
            plt.close()
            charts_created.append('correlation_heatmap.png')

    # --- Categorical analysis (columns with "id" in the name are skipped) ---
    categorical_cols = [c for c in df.select_dtypes(include='object').columns if 'id' not in c.lower()]
    if categorical_cols:
        summary.append("\nCATEGORICAL ANALYSIS:")
        for col in categorical_cols[:5]:
            value_counts = df[col].value_counts()
            summary.append(f"\n{col}:")
            for val, count in value_counts.head(10).items():
                summary.append(f"  • {val}: {count:,} ({(count / len(df)) * 100:.1f}%)")

    # --- Time series analysis (first column with "date" or "time" in the name) ---
    date_cols = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
    if date_cols:
        date_col = date_cols[0]
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

        # Skip the section when no value in the column parses as a date
        if df[date_col].notna().any():
            date_range = df[date_col].max() - df[date_col].min()
            summary.append("\nTIME SERIES ANALYSIS:")
            summary.append(f"Date range: {df[date_col].min()} to {df[date_col].max()}")
            summary.append(f"Span: {date_range.days} days")

            if numeric_cols:
                n_plots = min(3, len(numeric_cols))
                fig, axes = plt.subplots(n_plots, 1, figsize=(12, 4 * n_plots))
                # plt.subplots returns a single Axes, not an array, for one plot
                if n_plots == 1:
                    axes = [axes]
                for idx, num_col in enumerate(numeric_cols[:3]):
                    ax = axes[idx]
                    df.groupby(date_col)[num_col].mean().plot(ax=ax, linewidth=2)
                    ax.set_title(f'{num_col} Over Time')
                    ax.set_xlabel('Date')
                    ax.set_ylabel(num_col)
                    ax.grid(True, alpha=0.3)
                plt.tight_layout()
                plt.savefig('time_series_analysis.png', dpi=150)
                plt.close()
                charts_created.append('time_series_analysis.png')

    # --- Distribution plots (up to four numeric columns) ---
    if numeric_cols:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()
        for idx, col in enumerate(numeric_cols[:4]):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(True, alpha=0.3)
        # Hide unused panels in the 2x2 grid
        for idx in range(len(numeric_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('distributions.png', dpi=150)
        plt.close()
        charts_created.append('distributions.png')

    # --- Categorical distribution plots (up to four columns, top 10 values each) ---
    if categorical_cols:
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()
        for idx, col in enumerate(categorical_cols[:4]):
            value_counts = df[col].value_counts().head(10)
            axes[idx].barh(range(len(value_counts)), value_counts.values)
            axes[idx].set_yticks(range(len(value_counts)))
            axes[idx].set_yticklabels(value_counts.index)
            axes[idx].set_title(f'Top Values in {col}')
            axes[idx].set_xlabel('Count')
            axes[idx].grid(True, alpha=0.3, axis='x')
        for idx in range(len(categorical_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('categorical_distributions.png', dpi=150)
        plt.close()
        charts_created.append('categorical_distributions.png')

    if charts_created:
        summary.append("\nVISUALIZATIONS CREATED:")
        for chart in charts_created:
            summary.append(f"  ✓ {chart}")

    summary.append("\n" + "=" * 60)
    summary.append("ANALYSIS COMPLETE")
    summary.append("=" * 60)
    return "\n".join(summary)
```

Example prompts:

- "Here's sales_data.csv. Can you summarize this file?"
- "Analyze this customer data CSV and show me trends."
- "What insights can you find in orders.csv?"

Example output (abbreviated):

```
============================================================
DATA OVERVIEW
============================================================
Rows: 5,000 | Columns: 8

Columns: order_id, date, product, category, quantity, price, region, customer_id

DATA TYPES:
  • order_id: int64
  • date: object
  • price: float64
  ...

DATA QUALITY:
Missing values: 100 (0.25% of total data)
  • price: 100 (2.0%)

NUMERICAL ANALYSIS:
       quantity     price
count  5000.000  4900.000
mean      3.200    58.200
std       1.800    12.400
...

TIME SERIES ANALYSIS:
Date range: 2023-01-01 00:00:00 to 2023-12-31 00:00:00
Span: 364 days

VISUALIZATIONS CREATED:
  ✓ correlation_heatmap.png
  ✓ time_series_analysis.png
  ✓ distributions.png
  ✓ categorical_distributions.png

============================================================
ANALYSIS COMPLETE
============================================================
```
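
The two percentages in the DATA QUALITY section use different denominators. In the example output, 100 missing prices out of 5,000 × 8 = 40,000 cells give 0.25% overall, while the same 100 values are 2.0% of the 5,000 rows in the `price` column. The sketch below repeats that arithmetic on a small made-up frame; the data is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 rows x 2 columns, with one missing price.
df = pd.DataFrame({
    "quantity": [1, 2, 3, 4],
    "price": [9.5, np.nan, 12.0, 8.0],
})

total_missing = int(df.isnull().sum().sum())
overall_pct = total_missing / df.size * 100   # share of all cells (rows * columns)
per_column_pct = df.isnull().mean() * 100     # share of rows, computed per column

print(f"{total_missing} missing ({overall_pct:.2f}% of all cells)")  # 1 missing (12.50% of all cells)
print(per_column_pct.to_dict())  # {'quantity': 0.0, 'price': 25.0}
```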
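
Notes:

- Date columns are auto-detected if the column name contains `date` or `time`
- Columns with `id` in the name are excluded from categorical analysis
- All charts are saved as PNG files in the working directory
- Missing data does not stop the analysis: missing values are counted per column, dropped before histograms are drawn, and unparseable dates are coerced to `NaT`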
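
Both name rules are substring checks, so they can match more columns than intended. The sketch below applies the same two tests as the implementation to a list of made-up column names. The token-based variant at the end is a hypothetical alternative, not the skill's behavior.

```python
# Hypothetical column names, picked to show where substring matching overreaches.
columns = ["order_id", "paid", "created_date", "lifetime_value", "region"]

# Same substring tests as the implementation.
excluded_as_id = [c for c in columns if "id" in c.lower()]
detected_as_date = [c for c in columns if "date" in c.lower() or "time" in c.lower()]

print(excluded_as_id)    # ['order_id', 'paid']
print(detected_as_date)  # ['created_date', 'lifetime_value']

# Hypothetical stricter rule: match whole underscore-separated tokens only.
strict_id = [c for c in columns if "id" in c.lower().split("_")]
print(strict_id)         # ['order_id']
```

Renaming a column before analysis is the simplest way around a false match such as `paid` or `lifetime_value`.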

Related skills:

- `json-and-csv-data-transformation`: Clean and reshape CSV data before analysis
- `database-query-and-export`: Export query results to CSV for analysis
- `d3js-data-visualization`: Build interactive browser-based charts from the same data