Tutorial for users to get up and running with the AutoCSV Profiler Suite.
- Before You Begin
- Tutorial 1: First Analysis
- Tutorial 2: Understanding Output Reports
- Tutorial 3: Comparing Different Engines
- Tutorial 4: Handling Large Files
- Tutorial 5: Troubleshooting Common Issues
- Next Steps
Complete the Installation Guide before starting this tutorial.
Verify setup:

```bash
conda env list | grep csv-profiler
# Should show: csv-profiler-main, csv-profiler-profiling, csv-profiler-dataprep
```

For this tutorial, we'll create a simple test dataset:
```bash
# Create a sample CSV file
echo "name,age,city,salary,department" > tutorial_data.csv
echo "Alice,25,New York,50000,Engineering" >> tutorial_data.csv
echo "Bob,30,San Francisco,75000,Engineering" >> tutorial_data.csv
echo "Carol,35,Chicago,60000,Marketing" >> tutorial_data.csv
echo "David,28,Austin,55000,Sales" >> tutorial_data.csv
echo "Eve,32,Seattle,70000,Engineering" >> tutorial_data.csv
```

Setup process completed.
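The same file can also be written from Python with the stdlib `csv` module (a minimal sketch equivalent to the shell commands above):

```python
import csv

rows = [
    ["name", "age", "city", "salary", "department"],
    ["Alice", 25, "New York", 50000, "Engineering"],
    ["Bob", 30, "San Francisco", 75000, "Engineering"],
    ["Carol", 35, "Chicago", 60000, "Marketing"],
    ["David", 28, "Austin", 55000, "Sales"],
    ["Eve", 32, "Seattle", 70000, "Engineering"],
]

with open("tutorial_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Sanity check: 5 data rows, 5 columns each
with open("tutorial_data.csv", newline="") as f:
    data = list(csv.reader(f))
print(len(data) - 1, len(data[0]))  # 5 5
```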
Run CSV analysis using the interactive mode.
```bash
python bin/run_analysis.py
```

The welcome screen displays:
When prompted, enter the path to the test file:
```
Step 1: File Selection
Select CSV file for analysis
Enter the path to the CSV file: tutorial_data.csv
```
The system will validate the file and show its size:
```
Setting up output directory for: tutorial_data.csv
Output will be saved to: tutorial_data_analysis_20240101_143022
```
The system automatically detects the delimiter:
```
Step 2: Delimiter Detection
Analyzing file structure to determine the best delimiter
Analyzing sample data...
Detected delimiter: ',' (confidence: 0.95)
```
If detection fails, manually specify the delimiter when prompted.
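Delimiter sniffing of this kind can be reproduced with Python's `csv.Sniffer`; this is a sketch of the general technique, not the suite's actual detection code (and `Sniffer` does not report a confidence score):

```python
import csv

def detect_delimiter(path: str, sample_bytes: int = 4096) -> str:
    """Guess the delimiter from a sample of the file's first bytes."""
    with open(path, newline="") as f:
        sample = f.read(sample_bytes)
    # Restrict candidates to common CSV delimiters
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter

# Try it on a small comma-separated file
with open("sniff_demo.csv", "w") as f:
    f.write("name,age,city\nAlice,25,New York\n")
print(detect_delimiter("sniff_demo.csv"))  # ,
```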
The available engines are displayed.
For a first analysis, select engines 1 and 2 (Main and YData):
Watch the progress as each engine runs:
Summary of generated files:
```
Analysis Results Summary
✓ main/analyzer - Completed successfully
  Generated: dataset_analysis.txt, numerical_summary.csv, categorical_summary.csv
✓ profiling/ydata_report - Completed successfully
  Generated: ydata_profiling_report.html
Analysis completed in: 12.3 seconds
Output directory: /path/to/tutorial_data_analysis_20240101_143022
```
Initial analysis completed. The output directory contains:
- Text reports from the main engine
- HTML report from YData Profiling
- Visualizations and additional analysis files
Let's explore what each engine produces and how to interpret the results.
Navigate to the output directory and examine the main engine results.

The dataset_analysis.txt report covers:
- Dataset structure and size
- Data types and quality metrics
- Missing value assessment

Open numerical_summary.csv to see statistical summaries:
- Central tendencies (mean, median)
- Spread (standard deviation, quartiles)
- Range (min, max values)

Open categorical_summary.csv for:
- Unique value counts
- Most frequent categories
- Category distribution patterns
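The kinds of figures in the numerical summary can be checked by hand with Python's `statistics` module (a sketch using the tutorial salary column):

```python
import statistics

salaries = [50000, 75000, 60000, 55000, 70000]  # tutorial_data.csv salary column

mean = statistics.mean(salaries)      # central tendency
median = statistics.median(salaries)
stdev = statistics.stdev(salaries)    # sample standard deviation (spread)

print(f"mean={mean}, median={median}, stdev={stdev:.1f}")
print(f"range: {min(salaries)}-{max(salaries)}")
```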
Open the ydata_profiling_report.html file in a web browser.
- Overview Tab
  - Dataset summary statistics
  - Variable types breakdown
  - Warnings about data quality issues
- Variables Tab
  - Detailed analysis of each column
  - Distribution plots and histograms
  - Missing value patterns
  - Unique value counts
- Interactions Tab
  - Correlation matrix between variables
  - Scatter plots for numerical variables
  - Association matrices
- Correlations Tab
  - Pearson, Spearman, and other correlation measures
  - Correlation heatmaps
  - Strong correlation identification
- Missing Values Tab
  - Missing value matrix
  - Missing value heatmaps
  - Patterns in missing data
- Sample Tab
  - First and last few rows of data
  - Random sample view
For our tutorial dataset:
- Data Quality: All fields complete (no missing values)
- Numerical Variables:
  - Age ranges from 25-35 (young workforce)
  - Salary ranges from $50K-75K (consistent with age)
- Categorical Variables:
  - Engineering is the dominant department (3/5 employees)
  - All employees are in different cities (distributed workforce)
- Correlations: Age and salary show a positive correlation (older employees tend to earn more)
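The age-salary relationship can be verified from first principles; `pearson` below is a hand-rolled Pearson coefficient for illustration, not part of the suite:

```python
import math

ages = [25, 30, 35, 28, 32]
salaries = [50000, 75000, 60000, 55000, 70000]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(ages, salaries)
print(f"age-salary correlation: {r:.2f}")  # positive, roughly 0.51
```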
Each engine has strengths for different use cases. Let's run a comparison.
Start a new analysis with all engines:
```bash
python bin/run_analysis.py tutorial_data.csv
```

When prompted for engine selection, press Enter to select all engines.
After completion, compare what each engine produced:
| Engine | Output Type | Best For | Processing Time |
|---|---|---|---|
| Main | Text/CSV files | Statistical analysis, custom metrics | Fast |
| YData | Interactive HTML | Data profiling, data quality | Slow |
| SweetViz | Interactive HTML | Quick overview, presentations | Fast |
| DataPrep | Interactive HTML | EDA, distribution analysis | Medium |
Use Main Engine when:
- Detailed statistical analysis needed
- Working with large datasets
- Integrating with other data pipelines
- Need CSV output for further processing
Use YData Profiling when:
- Performing data exploration
- Need data quality assessment
- Want correlation analysis
- Need comprehensive HTML reports
Use SweetViz when:
- Need quick data overview
- Creating presentation materials
- Comparing datasets
- Time is limited
Use DataPrep when:
- Focus on exploratory data analysis
- Need distribution visualizations
- Working with legacy systems
- Want balanced features and speed
Try running the same dataset through different engines and compare:
- Speed: Which engine completes fastest?
- Detail: Which provides the most detailed analysis?
- Visualization: Which has the clearest charts?
- Usability: Which report is easiest to understand?
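To compare speed yourself, a small wall-clock wrapper around any command line is enough (a sketch; the `-c pass` command is only a placeholder for a real engine run):

```python
import subprocess
import sys
import time

def time_command(cmd: list[str]) -> float:
    """Run a command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Placeholder command; substitute e.g. [sys.executable, "bin/run_analysis.py", "data.csv"]
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f}s")
```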
Let's learn how to handle larger datasets efficiently.
```bash
# Create a larger sample dataset
python docs/examples/generate_sample_data.py --size medium
```

This creates a ~10-50 MB file with 100,000 rows for testing memory management.
Before analyzing large files, check system resources:
```bash
# Check available memory
python -c "import psutil; print(f'Available memory: {psutil.virtual_memory().available / (1024**3):.1f} GB')"
```

Then start the analysis:

```bash
python bin/run_analysis.py medium_sample.csv
```

Additional information is shown for large files:
```
File size analysis: 25.4 MB (large file detected)
Enabling chunked processing with progress tracking
Chunk size: 10,000 rows
Memory limit: 1.0 GB
Loading data: 100%|████████████| 10/10 [00:05<00:00, 2.1chunk/s]
```
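Chunked reading keeps only one slice of the file in memory at a time. A stdlib sketch of the idea (not the suite's implementation):

```python
import csv

def read_chunks(path: str, chunk_size: int = 10_000):
    """Yield lists of up to chunk_size rows, one chunk in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk  # final partial chunk

# Demo: 25 data rows processed in chunks of 10
with open("chunk_demo.csv", "w") as f:
    f.write("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(25)) + "\n")
sizes = [len(c) for c in read_chunks("chunk_demo.csv", chunk_size=10)]
print(sizes)  # [10, 10, 5]
```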
For large files:
- Use appropriate chunk sizes:
  - Small memory: 2,000-5,000 rows
  - Medium memory: 5,000-10,000 rows
  - Large memory: 10,000+ rows
- Select engines:
  - SweetViz: Fastest for large files
  - Main: Balance of speed and detail
  - YData: Slowest but most detailed
- Monitor progress:
  - Progress bars show chunk processing
  - Memory usage warnings if limits exceeded
For performance optimization and configuration options, see the Performance Guide and USER_GUIDE.md.
If issues occur during this tutorial:
- Check the Troubleshooting Guide for detailed solutions
- Use debug mode for detailed error information:

  ```bash
  python bin/run_analysis.py --debug
  ```

- Verify environment setup: Ensure conda environments are properly installed
Getting started tutorial complete. Next steps:
- Try the Examples:

  ```bash
  python docs/examples/simple_analysis.py
  python docs/examples/programmatic_usage.py
  ```

- Read the User Guide:
  - USER_GUIDE.md - Usage documentation
  - USER_GUIDE.md - Configuration options
  - TROUBLESHOOTING.md - Detailed problem-solving guide
- Custom Engine Development:
  - Development Guide - Engine development guidelines
- Performance Optimization:
  - Performance Guide - Optimization recommendations
- Advanced Usage:
  - See the individual engine testing guide: docs/api/engines/ENGINE_TESTING.md
- Programmatic Usage:
  For complete usage patterns and integration examples, see User Guide - Usage Modes.
  Quick reference:

  ```bash
  # Interactive mode (guided workflow)
  python bin/run_analysis.py
  # Direct analysis
  python bin/run_analysis.py data.csv
  ```

- Automated Pipelines:
  - Integrate with data processing workflows
  - Set up scheduled analysis jobs
  - Build custom reporting dashboards
- Team Usage:
  - Share configuration files
  - Standardize analysis workflows
  - Create custom engine templates
- Documentation: Complete guides in the docs/ directory
- Examples: Working examples in docs/examples/
- Troubleshooting: Solutions in TROUBLESHOOTING.md
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions and share use cases
The AutoCSV Profiler Suite is ready for data analysis tasks.


