This guide provides information for developers working on the AutoCSV Profiler Suite. It covers:
- Development Environment Setup
- Code Style Guidelines
- Commit Message Conventions
- Branch Naming Conventions
- Review Process
- Release Procedures
- Development Workflow
- Testing Strategy
- Debugging and Troubleshooting
- Contributing to Core Components
- Resources
## Development Environment Setup

### Prerequisites

- Anaconda or Miniconda (required for the multi-environment architecture)
- Python 3.10 or higher for base environment
- Git for version control
- At least 3GB free disk space (2GB for conda environments, 1GB for data/outputs)
First, install the suite by following the Installation Guide. Then add the development tooling:
```shell
# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Verify development tools
conda env list | grep csv-profiler
pytest --version
mypy --version
```

The project uses 4 isolated conda environments:
- Base Environment (Python 3.10+): Development tools, orchestration
- csv-profiler-main (Python 3.11): Core statistical analysis
- csv-profiler-profiling (Python 3.10): YData Profiling, SweetViz
- csv-profiler-dataprep (Python 3.10): DataPrep EDA
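The base environment orchestrates the engine environments. That dispatch pattern can be sketched with `conda run` — note that `build_engine_command` is a hypothetical helper for illustration, not the project's actual orchestration code:

```python
import subprocess  # used only if you uncomment the dispatch line below
from typing import List

def build_engine_command(env_name: str, script: str, csv_path: str,
                         delimiter: str, output_dir: str) -> List[str]:
    """Build a `conda run` command that executes an engine script
    inside its isolated environment (illustrative helper only)."""
    return [
        "conda", "run", "-n", env_name,   # run inside the named environment
        "python", script, csv_path, delimiter, output_dir,
    ]

cmd = build_engine_command(
    "csv-profiler-main",
    "autocsv_profiler/engines/main/analyzer.py",
    "data.csv", ",", "output/",
)
# subprocess.run(cmd, check=True)  # would actually dispatch the engine
print(" ".join(cmd))
```

Keeping the engines in separate environments this way avoids the dependency conflicts described in ARCHITECTURE.md.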
## Code Style Guidelines

Code tools: Black, isort, MyPy, flake8, pytest.

### Formatting
- Line Length: 88 characters (Black default)
- Indentation: 4 spaces (no tabs)
- String Quotes: Double quotes preferred, single quotes acceptable
- Import Organization: Standard library, third-party, local imports
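The import ordering rule (what isort enforces) looks like this in practice — the third-party and local imports are commented so the snippet runs anywhere:

```python
# Standard library imports come first
import os
from pathlib import Path

# Third-party imports come second
# import pandas as pd

# Local application imports come last (illustrative project path)
# from autocsv_profiler.core import validation

here = Path(os.curdir).resolve()
print(here.is_dir())
```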
### Type Hints

- Required: All public functions and methods
- Coverage: Maintain type annotation coverage across modules
- Style: Use the `typing` module for complex types
- Optional Dependencies: Use `TYPE_CHECKING` for import isolation
### Docstrings

- Format: Google-style docstrings required
- Examples: Include usage examples in docstrings
- Type Information: Document parameter and return types
- Edge Cases: Document limitations and edge cases
### Other Guidelines

- No wildcard imports
- Import handling for optional dependencies
- Memory management for large files
- Environment isolation for engines
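Several of the rules above — `typing` usage, `TYPE_CHECKING` import isolation, and Google-style docstrings with examples and edge cases — combine in a sketch like this (the function itself is illustrative, not project code):

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Dict, List

if TYPE_CHECKING:  # import isolation: evaluated only by type checkers
    import pandas as pd  # optional dependency, never imported at runtime here

def count_missing(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Count missing (None) values per column.

    Args:
        rows: Records as a list of column-name-to-value mappings.

    Returns:
        A mapping from column name to the number of None values.

    Example:
        >>> count_missing([{"a": 1, "b": None}, {"a": None, "b": 2}])
        {'a': 1, 'b': 1}

    Note:
        Columns absent from a row are not counted as missing.
    """
    counts: Dict[str, int] = {}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] = counts.get(key, 0) + 1
            else:
                counts.setdefault(key, 0)
    return counts
```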
## Commit Message Conventions

Use the Conventional Commits format for consistency and automated changelog generation:

```
<type>(<scope>): <description>

[optional body]

[optional footer(s)]
```

Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- style: Code style changes (no logic changes)
- refactor: Code refactoring
- test: Adding or updating tests
- chore: Maintenance tasks
Scopes:

- engine: Engine-related changes (main, profiling, dataprep)
- ui: User interface components
- config: Configuration system
- core: Core utilities and base classes
- tests: Test-related changes
- docs: Documentation updates
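Commit subjects in this format can be sanity-checked with a small regex — a sketch only, with the allowed types and scopes mirroring the lists above:

```python
import re

# Types and scopes mirror the lists above
TYPES = "feat|fix|docs|style|refactor|test|chore"
SCOPES = "engine|ui|config|core|tests|docs"
COMMIT_RE = re.compile(rf"^(?:{TYPES})\((?:{SCOPES})\): .+")

def is_valid_commit(subject: str) -> bool:
    """Return True if a commit subject matches <type>(<scope>): <description>."""
    return COMMIT_RE.match(subject) is not None

print(is_valid_commit("feat(engine): add memory optimization"))  # → True
print(is_valid_commit("added some stuff"))                       # → False
```

A check like this could run as a pre-commit `commit-msg` hook, though the project's actual hook configuration may differ.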
Examples:

```
feat(engine): add memory optimization for large CSV files
fix(ui): resolve delimiter detection issue with special characters
docs(api): update BaseProfiler class documentation
refactor(core): improve error handling in validation module
test(integration): add multi-environment test coverage
chore(deps): update conda environment specifications
```

## Branch Naming Conventions

Use the format:

```
<type>/<short-description>
```

Types:
- feature: New functionality
- bugfix: Bug fixes
- hotfix: Critical fixes
- docs: Documentation updates
- refactor: Code refactoring
- test: Test improvements
- chore: Maintenance tasks
Examples:

```
feature/memory-optimization
bugfix/delimiter-detection-error
docs/api-reference-update
refactor/engine-base-class
test/performance-benchmarks
chore/dependency-updates
```

## Review Process

- All changes require review before merging
- Tests required for new functionality
- Documentation updates for API changes
- Performance impact assessment for core changes
- Security review for file processing changes
Review checklist:

- Code follows style guidelines
- Tests pass and coverage maintained
- Documentation updated
- No security vulnerabilities introduced
- Performance impact acceptable
- Backward compatibility maintained
- Multi-environment compatibility verified
Workflow:

1. Create pull request from feature branch
2. Automated checks run (CI/CD, pre-commit hooks)
3. Code review by maintainer or senior developer
4. Address feedback and update code as needed
5. Final approval and merge to main branch
## Release Procedures

- Version Location: `autocsv_profiler/version.py`
- Format: Semantic versioning (MAJOR.MINOR.PATCH)
- Current Version: 2.0.0
Pre-release checklist:

- All tests pass in all environments
- Documentation updated
- CHANGELOG.md updated with changes
- Version number incremented in `version.py`
- Environment specifications tested
- Security scan passed
Release steps:

1. Update version in `autocsv_profiler/version.py`
2. Update CHANGELOG.md with release notes
3. Create release tag using semantic versioning
4. Update documentation with new version info
5. Verify conda environments work with release
6. Create GitHub release with release notes
7. Verify release artifacts
8. Monitor for issues
9. Update project documentation links
10. Announce release (if major version)
Version increments:

- Major: Breaking changes, API changes
- Minor: New features, backward compatible
- Patch: Bug fixes, minor improvements
## Development Workflow

1. Update the local repository:

   ```shell
   git pull origin main
   ```

2. Create a feature branch:

   ```shell
   git checkout -b feature/your-feature-name
   ```

3. Make changes in the appropriate environment (see Environment-Specific Development below).

4. Run quality checks:

   ```shell
   # Format code
   black autocsv_profiler/ tests/ bin/
   isort autocsv_profiler/ tests/ bin/

   # Type checking (environment-specific)
   mypy --config-file=mypy_main.ini autocsv_profiler/

   # Linting
   flake8 autocsv_profiler/ bin/
   ```

5. Run tests:

   ```shell
   # All tests
   pytest

   # Fast tests only
   pytest -m "not slow"

   # Specific test categories
   pytest -m unit
   pytest -m integration
   ```

6. Commit changes:

   ```shell
   git add .
   git commit -m "feat(engine): add new feature description"
   ```

7. Push and create a pull request:

   ```shell
   git push origin feature/your-feature-name
   ```
### Environment-Specific Development

For complete engine testing commands and examples, see the Engine Testing Guide.

Quick reference for development testing:
```shell
# Main engine
conda activate csv-profiler-main
python autocsv_profiler/engines/main/analyzer.py test.csv "," output/

# Profiling engines (YData and SweetViz)
conda activate csv-profiler-profiling
python autocsv_profiler/engines/profiling/ydata_report.py test.csv "," output/

# DataPrep engine
conda activate csv-profiler-dataprep
python autocsv_profiler/engines/dataprep/dataprep_report.py test.csv "," output/
```

## Testing Strategy

- Unit Tests (`tests/unit/`): Component isolation
- Integration Tests (`tests/integration/`): Cross-component workflows
- Functional Tests (`tests/functional/`): End-to-end features
- Performance Tests (`tests/performance/`): Resource validation
```shell
# All tests with coverage
pytest

# Fast tests only (exclude slow tests)
pytest -m "not slow"

# Specific test categories
pytest -m unit
pytest -m integration
pytest -m performance

# Parallel testing
pytest -n auto

# HTML coverage report
pytest --cov-report=html
```

Test writing guidelines:

- Follow the AAA pattern: Arrange, Act, Assert
- Use fixtures for test data and setup
- Test edge cases and error conditions
- Maintain coverage above 50% minimum
- Mock external dependencies appropriately
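A test following these guidelines (AAA structure, an edge case, plain asserts) might look like this — `detect_delimiter` is a hypothetical stand-in built on the standard library's `csv.Sniffer`, not actual project code:

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Hypothetical helper: guess the delimiter with csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter

def test_detect_delimiter_semicolon() -> None:
    # Arrange: build a small semicolon-separated sample
    sample = "name;age;city\nalice;30;paris\nbob;25;lyon\n"
    # Act: run the detection
    delimiter = detect_delimiter(sample)
    # Assert: semicolon wins
    assert delimiter == ";"

def test_detect_delimiter_rejects_empty_input() -> None:
    # Edge case: Sniffer raises csv.Error on an empty sample
    try:
        detect_delimiter("")
    except csv.Error:
        pass
    else:
        raise AssertionError("expected csv.Error for empty input")

# Run directly for illustration; pytest would collect these automatically
test_detect_delimiter_semicolon()
test_detect_delimiter_rejects_empty_input()
```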
## Debugging and Troubleshooting

Enable debug mode for detailed error information:

```shell
export DEBUG=1
python bin/run_analysis.py --debug
```

Environment issues:

```shell
# Check environment status
conda env list | grep csv-profiler

# Recreate an environment
python bin/setup_environments.py recreate csv-profiler-main

# Test specific environment imports
conda activate csv-profiler-main
python -c "import pandas, numpy, scipy; print('Main env OK')"
```

Memory issues:

- Reduce the chunk size in `config/master_config.yml`
- Monitor memory usage with debug mode
- Use smaller test files for development
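The chunking idea behind that setting can be illustrated with the standard library alone — a sketch only; the real engines take their chunk size from `config/master_config.yml`:

```python
import csv
import io
from typing import Iterator, List

def iter_chunks(reader: Iterator[List[str]],
                chunk_size: int) -> Iterator[List[List[str]]]:
    """Yield rows in fixed-size chunks so memory use stays bounded."""
    chunk: List[List[str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk

data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
reader = csv.reader(data)
header = next(reader)  # skip the header row
chunks = list(iter_chunks(reader, chunk_size=3))
print([len(c) for c in chunks])  # → [3, 1]
```

Only one chunk is held in memory at a time, which is why a smaller chunk size lowers peak memory at the cost of more iterations.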
## Contributing to Core Components

### Adding a New Engine

1. Create the engine file in the appropriate `engines/` subdirectory
2. Inherit from the `BaseProfiler` abstract base class
3. Implement the required methods: `generate_report()`, `get_report_name()`
4. Add an environment specification to `config/master_config.yml`
5. Update lazy loading in `autocsv_profiler/__init__.py`
6. Add tests for the new engine functionality
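A skeleton of such an engine might look like the sketch below. The `BaseProfiler` stand-in is defined locally for illustration — the project's real base class lives in `autocsv_profiler` and may have a different constructor and additional hooks:

```python
from abc import ABC, abstractmethod

class BaseProfiler(ABC):
    """Stand-in for the project's abstract base class (illustrative only)."""

    def __init__(self, csv_path: str, delimiter: str, output_dir: str) -> None:
        self.csv_path = csv_path
        self.delimiter = delimiter
        self.output_dir = output_dir

    @abstractmethod
    def generate_report(self) -> str:
        """Produce the report and return its output path."""

    @abstractmethod
    def get_report_name(self) -> str:
        """Return a human-readable report name."""

class MyEngineProfiler(BaseProfiler):
    """Hypothetical new engine implementing the two required methods."""

    def get_report_name(self) -> str:
        return "My Engine Report"

    def generate_report(self) -> str:
        # A real engine would read the CSV and write an HTML report here
        return f"{self.output_dir}/my_engine_report.html"

engine = MyEngineProfiler("data.csv", ",", "output")
print(engine.get_report_name())
print(engine.generate_report())
```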
### Configuration Changes

1. Update the master config in `config/master_config.yml`
2. Regenerate environments using `setup_environments.py generate`
3. Test all environments after configuration changes
4. Update documentation if configuration options change
### Performance Considerations

- Profile code with appropriate tools
- Test with large files (>100MB)
- Monitor memory usage during development
- Benchmark changes against baseline performance
- Document performance implications in code and commits
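For quick local measurements, the standard library already covers timing and peak memory — a sketch only; real benchmarks should use representative large files and a stable baseline:

```python
import time
import tracemalloc

def measure(fn, *args):
    """Return (result, elapsed_seconds, peak_bytes) for a single call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Example workload: summing a generated sequence of numbers
result, elapsed, peak = measure(sum, range(1_000_000))
print(result, round(elapsed, 4), peak)
```

Recording numbers like these before and after a change gives the baseline comparison the guidelines above ask for.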
## Resources

Key docs: ARCHITECTURE.md (technical architecture and dependency conflict analysis).

External: Pre-commit, Conventional Commits.
This development guide is maintained alongside the codebase. Please keep it updated as development practices evolve.