Development Guide

This guide provides information for developers working on the AutoCSV Profiler Suite.

Development Environment Setup
Code Style Guidelines
Commit Message Conventions
Branch Naming Conventions
Review Process
Release Procedures
Development Workflow
Testing Strategy
Debugging and Troubleshooting
Contributing to Core Components
Resources

Development Environment Setup

Prerequisites

Anaconda or Miniconda (required for multi-environment architecture)
Python 3.10 or higher for base environment
Git for version control
At least 3GB free disk space (2GB for conda environments, 1GB for data/outputs)

Initial Setup

Complete the base installation first by following the Installation Guide.

Additional development setup:

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Verify development tools
conda env list | grep csv-profiler
pytest --version
mypy --version

Environment Structure

The project uses 4 isolated conda environments:

Base Environment (Python 3.10+): Development tools, orchestration
csv-profiler-main (Python 3.11): Core statistical analysis
csv-profiler-profiling (Python 3.10): YData Profiling, SweetViz
csv-profiler-dataprep (Python 3.10): DataPrep EDA

Development Tools

Code Tools: Black, isort, MyPy, flake8, pytest

Code Style Guidelines

Formatting Standards

Line Length: 88 characters (Black default)
Indentation: 4 spaces (no tabs)
String Quotes: Double quotes preferred, single quotes acceptable
Import Organization: Standard library, third-party, local imports

Type Annotations

Required: All public functions and methods
Coverage: Maintain type annotation coverage, verified with MyPy
Style: Use typing module for complex types
Optional Dependencies: Use TYPE_CHECKING for import isolation
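
The TYPE_CHECKING guideline can be illustrated with a minimal sketch. It assumes pandas is the optional dependency, and `count_rows` is a hypothetical helper that does not come from the project:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers such as MyPy; never imported at runtime,
    # so this module loads even in an environment without pandas.
    import pandas as pd


def count_rows(frame: "pd.DataFrame") -> int:
    """Return the number of rows in a DataFrame (hypothetical helper)."""
    # The quoted annotation is not evaluated at runtime, so "pd" does not
    # need to exist when the function is defined or called.
    return len(frame)
```

The annotation is written as a string because `pd` is undefined at runtime; MyPy still resolves it to the real pandas type.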

Documentation Standards

Docstrings: Google-style format required
Examples: Include usage examples in docstrings
Type Information: Document parameter and return types
Edge Cases: Document limitations and edge cases
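
As an illustration of these documentation standards, here is a sketch of a Google-style docstring. The function `detect_delimiter` and its counting heuristic are assumptions made up for this example, not the project's delimiter detection:

```python
def detect_delimiter(
    sample: str, candidates: tuple[str, ...] = (",", ";", "\t", "|")
) -> str:
    """Pick the candidate delimiter that occurs most often in a text sample.

    Args:
        sample: The first lines of a CSV file, as a single string.
        candidates: Delimiters to consider, in order of preference.

    Returns:
        The candidate with the highest count. Ties go to the earlier
        candidate.

    Raises:
        ValueError: If none of the candidates occurs in ``sample``.

    Note:
        Plain counting ignores quoting, so delimiters inside quoted fields
        are counted as well.

    Example:
        >>> detect_delimiter("a;b;c")
        ';'
    """
    counts = {candidate: sample.count(candidate) for candidate in candidates}
    # max() returns the first maximal item, which implements the tie rule.
    best = max(candidates, key=lambda candidate: counts[candidate])
    if counts[best] == 0:
        raise ValueError("no candidate delimiter found in sample")
    return best
```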

Code Guidelines

No wildcard imports
Guard imports of optional dependencies
Manage memory carefully when processing large files
Keep each engine isolated in its own environment
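
One possible way to guard optional imports is sketched below. The helper names `optional_import` and `require` are hypothetical, and the error message wording is an assumption; the project may handle this differently:

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module_name: str, environment: str) -> ModuleType:
    """Import a module or fail with a hint about which environment has it."""
    module = optional_import(module_name)
    if module is None:
        raise RuntimeError(
            f"'{module_name}' is not installed; "
            f"activate the '{environment}' environment first."
        )
    return module
```

An engine module could then call, for example, `require("sweetviz", "csv-profiler-profiling")` at the point of use rather than at import time, so the package stays importable from the other environments.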

Commit Message Conventions

Use conventional commit format for consistency and automated changelog generation:

Format

<type>(<scope>): <description>

[optional body]

[optional footer(s)]

Types

feat: New feature
fix: Bug fix
docs: Documentation changes
style: Code style changes (no logic changes)
refactor: Code refactoring
test: Adding or updating tests
chore: Maintenance tasks

Scopes

engine: Engine-related changes (main, profiling, dataprep)
ui: User interface components
config: Configuration system
core: Core utilities and base classes
tests: Test-related changes
docs: Documentation updates

Examples

feat(engine): add memory optimization for large CSV files
fix(ui): resolve delimiter detection issue with special characters
docs(api): update BaseProfiler class documentation
refactor(core): improve error handling in validation module
test(integration): add multi-environment test coverage
chore(deps): update conda environment specifications
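
The header format can be checked mechanically. The following is a minimal sketch, not a project tool: the function name is hypothetical, the scope is treated as optional (as in the Conventional Commits specification), and scopes are not restricted to the list above because the examples also use scopes such as `api` and `deps`:

```python
import re

# Commit types listed in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# <type>(<scope>): <description>, with the scope optional (an assumption).
HEADER_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"
    r": (?P<description>\S.*)$"
)


def is_valid_header(header: str) -> bool:
    """Return True when the first line of a commit message fits the format."""
    return HEADER_PATTERN.match(header) is not None


print(is_valid_header("feat(engine): add memory optimization for large CSV files"))
# prints True
print(is_valid_header("feature: add memory optimization"))
# prints False ("feature" is not one of the listed types)
```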

Branch Naming Conventions

Format

<type>/<short-description>

Types

feature: New functionality
bugfix: Bug fixes
hotfix: Critical fixes
docs: Documentation updates
refactor: Code refactoring
test: Test improvements
chore: Maintenance tasks

Examples

feature/memory-optimization
bugfix/delimiter-detection-error
docs/api-reference-update
refactor/engine-base-class
test/performance-benchmarks
chore/dependency-updates

Review Process

Code Review Requirements

All changes require review before merging
Tests required for new functionality
Documentation updates for API changes
Performance impact assessment for core changes
Security review for file processing changes

Review Checklist

Code follows style guidelines
Tests pass and coverage maintained
Documentation updated
No security vulnerabilities introduced
Performance impact acceptable
Backward compatibility maintained
Multi-environment compatibility verified

Review Process

Create pull request from feature branch
Automated checks run (CI/CD, pre-commit hooks)
Code review by maintainer or senior developer
Address feedback and update code as needed
Final approval and merge to main branch

Release Procedures

Version Management

Version Location: autocsv_profiler/version.py
Format: Semantic versioning (MAJOR.MINOR.PATCH)
Current Version: 2.0.0

Release Checklist

Pre-Release

All tests pass in all environments
Documentation updated
CHANGELOG.md updated with changes
Version number incremented in version.py
Environment specifications tested
Security scan passed

Release Process

Update version in autocsv_profiler/version.py
Update CHANGELOG.md with release notes
Create release tag using semantic versioning
Update documentation with new version info
Verify conda environments work with release
Create GitHub release with release notes

Post-Release

Verify release artifacts
Monitor for issues
Update project documentation links
Announce release (if major version)

Version Numbering

Major: Breaking changes, including incompatible API changes
Minor: New features, backward compatible
Patch: Bug fixes, minor improvements
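
The numbering rules translate into a small helper. This is an illustrative sketch only; `bump_version` is a hypothetical function and is not part of autocsv_profiler/version.py:

```python
def bump_version(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given part."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("2.0.0", "minor"))  # prints 2.1.0
print(bump_version("2.1.3", "major"))  # prints 3.0.0
```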

Development Workflow

Update local repository
```
git pull origin main
```

Create feature branch

git checkout -b feature/your-feature-name

Make changes in the appropriate environment (see the Environment-Specific Development section below)

Run quality checks

# Format code
black autocsv_profiler/ tests/ bin/
isort autocsv_profiler/ tests/ bin/

# Type checking (environment-specific)
mypy --config-file=mypy_main.ini autocsv_profiler/

# Linting
flake8 autocsv_profiler/ bin/

Run tests

# All tests
pytest

# Fast tests only
pytest -m "not slow"

# Specific test categories
pytest -m unit
pytest -m integration

Commit changes

git add .
git commit -m "feat(engine): add new feature description"

Push and create pull request

git push origin feature/your-feature-name

Environment-Specific Development

For complete engine testing commands and examples, see the Engine Testing Guide.

Quick reference for development testing:

# Main engine
conda activate csv-profiler-main
python autocsv_profiler/engines/main/analyzer.py test.csv "," output/

# Profiling engines (YData and SweetViz)
conda activate csv-profiler-profiling
python autocsv_profiler/engines/profiling/ydata_report.py test.csv "," output/

# DataPrep engine
conda activate csv-profiler-dataprep
python autocsv_profiler/engines/dataprep/dataprep_report.py test.csv "," output/

Testing Strategy

Test Organization

Unit Tests (tests/unit/): Component isolation
Integration Tests (tests/integration/): Cross-component workflows
Functional Tests (tests/functional/): End-to-end features
Performance Tests (tests/performance/): Resource validation

Running Tests

# All tests (coverage is collected when --cov is set in the pytest configuration)
pytest

# Fast tests only (exclude slow tests)
pytest -m "not slow"

# Specific test categories
pytest -m unit
pytest -m integration
pytest -m performance

# Parallel testing
pytest -n auto

# HTML coverage report
pytest --cov-report=html

Writing Tests

Follow AAA pattern: Arrange, Act, Assert
Use fixtures for test data and setup
Test edge cases and error conditions
Maintain test coverage of at least 50%
Mock external dependencies appropriately
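
A test following the Arrange, Act, Assert pattern might look like the sketch below. The function under test, `count_data_rows`, is hypothetical and defined inline so that the example is self-contained; it uses only the standard library:

```python
import csv
import tempfile
from pathlib import Path


def count_data_rows(csv_path: Path, delimiter: str = ",") -> int:
    """Count rows after the header (hypothetical function under test)."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader, None)  # skip the header; tolerate an empty file
        return sum(1 for _ in reader)


def test_count_data_rows_ignores_header() -> None:
    # Arrange: write a small CSV file into a temporary directory.
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = Path(tmp_dir) / "sample.csv"
        csv_path.write_text("id,name\n1,a\n2,b\n", encoding="utf-8")

        # Act: run the code under test.
        row_count = count_data_rows(csv_path)

    # Assert: only the two data rows are counted.
    assert row_count == 2


def test_count_data_rows_handles_empty_file() -> None:
    # Arrange: an edge case with no header and no rows.
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = Path(tmp_dir) / "empty.csv"
        csv_path.write_text("", encoding="utf-8")

        # Act
        row_count = count_data_rows(csv_path)

    # Assert
    assert row_count == 0
```

Under pytest, the built-in `tmp_path` fixture can replace the manual temporary directory in the Arrange step.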

Debugging and Troubleshooting

Debug Mode

Enable debug mode for detailed error information:

export DEBUG=1
python bin/run_analysis.py --debug

Common Issues

Environment Problems

# Check environment status
conda env list | grep csv-profiler

# Recreate environment
python bin/setup_environments.py recreate csv-profiler-main

Import Errors

# Test specific environment imports
conda activate csv-profiler-main
python -c "import pandas, numpy, scipy; print('Main env OK')"

Memory Issues

Reduce chunk size in config/master_config.yml
Monitor memory usage with debug mode
Use smaller test files for development
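
To show why a smaller chunk size lowers peak memory, here is a standard-library sketch of chunked CSV reading. It is not the project's implementation, and `iter_chunks` is a hypothetical name; only one chunk of rows is held in memory at a time:

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most chunk_size rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls the next chunk_size rows without reading the rest.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
print([len(chunk) for chunk in iter_chunks(sample, chunk_size=2)])
# prints [2, 1]
```

pandas offers the same idea through the `chunksize` parameter of `read_csv`.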

Contributing to Core Components

Adding New Engines

Create engine file in appropriate engines/ subdirectory
Inherit from BaseProfiler abstract base class
Implement required methods: generate_report(), get_report_name()
Add environment specification to config/master_config.yml
Update lazy loading in autocsv_profiler/__init__.py
Add tests for new engine functionality
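
The engine steps above can be sketched as follows. The `BaseProfiler` class here is a stand-in written for this example: only the two method names come from this guide, while the constructor arguments, return types, and the `RowCountProfiler` engine are assumptions. Check the real base class before copying any signature.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseProfiler(ABC):
    """Stand-in for the project's base class; real signatures may differ."""

    def __init__(self, csv_path: Path, delimiter: str, output_dir: Path) -> None:
        self.csv_path = csv_path
        self.delimiter = delimiter
        self.output_dir = output_dir

    @abstractmethod
    def generate_report(self) -> Path:
        """Create the report and return its path."""

    @abstractmethod
    def get_report_name(self) -> str:
        """Return the file name of the report."""


class RowCountProfiler(BaseProfiler):
    """Hypothetical engine that writes the number of data lines to a file."""

    def get_report_name(self) -> str:
        return "row_count_report.txt"

    def generate_report(self) -> Path:
        with self.csv_path.open(encoding="utf-8") as handle:
            # Count physical lines and exclude the header line.
            row_count = max(sum(1 for _ in handle) - 1, 0)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        report_path = self.output_dir / self.get_report_name()
        report_path.write_text(f"rows: {row_count}\n", encoding="utf-8")
        return report_path
```

Because both methods are abstract, a subclass that omits either one cannot be instantiated, which catches incomplete engines early.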

Modifying Configuration

Update master config in config/master_config.yml
Regenerate environments using bin/setup_environments.py generate
Test all environments after configuration changes
Update documentation if configuration options change

Performance Optimization

Profile code with appropriate tools
Test with large files (>100MB)
Monitor memory usage during development
Benchmark changes against baseline performance
Document performance implications in code and commits
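
For quick baseline comparisons, a small measuring helper built on the standard library can be enough. This is a sketch under assumptions: `measure` is a hypothetical name, and the project may rely on other profiling tools.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> tuple[Any, float, int]:
    """Run func and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes


result, elapsed, peak_bytes = measure(sum, range(1_000_000))
print(f"result={result} elapsed={elapsed:.4f}s peak={peak_bytes} bytes")
```

tracemalloc only sees allocations made through Python's allocator and slows execution while active, so treat the numbers as relative values for before-and-after comparisons rather than absolute measurements.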

Resources

Key docs: ARCHITECTURE.md - Technical architecture and dependency conflict analysis
External: Pre-commit, Conventional Commits

This development guide is maintained alongside the codebase. Please keep it updated as development practices evolve.
