Datacrafter is an open-source ETL (Extract, Transform, Load) tool with a focus on NoSQL data formats. It provides a command-line interface for building data pipelines that extract data from various sources, transform it, and load it into different destinations.
Note: This project is in alpha stage. Code migration from a closed repository is in progress, and documentation is being continuously improved.
- NoSQL-first approach: JSON Lines and BSON are the default file formats
- Task chaining and data pipelines: Build complex ETL workflows
- Command-line first: Full CLI interface for automation and scripting
- Automated data extraction: Extract data from APIs, files, and web sources
- Semantic type identification: Automatic data type detection and conversion
- Automatic documentation generation: Generate documentation for your data pipelines
- Data discovery: Discover data formats and possible transformations
- Multiple source and destination formats: Support for various file formats and databases
Install from PyPI:

```bash
pip install datacrafter
```

Or install from source:

```bash
git clone https://github.com/apicrafter/datacrafter.git
cd datacrafter
pip install -e .
```

Requirements:

- Python 3.8 or higher
- See requirements.txt for the full dependency list
Initialize a new project:

```bash
datacrafter init my-project
cd my-project
```

This creates a new project directory with a datacrafter.yml configuration file.

Edit datacrafter.yml to define your data pipeline:
version: "1"
project-name: "my-project"
project-id: "unique-id"
extractor:
mode: "singlefile"
type: "file-csv"
method: "url"
config:
url: "https://example.com/data.csv"
processor:
config:
autoid: true
autotype: true
keymap:
type: "names"
fields:
old_column: "new_column"
destination:
type: "file-jsonl"
fileprefix: "output"datacrafter rundatacrafter statusdatacrafter init [--path PATH] [--name NAME]- Initialize a new projectdatacrafter run [--path PATH] [--verbose] [--quiet]- Execute the data pipelinedatacrafter status [--path PATH]- Show status of latest pipeline executiondatacrafter check [--path PATH]- Validate configuration and environmentdatacrafter clean [--path PATH] [--storage]- Remove temporary filesdatacrafter log [--path PATH] [--lines N]- Show log of latest operationsdatacrafter version- Show version information
Configuration commands:

- `datacrafter config validate [--path PATH]` - Validate the project configuration
- `datacrafter config schema` - Show the expected configuration file schema
Data and service commands:

- `datacrafter schema` - Generate and print the data schema
- `datacrafter metrics` - Show dataset statistics and analysis
- `datacrafter builds` - Manage builds (create, remove, list)
- `datacrafter push` - Push data to remote storage
- `datacrafter ui` - Launch the web user interface
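A typical session chains these commands: initialize a project, validate its configuration, run the pipeline, then inspect the result. The sketch below uses only commands and flags from the reference above; the project name and flag values are illustrative.

```bash
# Create a project and move into it (the project name is illustrative)
datacrafter init my-project
cd my-project

# Edit datacrafter.yml, then validate the configuration and environment
datacrafter config validate
datacrafter check

# Execute the pipeline with verbose output
datacrafter run --verbose

# Inspect the latest execution and its log
datacrafter status
datacrafter log --lines 50
```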
Extractors pull data from various sources (see the configuration sketch after this list):
- Local or remote files: CSV, JSON, XML, XLS/XLSX, BSON, JSONL
- APIs:
- REST API (Work in progress)
- APIBackuper compatible (Done)
- RSS/Atom Feed (Work in progress)
- CMS:
- WordPress (Work in progress)
- Microsoft SharePoint (Planned)
- Common APIs (Planned):
- Email, FTP, SFTP
- Online services (Planned):
- Yandex Metrika, Yandex.Webmaster
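For illustration, a minimal extractor block for a single remote XML file might look like the sketch below. The URL is a placeholder, and this particular combination of mode, type, and method values is an assumption drawn from the configuration reference later in this README.

```yaml
# Hypothetical extractor block (not taken verbatim from the project docs):
# fetch a single remote XML file by URL. The URL is a placeholder.
extractor:
  mode: "singlefile"
  type: "file-xml"
  method: "url"
  config:
    url: "https://example.com/data.xml"
```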
Sources are files or databases created by extractors:
File Sources:
- JSON Lines ✅
- CSV ✅
- BSON ✅
- XLS/XLSX ✅
- XML ✅
- JSON (Work in progress)
- YAML (Work in progress)
- SQLite (Work in progress)
Database Sources (Planned):
- SQL databases via SQLAlchemy
- PostgreSQL, ClickHouse
- MongoDB, ArangoDB, ElasticSearch/OpenSearch
Processors transform data during the pipeline (a combined processor configuration is sketched after this list):
- Mappers: Map data fields from one schema to another
  - `keymap`: Replace key/column names ✅
  - `typemap`: Convert data types ✅
- Custom code: Python scripts for data manipulation ✅
- Custom tools: Command-line tools for data manipulation (Work in progress)
- Enrichers: Data and metadata enrichment (Planned)
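As a sketch, the mappers and a custom script can be combined in a single processor block. The field names, type conversions, and script path below are placeholders, and the nesting mirrors the configuration reference shown later in this README.

```yaml
# Hypothetical processor block combining keymap, typemap, and a custom script.
# Field names and the script path are placeholders.
processor:
  config:
    autoid: true           # auto-generate record IDs
    autotype: true         # auto-detect data types
  keymap:                  # rename keys/columns
    type: "names"
    fields:
      old_column: "new_column"
  typemap:                 # convert data types
    price: "float"
    created_at: "datetime"
  custom:                  # run a project-local Python script
    type: "script"
    code: "scripts/clean.py"
```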
Destinations store the processed data (a sample destination configuration follows the file and database lists):
File Destinations:
- BSON ✅
- JSON Lines ✅
- CSV ✅
- Parquet (Work in progress)
- JSON (Work in progress)
- YAML (Planned)
- DataPackage/Frictionless Data (Planned)
Database Destinations:
- MongoDB (Work in progress)
- ArangoDB (Planned)
- ClickHouse (Planned)
- Any SQL via SQLAlchemy (Planned)
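For example, writing gzip-compressed JSON Lines output could be configured as in the sketch below; the file prefix is a placeholder, and the option names and values are taken from the configuration reference later in this README.

```yaml
# Hypothetical destination block: gzip-compressed JSON Lines output.
# The file prefix is a placeholder.
destination:
  type: "file-jsonl"
  fileprefix: "cleaned-data"
  compress: "gz"
```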
Storage Options:
- Local filesystem ✅
- S3, FTP, SFTP (Planned)
- WebDAV, Google Drive, Dropbox, Yandex.Disk (Planned)
Alerting mechanisms (Planned):
- Email alerts
- Other notification methods
A datacrafter project typically has this structure:
```
my-project/
├── datacrafter.yml    # Project configuration
├── data/              # Extracted data
├── output/            # Processed output
├── state.json         # Execution state
└── datacrafter.log    # Execution logs
```
See the full configuration schema with:

```bash
datacrafter config schema
```

Or check the annotated example below:
version: "1"
project-name: "my-project"
project-id: "unique-id"
extractor:
mode: "singlefile" # singlefile, api, code
type: "file-csv" # file-csv, file-json, file-xml, etc.
method: "url" # url, urlbypattern, apibackuper
force: true # Force re-download
config:
url: "https://example.com/data.csv"
processor:
config:
autoid: true # Auto-generate IDs
autotype: false # Auto-detect types
error_strategy: "skip" # skip, fail, retry
max_retries: 3
keymap: # Optional field mapping
type: "names"
fields:
old_name: "new_name"
typemap: # Optional type conversion
field_name: "int" # int, float, date, datetime, bool
custom: # Optional custom code
type: "script"
code: "path/to/script.py"
destination:
type: "file-jsonl" # file-jsonl, file-csv, file-bson, mongodb, etc.
fileprefix: "output"
compress: "gz" # Optional: gz, bz2, xz, zip, zstFor complete examples, see: https://github.com/apicrafter/datacrafter-examples
version: "1"
project-name: "csv-to-jsonl"
project-id: "example-1"
extractor:
mode: "singlefile"
type: "file-csv"
method: "url"
config:
url: "https://example.com/data.csv"
processor:
config:
autoid: true
autotype: true
destination:
type: "file-jsonl"
fileprefix: "output"version: "1"
project-name: "api-to-mongo"
project-id: "example-2"
extractor:
mode: "api"
type: "api"
method: "apibackuper"
config:
endpoint: "https://api.example.com/data"
processor:
config:
autoid: true
keymap:
type: "names"
fields:
api_id: "_id"
api_name: "name"
destination:
type: "mongodb"
connstr: "mongodb://localhost:27017"
dbname: "mydb"
tablename: "mydata"pytest# Linting with pylint
pylint datacrafter/
# Linting with flake8
flake8 datacrafter/
# Type checking (if using mypy)
mypy datacrafter/Code Quality Status: The codebase maintains a pylint score of 9.12/10, with comprehensive code quality improvements implemented in version 1.0.4.
Contributions are welcome! Please see the IMPROVEMENTS.md file for areas that need work.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Dependencies - Dependency management guide
- IMPROVEMENTS.md - Known issues and improvement suggestions
- CHANGELOG.md - Version history
Licensed under the Apache License 2.0. See LICENSE for details.
- Issues: https://github.com/apicrafter/datacrafter/issues
- Examples: https://github.com/apicrafter/datacrafter-examples
Author: Ivan Begtin
Status: Alpha - Active development in progress