📘 Portuguese version available here → README.md
The CNPJ Data Extractor is an open-source project that automates the download, extraction, and transformation of CNPJ (Brazilian company registry) datasets from public sources. The project is divided into two parts:
- Data Extraction: Automatically download and extract partitioned CNPJ datasets.
- Data Merging: Combine the partitioned tables into consolidated datasets for further processing or analysis.
- Automated Data Download: Multithreaded download of datasets with remote size check to avoid redundant downloads.
- Efficient Data Processing: Handles large partitioned datasets and consolidates them into a unified output.
- Flexible Export Formats: Supports CSV and Parquet.
- Modular Configuration: Paths, logs, and export options are easily configurable via a config.yaml file.
.
├── config
│   └── config.yaml        # Configuration file for paths, formats, and data types
├── data_incoming          # Folder for incoming ZIP data files
├── data_outgoing          # Folder for processed output data
├── logs                   # Folder for log files
├── scripts                # Python scripts
│   ├── cnpj_extractor.py  # Script for data extraction (part 1)
│   └── cnpj_merger.py     # Script for merging partitioned tables (part 2)
├── README.md              # Project documentation
└── execute_model.bat      # Example batch script for running the full process (adjust to your environment)
- Python 3.12+
git clone https://github.com/jmfeck/cnpj-data-extractor.git
cd cnpj-data-extractor
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
Before running the scripts, make sure the config.yaml file is properly configured. It contains the base URL, CSV reading parameters, export format, and expected data types for each table.
Example config.yaml:
# Base URL for the CNPJ dataset
base_url: 'https://arquivos.receitafederal.gov.br/dados/cnpj/dados_abertos_cnpj'
# CSV settings
csv_sep: ';'
csv_dec: ','
csv_quote: '"'
csv_enc: 'latin1'
# Export format: 'csv' or 'parquet'
export_format: 'parquet'
# Data types definition for the "empresa" table
dtypes:
  empresa:
    cnpj_basico: "str"
    razao_social: "str"
    natureza_juridica: "str"
    qualificacao_responsavel: "str"
    capital_social: "float"
    porte_empresa: "str"
    ente_federativo_responsavel: "str"
To start the process, run the cnpj_extractor.py script.
This script will:
- Access the base URL defined in config.yaml
- Identify the latest folder using the YYYY-MM pattern
- List all .zip files available in the folder
- Check if each file has already been downloaded (using file size)
- Download only the necessary files using multithreading
- Save all files to the data_incoming/ folder
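The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in cnpj_extractor.py: the function names are hypothetical, and the actual HTTP download is left to a caller-supplied `fetch` function so the sketch stays free of network code.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Folder names are assumed to look like "2025-01" (optionally with a trailing slash).
FOLDER_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def latest_folder(names):
    """Return the most recent name matching YYYY-MM, or None if there is none."""
    # Zero-padded YYYY-MM strings sort chronologically as plain text.
    candidates = [n.strip("/") for n in names if FOLDER_PATTERN.match(n.strip("/"))]
    return max(candidates, default=None)


def needs_download(local_path, remote_size):
    """True when the file is missing or its size differs from the remote size."""
    path = Path(local_path)
    return not path.exists() or path.stat().st_size != remote_size


def download_missing(remote_sizes, dest_dir, fetch, max_workers=4):
    """Fetch, in parallel, only the files that needs_download() flags.

    remote_sizes maps file name -> size in bytes reported by the server.
    fetch(name, target_path) is a hypothetical caller-supplied function
    that performs the actual download.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    pending = [n for n, size in remote_sizes.items() if needs_download(dest / n, size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all downloads and re-raises any error from a worker.
        list(pool.map(lambda n: fetch(n, dest / n), pending))
    return pending
```

Comparing sizes is a cheap check that avoids re-downloading unchanged files, though it cannot detect a changed file whose size happens to stay the same.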
Run with:
python cnpj_extractor.py
After downloading the files, run cnpj_merger.py to process the data.
This script will:
- Read all .zip files from the data_incoming/ folder
- Detect the type of each file based on its prefix (e.g., empresa, estabelecimento, etc.)
- Extract the .csv from each .zip (expects only one CSV per archive)
- Apply data types as defined in config.yaml
- Merge the data of each type into a single consolidated file
- Export the result to the data_outgoing/ folder in the configured format (csv or parquet)
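As a rough illustration of the detection, extraction, and merge steps, here is a sketch that uses only the Python standard library. It is not the code in cnpj_merger.py, which may well rely on a dataframe library instead. The function names and the short prefix list are assumptions, and type conversion and export are left out.

```python
import csv
import io
import zipfile
from pathlib import Path

# Illustrative prefixes only; the real list of tables may be longer.
KNOWN_PREFIXES = ("empresa", "estabelecimento")


def detect_table(zip_name):
    """Return the table prefix a file name starts with, or None."""
    stem = Path(zip_name).stem.lower()
    for prefix in KNOWN_PREFIXES:
        if stem.startswith(prefix):
            return prefix
    return None


def read_single_csv(zip_path, sep=";", quote='"', encoding="latin1"):
    """Return all rows of the single file stored inside a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()
        if len(members) != 1:
            raise ValueError(f"{zip_path}: expected 1 file, found {len(members)}")
        with archive.open(members[0]) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.reader(text, delimiter=sep, quotechar=quote))


def merge_by_table(zip_paths):
    """Group rows from every recognised archive under its table prefix."""
    merged = {}
    for path in sorted(zip_paths):
        table = detect_table(Path(path).name)
        if table is None:
            continue  # skip files with an unrecognised prefix
        merged.setdefault(table, []).extend(read_single_csv(path))
    return merged
```

The defaults for `sep`, `quote`, and `encoding` mirror the csv_sep, csv_quote, and csv_enc values from the example config.yaml. The sketch keeps every row in memory, which is fine for a demonstration but would not scale to the full datasets.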
Run with:
python cnpj_merger.py
Currently supported export formats:
- csv
- parquet
Support for other formats like JSON or Feather may be added in the future.
Log files are automatically saved in the logs/ folder, allowing you to monitor errors, progress, and execution time.
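For readers curious how file logging of this kind can be set up, the following is a minimal sketch using Python's standard logging module. The helper names, log file naming, and message format are assumptions and may differ from what the project's scripts actually do.

```python
import logging
import time
from pathlib import Path


def setup_logger(name, log_dir="logs"):
    """Create a logger that appends to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_timed(logger, label, func, *args, **kwargs):
    """Run func, logging its duration on success and the traceback on failure."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        logger.exception("%s failed", label)
        raise
    logger.info("%s finished in %.2fs", label, time.perf_counter() - start)
    return result
```

A call such as `run_timed(setup_logger("cnpj_merger"), "merge", some_function)` would then record either the elapsed time or the error for that step, where `some_function` stands for whatever step is being measured.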
Contributions are welcome! Feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.