socio4healthR is an R wrapper for the Python socio4health library. It is an extraction, transformation, and loading (ETL) and classification tool that simplifies collecting and merging data from multiple sources into a harmonized dataset, with a focus on sociodemographic and census datasets from Colombia, Brazil, and Peru.
- Data Extraction: Seamlessly retrieve data from online sources via web scraping or from local files
- Format Support: CSV, Excel, JSON, Parquet, SPSS, geospatial files, and fixed-width format files
- Data Harmonization: Align and merge datasets using column mapping and value standardization
- Automatic Type Conversion: Seamlessly convert between Dask DataFrames, pandas, and R data.frames
- Advanced Features: Text classification with BERT, dictionary standardization, and data filtering
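To make "column mapping and value standardization" concrete, here is a minimal base R sketch of the idea on a toy data frame. It is only an illustration and does not use or reproduce the package's own harmonization functions. The column names, the mapping, and the value dictionary are all hypothetical.

```r
# Toy survey extract with source-specific column names and codes (made-up data)
raw <- data.frame(
  SEXO = c("1", "2", "1"),
  EDAD = c(34, 51, 28),
  stringsAsFactors = FALSE
)

# Hypothetical column mapping: source name -> harmonized name
column_map <- c(SEXO = "sex", EDAD = "age")

# Hypothetical value dictionary for the harmonized "sex" column
sex_labels <- c("1" = "male", "2" = "female")

# Step 1: rename every column that appears in the mapping
harmonized <- raw
matched <- names(harmonized) %in% names(column_map)
names(harmonized)[matched] <- unname(column_map[names(harmonized)[matched]])

# Step 2: recode raw codes through the dictionary
harmonized$sex <- unname(sex_labels[harmonized$sex])

print(harmonized)
```

Keeping the mapping and the dictionary as plain named vectors makes them easy to review and to reuse across sources.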
Before installing socio4healthR, ensure you have:
- R >= 4.1.0
- Python >= 3.8
- The socio4health Python package
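The prerequisites above can be checked from an R session before installing. The following is a minimal sketch using base R only. It assumes Python is reachable on the PATH as `python3` or `python`, and that the Python package is importable under the module name `socio4health`. Both are assumptions, not something this README states.

```r
# R version requirement from the prerequisites list
r_ok <- getRversion() >= "4.1.0"

# Locate a Python executable; Sys.which() returns "" when nothing is found
python_candidates <- Sys.which(c("python3", "python"))
python_bin <- unname(python_candidates[nzchar(python_candidates)][1])

# Helper: run a Python one-liner and report whether it exited with status 0
run_python <- function(code) {
  quote_type <- if (.Platform$OS.type == "windows") "cmd" else "sh"
  status <- suppressWarnings(system2(
    python_bin,
    c("-c", shQuote(code, type = quote_type)),
    stdout = FALSE, stderr = FALSE
  ))
  identical(as.integer(status), 0L)
}

python_ok <- FALSE
s4h_ok <- FALSE
if (!is.na(python_bin)) {
  # Ask Python itself whether it is at least version 3.8
  python_ok <- run_python("import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)")
  # Assumed module name: socio4health
  s4h_ok <- run_python("import socio4health")
}

checks <- c(r = r_ok, python = python_ok, socio4health = s4h_ok)
print(checks)
```

A `FALSE` entry points to the prerequisite that still needs attention. If the wrapper is configured to use a different Python environment than the one on the PATH, this check may not reflect what the package actually sees.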
Install the Python package:
    pip install socio4health

Then install socio4healthR from GitHub in R:

    # Install devtools if not already installed
    if (!requireNamespace("devtools", quietly = TRUE)) {
      install.packages("devtools")
    }

    # Install from GitHub
    devtools::install_github("harmonize-tools/socio4healthR")

To use socio4healthR, follow these steps:
- Load the package in your R script:

      library(socio4healthR)

- Create an instance of the Extractor class:

      extractor <- s4h_extractor(input_path = "./path/to/data")

- Extract data and create a list of DataFrames:

      data_list <- s4h_run_extract(
        extractor = extractor,
        return_as = "data.frame"  # Can be "dask", "pandas", or "data.frame"
      )

      # Create a Harmonizer
      harmonizer <- s4h_harmonizer()

      # Harmonize your data
      merged_data <- s4h_vertical_merge(harmonizer, data_list)

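As a rough picture of what a vertical (row-wise) merge of a list of data frames involves, here is a base R sketch on toy data. It is a conceptual illustration under simple assumptions (union of columns, missing values filled with `NA`). It is not the implementation of `s4h_vertical_merge`, which may apply additional rules. The data frames and their columns are made up.

```r
# Two toy data frames standing in for elements of a data list (made-up data)
df_a <- data.frame(id = 1:2, age = c(34, 51))
df_b <- data.frame(id = 3:4, region = c("north", "south"))
toy_list <- list(df_a, df_b)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(toy_list, names)))

# Add missing columns as NA so every data frame has the same layout
aligned <- lapply(toy_list, function(df) {
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- NA
  }
  df[all_cols]
})

# Stack the aligned data frames row-wise
stacked <- do.call(rbind, aligned)
print(stacked)
```

Checking `names()` across the extracted data frames in this way before merging can reveal columns that differ only in spelling and may need a mapping first.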
For more detailed examples and use cases, please refer to the socio4health documentation.
Package Website
The socio4health package website includes an API reference, a user guide, and examples. The site mainly covers the release version, but you can also find documentation for the latest development version.
Organisation Website
Harmonize is an international project that develops cost-effective and reproducible digital tools for stakeholders in Latin America and the Caribbean (LAC) affected by a changing climate. These stakeholders include cities, small islands, highlands, and the Amazon rainforest.
The project consists of resources and tools developed in conjunction with different teams from Brazil, Colombia, Dominican Republic, Peru, and Spain.
The authors and contributors are listed below for users who have questions or feedback.
Diego Irreño (developer)
Erick Lozano (developer)
Juan Montenegro (developer, R package maintainer)
Ingrid Mora (documentation)