socio4healthR

R Wrapper for socio4health

Lifecycle: maturing | License: MIT

Overview

socio4healthR is an R wrapper for the Python socio4health library, an extraction, transformation, and loading (ETL) and classification tool. It simplifies collecting and merging data from multiple sources into a harmonized dataset, with a focus on sociodemographic and census datasets from Colombia, Brazil, and Peru.

Key Features

  • Data Extraction: Seamlessly retrieve data from online sources via web scraping or from local files
  • Format Support: CSV, Excel, JSON, Parquet, SPSS, geospatial files, and fixed-width format files
  • Data Harmonization: Align and merge datasets using column mapping and value standardization
  • Automatic Type Conversion: Seamlessly convert between Dask DataFrames, pandas, and R data.frames
  • Advanced Features: Text classification with BERT, dictionary standardization, and data filtering
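
To make the harmonization feature more concrete, the following base-R sketch shows the general idea of column mapping, value standardization, and stacking tables row-wise. It is not the package's implementation: the toy tables, the `column_map` vector, and the helpers `rename_columns()` and `stack_tables()` are hypothetical names introduced only for illustration.

```r
# Two toy survey tables with different column names (hypothetical data).
survey_a <- data.frame(SEXO = c("M", "F"), EDAD = c(34, 51))
survey_b <- data.frame(sex = c("f", "m"), age = c(27, 45), region = c("N", "S"))

# Hypothetical column mapping: source column name -> harmonized name.
column_map <- c(SEXO = "sex", EDAD = "age")

# Rename only the columns that appear in the mapping.
rename_columns <- function(df, map) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- unname(map[names(df)[hits]])
  df
}

# Stack tables row-wise, filling columns a table lacks with NA.
stack_tables <- function(tables) {
  all_cols <- unique(unlist(lapply(tables, names)))
  padded <- lapply(tables, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]
  })
  do.call(rbind, padded)
}

tables <- lapply(list(survey_a, survey_b), rename_columns, map = column_map)

# Value standardization: use one coding for the "sex" column.
tables <- lapply(tables, function(df) {
  df$sex <- toupper(df$sex)
  df
})

merged <- stack_tables(tables)
print(merged)
```

In the package itself, this kind of work is presumably delegated to the Python library through functions such as `s4h_vertical_merge()`; the sketch only illustrates the concept.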

Dependencies

  • Dask: A flexible parallel computing library for analytics.
  • Pandas: A well-known open source data analysis and manipulation tool.
  • GeoPandas: Python tools for geographic data.
  • NumPy: The fundamental package for scientific computing with Python.
  • Scrapy: A framework for extracting the data you need from websites.
  • Matplotlib: A library for creating static, animated, and interactive visualizations in Python.
  • Torch: A Python package for tensor computation and deep neural networks.

Installation

Requirements

Before installing socio4healthR, ensure you have:

  • R >= 4.1.0
  • Python >= 3.8
  • Python package socio4health

Install the Python package:

pip install socio4health
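
Before installing the R package, it can help to confirm the requirements from an R session. The sketch below uses base R only. It assumes the Python interpreter is reachable on the PATH as `python3` or `python`, which may not hold on every system, and the interpreter it finds is not necessarily the one the wrapper will use at run time.

```r
# Check the running R version against the stated minimum (4.1.0).
r_ok <- getRversion() >= "4.1.0"

# Look for a Python interpreter on the PATH. The executable names
# "python3" and "python" are assumptions and may differ on your system.
python_path <- Sys.which(c("python3", "python"))
python_path <- unname(python_path[nzchar(python_path)])
python_found <- length(python_path) > 0

message("R >= 4.1.0: ", r_ok)
message("Python interpreter found: ", python_found)

if (python_found) {
  # Ask pip, through the interpreter found above, whether socio4health
  # is installed. An exit status of 0 means the package was found.
  status <- suppressWarnings(
    system2(python_path[1], c("-m", "pip", "show", "socio4health"),
            stdout = FALSE, stderr = FALSE)
  )
  message("socio4health installed: ", isTRUE(status == 0))
}
```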

Install socio4healthR from GitHub

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install from GitHub
devtools::install_github("harmonize-tools/socio4healthR")

How to Use it

To use socio4healthR, follow these steps:

  1. Load the package in your R script:

    library(socio4healthR)
  2. Create an instance of the Extractor class:

    extractor <- s4h_extractor(input_path = "./path/to/data")
  3. Extract the data as a list of DataFrames, then create a Harmonizer and merge them:

    data_list <- s4h_run_extract(
      extractor = extractor,
      return_as = "data.frame"  # Can be "dask", "pandas", or "data.frame"
    )
    
    # Create a Harmonizer
    harmonizer <- s4h_harmonizer()
    
    # Harmonize your data
    merged_data <- s4h_vertical_merge(harmonizer, data_list)

For more detailed examples and use cases, please refer to the socio4health documentation.
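
Before merging, it is often useful to check what the extracted tables look like. The sketch below assumes that `return_as = "data.frame"` yields a plain R list of data frames. So that it runs without the package, it uses a hypothetical stand-in `data_list` instead of real extractor output.

```r
# Stand-in for the list returned by
# s4h_run_extract(..., return_as = "data.frame") (hypothetical data).
data_list <- list(
  data.frame(id = 1:3, age = c(30, 41, 25)),
  data.frame(id = 4:5, age = c(52, 38), region = c("N", "S"))
)

# Summarise each table: row count, column count, and column names.
overview <- data.frame(
  table = seq_along(data_list),
  rows = vapply(data_list, nrow, integer(1)),
  cols = vapply(data_list, ncol, integer(1)),
  columns = vapply(
    data_list,
    function(df) paste(names(df), collapse = ", "),
    character(1)
  )
)
print(overview)

# Columns present in every table; these need no mapping before a merge.
shared_columns <- Reduce(intersect, lapply(data_list, names))
print(shared_columns)
```

Columns that appear in only some tables are the ones most likely to need attention during harmonization.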

Resources

Package Website

The socio4health package website includes an API reference, a user guide, and examples. The site mainly covers the release version, but you can also find documentation for the latest development version.

Organisation Website

Harmonize is an international project that develops cost-effective and reproducible digital tools for stakeholders in Latin America and the Caribbean (LAC) affected by a changing climate. These stakeholders include cities, small islands, highlands, and the Amazon rainforest.

The project consists of resources and tools developed in conjunction with different teams from Brazil, Colombia, Dominican Republic, Peru, and Spain.

Organizations

BSC, Uniandes

Authors / Contact information

Here is the contact information of authors/contributors in case users have questions or feedback.

Diego Irreño (developer)
Erick Lozano (developer)
Juan Montenegro (developer, R package maintainer)
Ingrid Mora (documentation)
