๐Ÿ‘‹๐Ÿฝ Hi, Iโ€™m Cierra

Iโ€™m a data engineer and bioinformatics practitioner with a strong interest in building reproducible data pipelines, scalable analytics, and biologically meaningful models.

  • 👀 Interests: Data engineering, bioinformatics, environmental & health data analytics
  • 🌱 Currently learning: ETL design, data visualization, and scalable ML workflows
  • 💞️ Open to collaboration: Creative or technical projects involving data, biology, or AI
  • 📫 Contact: ibritt.cierra@gmail.com
  • 😄 Pronouns: she/her
  • ⚡ Fun fact: I’m a plant mom who loves crocheting and painting 🌿🧶🎨

This repository showcases hands-on projects that reflect my growth across AI engineering, data engineering, and computational biology.

๐ŸŒ๐Ÿงฌ Data Engineering, AI & Bioinformatics Projects

A growing portfolio of hands-on projects at the intersection of data engineering, machine learning, and computational biology. These projects reflect my progression from data cleaning and pipeline design to interactive analytics, bioinformatics modeling, and scalable ML workflows.

🔧 Core Skills Demonstrated

  • Data Engineering: data cleaning pipelines, schema design, reshaping (LONG/WIDE), reproducible workflows
  • AI / ML Engineering: feature selection, logistic regression, model evaluation, Spark MLlib
  • Bioinformatics: DNA/RNA sequence analysis, recurrence modeling, microbiome analytics
  • Visualization & Analytics: R Shiny dashboards, geospatial mapping, correlation analysis
  • Tools & Languages: Python, R, PySpark, Apache Spark, pandas, Shiny, Leaflet

๐Ÿ“ Project Overview

๐ŸŒฌ๏ธ๐Ÿฆ  Air ร— Microbiome Dashboard Map

๐Ÿ“‚ air_microbiome_dashboard_map/

An interactive R Shiny application for exploring relationships between air pollution exposure and nasal microbiome composition across time and geography.

What this project demonstrates

  • End-to-end analytical dashboard design
  • Geospatial visualization with Leaflet
  • Correlation analysis across biological and environmental domains

Key capabilities

  • Genus-level and “All genera” microbiome exploration
  • Interactive pollution × microbiome correlation plots
  • Pearson, Spearman, and Kendall correlations with statistical readouts
  • Heatmaps for pollutant and microbiome intensity
  • Dynamic filtering by year, genus, pollutant, and metric
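
The correlation readouts listed above can be illustrated outside the app. The following is a minimal Python sketch, not the dashboard's R code, of Pearson and Spearman coefficients using only the standard library. The function names and the sample values are hypothetical and chosen purely for illustration; Kendall's tau is left out to keep the sketch short.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: yearly mean PM2.5 and relative abundance of one genus.
pm25 = [8.1, 12.4, 15.0, 9.7, 20.3]
abundance = [0.12, 0.18, 0.25, 0.15, 0.31]
print("Pearson:", round(pearson(pm25, abundance), 3))
print("Spearman:", round(spearman(pm25, abundance), 3))
```

Spearman is computed here as Pearson on averaged ranks, which handles ties without a separate formula.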

🔗 Live App: https://cierrab2319.shinyapps.io/air-micro-explorer/

Skills grown

  • Translating multi-omics questions into interactive analytics
  • Combining environmental health data with microbiome datasets
  • Statistical reasoning and visual storytelling

๐Ÿงน๐ŸŒซ๏ธ Cleaning & Formatting EPA Air Quality Data

๐Ÿ“‚ cleaning_air_data/

A Python-based data engineering toolkit for cleaning, filtering, and reshaping EPA AQS air quality datasets (PMโ‚‚.โ‚…, PMโ‚โ‚€, NOโ‚‚).

Included scripts

  • pollution_data_cleaner_gui.py
    • GUI-driven cleaning tool for raw EPA CSVs
    • Adds sample IDs and pollutant labels
    • Supports geographic filtering (state, city, county, CBSA, coordinates)
    • Outputs clean, analysis-ready CSVs
  • format_by_location.py
    • Reshapes cleaned data into LONG (tidy) or WIDE (dashboard-ready) formats
    • Groups by state, city, site, CBSA, or custom keys
    • Exports to CSV or Excel (multi-sheet)
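
To make the LONG versus WIDE distinction concrete, here is a small standard-library sketch of the reshaping idea. It is not the code in format_by_location.py: the function names and the column names (site, date, pollutant, value) are assumptions made for this example, and the measurements are invented.

```python
from collections import defaultdict


def long_to_wide(rows, index_keys=("site", "date"),
                 column_key="pollutant", value_key="value"):
    """Spread one row per measurement into one row per index key.

    If the same (index, pollutant) pair appears twice, the last value wins.
    """
    grouped = defaultdict(dict)
    for row in rows:
        key = tuple(row[k] for k in index_keys)
        grouped[key][row[column_key]] = row[value_key]
    return [{**dict(zip(index_keys, key)), **values}
            for key, values in grouped.items()]


def wide_to_long(rows, index_keys=("site", "date"),
                 column_key="pollutant", value_key="value"):
    """Melt every non-index column back into one row per measurement."""
    long_rows = []
    for row in rows:
        index_part = {k: row[k] for k in index_keys}
        for name, value in row.items():
            if name in index_keys:
                continue
            long_rows.append({**index_part, column_key: name, value_key: value})
    return long_rows


# Hypothetical LONG (tidy) records: one row per site, date, and pollutant.
long_rows = [
    {"site": "A", "date": "2020-01-01", "pollutant": "PM2.5", "value": 8.1},
    {"site": "A", "date": "2020-01-01", "pollutant": "NO2", "value": 12.0},
    {"site": "B", "date": "2020-01-01", "pollutant": "PM2.5", "value": 5.4},
]
wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

The LONG form suits grouped statistics and plotting libraries, while the WIDE form puts each pollutant in its own column, which is usually easier to feed into a dashboard table.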

Skills grown

  • Real-world data preprocessing and validation
  • Building user-friendly data tools (GUI + CLI)
  • Designing reusable data pipelines for analytics and dashboards

🩺📊 Diabetes Prediction with Apache Spark

📂 diabetes_prediction/

A machine learning project using PySpark and Spark MLlib to predict diabetes outcomes using logistic regression.

What this project demonstrates

  • Distributed ML workflows using Apache Spark
  • Feature selection and preprocessing at scale
  • Model training, evaluation, and persistence

Key components

  • Data ingestion and cleaning in Spark
  • Logistic Regression model using Spark MLlib
  • Model performance evaluation
  • Save/load trained models

Skills grown

  • Scalable ML engineering with Spark
  • Transitioning from pandas-based ML to distributed systems
  • Understanding production-oriented ML workflows

🧬 Rosalind Bioinformatics Practice

📂 rosalind/

A collection of foundational bioinformatics algorithms implemented in Python using problems from Rosalind.

Completed exercises

  • count_nucleotides.py: DNA base frequency counting
  • transcribed_dna.py: DNA → RNA transcription
  • complement_dna.py: reverse complement generation
  • recurrence_rabbit.py: population growth via recurrence relations
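
For readers unfamiliar with these exercises, the four tasks can be sketched in a few lines of Python. This is an illustrative sketch rather than the contents of the scripts above; the function names are hypothetical, and the rabbit recurrence assumes the usual Rosalind setup, where each mature pair produces a fixed number of new pairs per month.

```python
from collections import Counter

# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_nucleotides(dna):
    """Return the counts of A, C, G, and T, in that order."""
    counts = Counter(dna.upper())
    return counts["A"], counts["C"], counts["G"], counts["T"]


def transcribe(dna):
    """Transcribe a DNA coding strand to RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def rabbit_pairs(months, litter_size):
    """Rabbit pairs after `months`, using F(n) = F(n-1) + litter_size * F(n-2)."""
    if months < 1:
        raise ValueError("months must be at least 1")
    previous, current = 1, 1  # F(1) and F(2)
    for _ in range(months - 2):
        previous, current = current, current + litter_size * previous
    return current


print(count_nucleotides("AAAACCCGGT"))   # (4, 3, 2, 1)
print(transcribe("GATTACA"))             # GAUUACA
print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
print(rabbit_pairs(5, 3))                # 19
```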

Skills grown

  • Algorithmic thinking in computational biology
  • DNA/RNA sequence manipulation
  • Translating biological rules into clean, testable code

🚀 How These Projects Fit Together

These projects reflect a progressive skill trajectory:

  • Data engineering foundations → cleaning, reshaping, and validating environmental datasets
  • Analytics & visualization → turning complex data into interpretable dashboards
  • Bioinformatics fundamentals → sequence analysis and biological modeling
  • AI/ML engineering → scalable machine learning with Spark

Together, they support my growth toward roles in:

  • AI / ML Engineering
  • Data Engineering
  • Bioinformatics & Computational Biology
  • Environmental & Public Health Analytics

📌 Next Steps (In Progress / Planned)

  • Integrating cleaned air quality data directly into microbiome dashboards
  • Expanding microbiome analysis to higher taxonomic levels
  • Applying ML models to environmental–biological interaction data
  • Building reproducible pipelines for multi-omics analytics

โญ If youโ€™re reviewing this repository: thank you for taking the time โ€” each folder represents an intentional step in building production-ready, interdisciplinary technical skills.