diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png b/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png
new file mode 100644
index 0000000..6c9b7c2
Binary files /dev/null and b/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png differ
diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.qmd b/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.qmd
new file mode 100644
index 0000000..d709558
--- /dev/null
+++ b/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.qmd
@@ -0,0 +1,236 @@
+---
+title: "GSoC '25 Phase 1: Enabling OMOP CDM Tables and Preprocessing in HealthBase.jl"
+description: "A summary of my project for Google Summer of Code - 2025 (Phase 1)"
+author: "Kosuri Lakshmi Indu"
+date: "7/12/2025"
+toc: true
+engine: julia
+image: false
+categories:
+  - gsoc
+  - omop cdm
+  - julia health
+  - healthbase
+  - tables
+  - preprocessing
+---
+
+# Introduction 👋
+
+Hi everyone! I'm Kosuri Lakshmi Indu, a third-year undergraduate student majoring in Computer Science and a GSoC 2025 contributor. Over the past few months, I’ve had the opportunity to work with the JuliaHealth community, where I got a chance to learn, contribute, and get involved in projects focused on improving how we work with real-world health data.
+
+As part of this project, I contributed to HealthBase.jl, a Julia package within the JuliaHealth ecosystem. During Phase 1, I focused on implementing two key features: adding seamless `Tables.jl` integration for working with OMOP CDM tables and developing a set of reusable preprocessing functions to make data modeling easier and more consistent.
+
+In this blog post, I'll walk you through everything we accomplished in Phase 1, from the motivation behind these additions to examples of how they work in practice.
+
+1. You can find my [**GSoC'25 Project Link**](https://summerofcode.withgoogle.com/programs/2025/projects/xpSEgu5b)
+
+2. 
You can check out the project repository here: [**HealthBase.jl**](https://github.com/JuliaHealth/HealthBase.jl)

3. Official Documentation of HealthBase.jl: [**Documentation**](https://juliahealth.org/HealthBase.jl/dev/)

4. If you want, you can connect with me on [**LinkedIn**](https://www.linkedin.com/in/kosuri-indu/) and [**GitHub**](https://github.com/kosuri-indu/).

# Background

## What is HealthBase.jl?

**HealthBase.jl** is a lightweight foundational package that serves as a shared namespace across the JuliaHealth ecosystem. It's designed to provide consistent interfaces, common utilities, and templates for building health data workflows in Julia, all while staying minimal and extensible.

As part of this project, we expanded HealthBase.jl to support working directly with OMOP CDM data. While the `HealthTable` interface is defined in HealthBase.jl as a generic, reusable interface for working with health data tables, the actual implementation for the OMOP Common Data Model lives in the OMOPCommonDataModel.jl package.

This design allows HealthBase.jl to provide a flexible interface that can be extended to any health data standard - not just OMOP CDM. If someone wants `HealthTable` to work with their own package, they can take the `HealthTable` interface from HealthBase.jl and define the necessary extensions accordingly.

# Project Description

The project introduced a new type called **`HealthTable`**, which wraps OMOP CDM tables with built-in schema validation and metadata attachment. We also made these tables compatible with the wider Julia data ecosystem by implementing the **`Tables.jl`** interface, making it easier to plug OMOP CDM datasets into standard tools for modeling and analysis. To support real-world workflows, we added a suite of preprocessing utilities tailored to observational health data, including one-hot encoding, vocabulary compression, and concept mapping.
All of these features aim to make it easier for researchers to load, validate, and prepare OMOP CDM datasets in a reproducible and scalable way - while keeping the code clean and modular. + +# Project Goals + +The main goal of this GSoC project was to improve how researchers interact with OMOP CDM data in Julia by extending and strengthening the capabilities of **HealthBase.jl**. This involved building a structured, schema-aware interface for working with health data and providing built-in tools for preprocessing. Specifically, we focused on: + +1. **Introduce a schema-aware table interface for OMOP CDM**: Develop a new type of structure that wraps OMOP CDM tables in a consistent, validated format. This interface should use `Tables.jl` under the hood and provide column-level metadata, schema enforcement, and compatibility with downstream tabular workflows. + +2. **Implement reusable preprocessing utilities for health data workflows**: Add essential preprocessing functions like: + - `one_hot_encode` for converting categorical columns into binary indicators, + - `apply_vocabulary_compression` for grouping rare categories under a label, + - `map_concepts` and `map_concepts!` for translating concept IDs to readable names via DuckDB or user-defined mappings. + +3. **Integrate HealthBase.jl with the JuliaHealth ecosystem**: Ensure `HealthBase.jl` plays a foundational role in JuliaHealth by interoperating with other packages like `OMOPCommonDataModel.jl`, `Tables.jl`, `DuckDB.jl`, `DataFrames.jl` etc. This makes it easier to build reproducible, modular workflows within the Julia ecosystem. + +These goals lay the foundation for future JuliaHealth tooling, making OMOP CDM data easier to validate, preprocess, and use in reproducible health data science workflows. + +# Tasks + +## 1. 
Core `HealthTable` Interface with `Tables.jl` Connection

A major part of this project was introducing a new type called `HealthTable`, which makes it easier to work with OMOP CDM tables in Julia in a reliable and standardized way.

### What is `HealthTable`?

`HealthTable` is a wrapper around a Julia `DataFrame` that:

- Validates your OMOP CDM data against the official OMOP CDM schema
- Connects your data to Julia's `Tables.jl` interface, so you can use it with any table-compatible package (like `DataFrames.jl`)
- Attaches metadata about each column (e.g., what kind of concept it represents)
- Gives detailed error messages if your data doesn't follow the expected OMOP CDM format
- Uses `PrettyTables.jl` under the hood to display the table in a clean and readable format in the REPL or Jupyter notebooks

### How is it defined?

The type is defined using Julia's `@kwdef` macro to allow keyword-based construction:

```julia
@kwdef struct HealthTable{T}
    source::T
end

Tables.schema(ht)                # View schema (column names and types)
Tables.rows(ht)                  # Iterate over rows
Tables.columns(ht)               # Access columns as named tuples
Tables.materializer(typeof(ht))  # Used to materialize tables
```

Here, `source` is the original validated data (usually a `DataFrame`), and all logic is built around enforcing the schema and providing utilities on top of this source. Once wrapped in a `HealthTable`, you can interact with it using the complete `Tables.jl` interface. This makes `HealthTable` compatible with the entire Julia tables ecosystem, meaning users can directly use OMOP CDM data with existing tooling, without needing custom adapters.
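To make that payoff concrete, here is a small, hypothetical usage sketch of consuming a `HealthTable` through the generic `Tables.jl` API (it assumes `ht` wraps the example table built in the next section; exact outputs depend on the actual implementation):

```julia
using Tables, DataFrames

# Materialize into a DataFrame -- works for any Tables.jl-compatible source:
df_copy = DataFrame(ht)

# Or iterate rows generically, without knowing the concrete table type:
for row in Tables.rows(ht)
    println(row.person_id, " was born in ", row.year_of_birth)
end

# Inspect column names and element types:
Tables.schema(ht)
```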
+ +### Creating a `HealthTable` + +```julia +using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB +using HealthBase + +df = DataFrame( + person_id = [1, 2, 3, 4], + gender_concept_id = [8507, 8532, 8507, 8532], + year_of_birth = [1990, 1985, 1992, 1980], + race_concept_id = [8527, 8516, 8527, 8527], +) + +ht = HealthTable(df; omop_cdm_version="v5.4.1") +``` +
![HealthTable Extension Loading](./extension_loading.png)

![HealthTable](./healthtable.png)
### Accessing the Source

Each `HealthTable` instance stores the original validated data in `ht.source`:

```julia
typeof(ht.source)                            # DataFrame
metadata(ht.source, "omop_cdm_version")      # "v5.4.1"
colmetadata(ht.source, :gender_concept_id)   # Per-column metadata
```

### Handling Invalid Tables

If the input table does not conform to the expected OMOP CDM schema - for example, if a column has the wrong data type - the `HealthTable` constructor throws a detailed, user-friendly error. This makes it easier to catch mistakes early and ensures that only well-structured, validated data enters your pipeline. You can optionally relax this strict enforcement by setting `disable_type_enforcement = true`, in which case the constructor emits warnings instead of throwing errors.

## 2. Preprocessing Functions

Once your data is wrapped in a `HealthTable`, you can apply several built-in preprocessing functions to prepare it for modeling. These functions help transform OMOP CDM data into formats that are more interpretable, compact, and ML-friendly.

### Concept Mapping

Replaces concept IDs (e.g., 8507) with readable names (e.g., "Male") using an OMOP CDM vocabulary table in DuckDB. If no schema is provided, the function uses the default `main` schema of the DuckDB connection. It supports both single and multiple columns, and allows custom naming for output columns. This greatly enhances data interpretability and allows models to work with semantically meaningful features.

```julia
conn = DBInterface.connect(DuckDB.DB, "path_to_file.duckdb")

# Single column (adds gender_concept_id_mapped)
ht_mapped = map_concepts(ht, :gender_concept_id, conn; schema = "schema_name")

# Multiple columns, drop original
ht_mapped2 = map_concepts(ht, [:gender_concept_id, :race_concept_id], conn;
    new_cols = ["gender", "race"], drop_original = true)

# In-place variant
map_concepts!(ht, :gender_concept_id, conn; schema = "schema_name")
```
![Map Concepts](./map_concepts.png)
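When no vocabulary database is at hand, the same mapping can be done by hand with a plain dictionary. A minimal base-Julia sketch (the ID-to-name pairs are the illustrative OMOP gender concepts used throughout this post):

```julia
using DataFrames

# Illustrative concept-ID-to-name pairs:
gender_map = Dict(8507 => "Male", 8532 => "Female")

df = DataFrame(gender_concept_id = [8507, 8532, 8507])

# Map each ID through the dictionary; `get` supplies a fallback
# label for any ID missing from the map:
df.gender = map(id -> get(gender_map, id, "Unknown"), df.gender_concept_id)
```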
You can also perform concept mapping manually, without relying on a database connection. This is especially useful for smaller datasets or when you already know the categorical values. For instance, by defining a custom dictionary such as `Dict(8507 => "Male", 8532 => "Female")`, you can create a new column by mapping over the existing concept ID column with functions like `Base.map` or `map!`. This manual approach allows for flexible labeling and is particularly helpful during exploratory data analysis or early-stage prototyping, when a full vocabulary table isn't available or necessary.

### Vocabulary Compression

Groups rare categorical values under an `"Other"` (or user-specified) label to simplify high-cardinality columns. This is useful for reducing sparsity in features with many rare categories. The function calculates the frequency of each level and retains only those above a threshold, replacing the rest. It keeps model inputs interpretable and avoids overfitting on underrepresented categories.

```julia
ht_compressed = apply_vocabulary_compression(ht_mapped; cols = [:race_concept_id],
    min_freq = 2, other_label = "Other")
```
![Vocabulary Compression](./vocab_compression.png)
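The frequency-threshold idea behind `apply_vocabulary_compression` can be sketched in a few lines of base Julia; this is just the concept, not the package's actual implementation:

```julia
# Keep levels whose frequency meets `min_freq`; fold the rest into `other_label`.
function compress_levels(xs; min_freq = 2, other_label = "Other")
    freqs = Dict{eltype(xs), Int}()
    for x in xs
        freqs[x] = get(freqs, x, 0) + 1
    end
    return [freqs[x] >= min_freq ? x : other_label for x in xs]
end

compress_levels(["A", "A", "B", "C", "A", "B"])  # ["A", "A", "B", "Other", "A", "B"]
```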
### One-Hot Encoding

Converts a categorical column into binary columns. This prepares your features for ML models that can't handle raw categorical values. The function works on both `HealthTable` and `DataFrame` objects and lets you specify whether to drop the original column. It's especially useful when working with encoded health conditions, demographics, or other categorical features.

```julia
ht_ohe = one_hot_encode(ht_compressed; cols = [:gender_concept_id, :race_concept_id])
```
![One Hot Encoding](./one_hot_encoding.png)
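Under the hood, one-hot encoding is just one Boolean indicator column per level of the source column. A hypothetical base-Julia sketch (the column naming is illustrative, not necessarily what `one_hot_encode` produces):

```julia
using DataFrames

df = DataFrame(gender = ["Male", "Female", "Male"])

# Add one Boolean column per distinct level of the source column:
for lvl in unique(df.gender)
    df[!, "gender_" * lvl] = df.gender .== lvl
end

names(df)  # ["gender", "gender_Male", "gender_Female"]
```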
Each of these preprocessing functions is designed to be both composable and schema-aware. This means you can mix and match transformations like one-hot encoding, vocabulary compression, and concept mapping depending on the needs of your workflow. You have the flexibility to work directly with a `HealthTable` or switch to a regular `DataFrame` when needed. This modular design keeps your data processing steps reproducible and consistent across different JuliaHealth tools, making health data analysis more efficient, reliable, and seamlessly integrated with Julia's modern data ecosystem.

# Contribution Beyond Coding

## 1. Organizing Meetings and Communication

Throughout the project, I had regular weekly meetings with my mentor, Jacob Zelko, where we discussed progress, clarified doubts, and planned upcoming tasks. These sessions helped me better understand design decisions and refine my implementations with expert feedback. In addition to our meetings, I actively communicated via Zulip and Slack, where we discussed code behavior, errors, ideas, and other project-related decisions in detail. This consistent back-and-forth ensured a clear direction and rapid iteration.

## 2. Engaging with the JuliaHealth Ecosystem

Beyond contributing code to HealthBase.jl, I also engaged with the broader JuliaHealth ecosystem. After discussions with my mentor, I opened and contributed to issues in related JuliaHealth repositories, identifying potential bugs and suggesting enhancements. These contributions aimed to improve the coherence and usability of JuliaHealth tools as a whole, not just within my assigned project.

# Conclusion and Future Development

Contributing to HealthBase.jl during Phase 1 of GSoC has been a rewarding and insightful journey. It gave me the opportunity to dive deep into the structure of OMOP CDM, explore Julia's composable interfaces like `Tables.jl`, and build features that directly support observational health workflows.
Learning to design with extensibility in mind, especially when working with healthcare data, has shaped how I now approach open-source problems.

Looking ahead, here are some of the directions I’d like to explore to further strengthen `HealthBase.jl` and its integration within the JuliaHealth ecosystem:

- **Refine schema handling**
  Improve how internal schema checks reflect the structure of the underlying data source. This includes better alignment with OMOP CDM specifications and improved flexibility when dealing with edge cases or schema variations.

- **Strengthen Tables.jl integration**
  Enhance the robustness of how `HealthTable` interacts with the `Tables.jl` interface - ensuring better compatibility and reducing any overhead when working with downstream packages like `DataFrames.jl`.

- **Add new preprocessing functions**
  Extend the current toolkit by implementing more real-world utilities such as missing value imputation and cohort filtering.

- **Address related issues in the ecosystem**
  Collaborate with maintainers to help resolve open issues related to the project in `OMOPCommonDataModel.jl`, especially [Issue #41](https://github.com/JuliaHealth/OMOPCommonDataModel.jl/issues/41) and [Issue #40](https://github.com/JuliaHealth/OMOPCommonDataModel.jl/issues/40).

Overall, this phase has not only improved the package but also helped me grow in terms of design thinking, working with abstractions, and contributing to a larger ecosystem. I'm looking forward to what comes next and to making HealthBase even more useful for the JuliaHealth community and beyond.

# Acknowledgements

A big thank you to **Jacob S. Zelko** for being such a kind and thoughtful mentor throughout this project. His clear guidance, encouragement, and helpful feedback made a huge difference at every step. I'm also really thankful to the **JuliaHealth** community for creating such a welcoming and inspiring space to learn, build, and grow.
It's been a joy to be part of it. + +[Jacob S. Zelko](https://jacobzelko.com): aka, [TheCedarPrince](https://github.com/TheCedarPrince) + +_Note: This blog post was drafted with the assistance of LLM technologies to support grammar, clarity and structure._ diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png b/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png new file mode 100644 index 0000000..173a1f3 Binary files /dev/null and b/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png differ diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png b/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png new file mode 100644 index 0000000..f791c63 Binary files /dev/null and b/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png differ diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png b/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png new file mode 100644 index 0000000..7a5ddc2 Binary files /dev/null and b/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png differ diff --git a/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png b/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png new file mode 100644 index 0000000..3333aa9 Binary files /dev/null and b/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png differ diff --git a/_freeze/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1/execute-results/html.json b/_freeze/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1/execute-results/html.json new file mode 100644 index 0000000..0094693 --- /dev/null +++ b/_freeze/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1/execute-results/html.json @@ -0,0 +1,12 @@ +{ + "hash": "d088621135fc827281bbeab8b807fab9", + "result": { + "engine": "julia", + "markdown": "---\ntitle: \"GSoC '25 Phase 1: Enabling OMOP CDM Tables and Preprocessing in HealthBase.jl\"\ndescription: \"A summary of my project for Google Summer of Code - 2025 (Phase 1)\"\nauthor: \"Kosuri Lakshmi 
Indu\"\ndate: \"7/12/2025\"\ntoc: true\nengine: julia\nimage: false\ncategories:\n - gsoc\n - omop cdm\n - julia health\n - healthbase\n - tables\n - preprocessing\n---\n\n\n\n# Introduction 👋\n\nHi everyone! I'm Kosuri Lakshmi Indu, a third-year undergraduate student majoring in Computer Science and a GSoC 2025 contributor. Over the past few months, I’ve had the opportunity to work with the JuliaHealth community, where I got a chance to learn, contribute, and get involved in projects focused on improving how we work with real-world health data.\n\nAs part of this project, I contributed to HealthBase.jl, a Julia package within the JuliaHealth ecosystem. During Phase 1, I focused on implementing both of these key features - adding seamless `Tables.jl` integration for working with OMOP CDM tables and developing a set of reusable preprocessing functions to make data modeling easier and more consistent.\n\nIn this blog post, I'll walk you through everything we accomplished in Phase 1 from the motivation behind these additions, to examples of how they work in practice.\n\n1. You can find my [**GSoC'25 Project Link**](https://summerofcode.withgoogle.com/programs/2025/projects/xpSEgu5b)\n\n2. You can check out the project repository here: [**HealthBase.jl**](https://github.com/JuliaHealth/HealthBase.jl)\n\n3. Official Documentation of HealthBase.jl: [**Documentation**](https://juliahealth.org/HealthBase.jl/dev/)\n\n4. If you want, you can connect with me on [**LinkedIn**](https://www.linkedin.com/in/kosuri-indu/) and [**GitHub**](https://github.com/kosuri-indu/).\n\n# Background\n\n## What is HealthBase.jl?\n\n**HealthBase.jl** is a lightweight foundational package that serves as a shared namespace across the JuliaHealth ecosystem. 
It's designed to provide consistent interfaces, common utilities, and templates for building health data workflows in Julia, all while staying minimal and extensible.\n\nAs part of this project, we expanded HealthBase.jl to support working directly with OMOP CDM data. While the HealthTable interface is defined in HealthBase.jl as a generic, reusable interface for working with health data tables, the actual implementation for the OMOP Common Data Model lives in the OMOPCommonDataModel.jl package. \n\nThis design allows HealthBase.jl to provide a flexible interface that can be extended for any health data standard - not just OMOP CDM. So if someone wants to make HealthTable work for their packages, they can use the HealthTable interface from HealthBase.jl and define the necessary extensions accordingly.\n\n# Project Description\n\nThis included introducing a new type called **`HealthTable`**, which wraps OMOP CDM tables with built-in schema validation and metadata attachment. We also made these tables compatible with the wider Julia data ecosystem by implementing the **`Tables.jl`** interface, making it easier to plug OMOP CDM datasets into standard tools for modeling and analysis. To support real-world workflows, we added a suite of preprocessing utilities tailored to observational health data, including one-hot encoding, vocabulary compression, and concept mapping. All of these features aim to make it easier for researchers to load, validate, and prepare OMOP CDM datasets in a reproducible and scalable way - while keeping the code clean and modular.\n\n# Project Goals\n\nThe main goal of this GSoC project was to improve how researchers interact with OMOP CDM data in Julia by extending and strengthening the capabilities of **HealthBase.jl**. This involved building a structured, schema-aware interface for working with health data and providing built-in tools for preprocessing. Specifically, we focused on:\n\n1. 
**Introduce a schema-aware table interface for OMOP CDM**: Develop a new type of structure that wraps OMOP CDM tables in a consistent, validated format. This interface should use `Tables.jl` under the hood and provide column-level metadata, schema enforcement, and compatibility with downstream tabular workflows.\n\n2. **Implement reusable preprocessing utilities for health data workflows**: Add essential preprocessing functions like:\n - `one_hot_encode` for converting categorical columns into binary indicators,\n - `apply_vocabulary_compression` for grouping rare categories under a label,\n - `map_concepts` and `map_concepts!` for translating concept IDs to readable names via DuckDB or user-defined mappings.\n\n3. **Integrate HealthBase.jl with the JuliaHealth ecosystem**: Ensure `HealthBase.jl` plays a foundational role in JuliaHealth by interoperating with other packages like `OMOPCommonDataModel.jl`, `Tables.jl`, `DuckDB.jl`, `DataFrames.jl` etc. This makes it easier to build reproducible, modular workflows within the Julia ecosystem.\n\nThese goals lay the foundation for future JuliaHealth tooling, making OMOP CDM data easier to validate, preprocess, and use in reproducible health data science workflows.\n\n# Tasks\n\n## 1. 
Core `HealthTable` Interface with `Tables.jl` Connection\n\nA major part of this project was introducing a new type called `HealthTable`, which makes it easier to work with OMOP CDM tables in Julia in a reliable and standardized way.\n\n### What is `HealthTable`?\n\n`HealthTable` is a wrapper around a Julia `DataFrame` that:\n\n- Validates your OMOP CDM data against the official OMOP CDM schema\n- Connects your data to Julia's `Tables.jl` interface, so you can use it with any table-compatible package (like `DataFrames.jl` etc.)\n- Attaches metadata about each column (e.g., what kind of concept it represents)\n- Gives detailed error messages if your data doesn't follow the expected OMOP CDM format\n- Uses `PrettyTables.jl` under the hood to display the table in a clean and readable format in the REPL or Jupyter notebooks\n\n### How is it defined?\n\nThe type is defined using Julia's `@kwdef` macro to allow keyword-based construction:\n\n```julia\n@kwdef struct HealthTable{T}\n source::T\nend\n\nTables.schema(ht) # View schema (column names and types)\nTables.rows(ht) # Iterate over rows\nTables.columns(ht) # Access columns as named tuples\nTables.materializer(typeof(ht)) # Used to materialize tables\n```\nHere, source is the original validated data (usually a DataFrame), and all logic is built around enforcing the schema and providing utilities on top of this source. Once wrapped in a HealthTable, you can interact with it using the complete Tables.jl interface. 
This makes HealthTable compatible with the entire Julia Tables ecosystem, meaning users can directly use OMOP CDM data with existing tooling, without needing custom adapters.\n\n### Creating a `HealthTable`\n\n```julia\nusing DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB\nusing HealthBase\n\ndf = DataFrame(\n person_id = [1, 2, 3, 4],\n gender_concept_id = [8507, 8532, 8507, 8532],\n year_of_birth = [1990, 1985, 1992, 1980],\n race_concept_id = [8527, 8516, 8527, 8527],\n)\n\nht = HealthTable(df; omop_cdm_version=\"v5.4.1\")\n```\n
\n![](./extension_loading.png)\n
\n
\n HealthTable Extension Loading\n
\n
\n
\n ![](./healthtable.png)\n HealthTable\n
\n\n### Accessing the Source\n\nEach `HealthTable` instance stores the original validated data in `ht.source`:\n\n```julia\ntypeof(ht.source) # DataFrame\nmetadata(ht.source, \"omop_cdm_version\") # \"v5.4.1\"\ncolmetadata(ht.source, :gender_concept_id) # Per-column metadata\n```\n\n### Handling Invalid Tables\n\nIf the input table does not conform to the expected OMOP CDM schema - for example, if a column has the wrong data type is present - the HealthTable constructor throws a detailed and user-friendly error. This makes it easier to catch mistakes early and ensures that only well-structured, validated data enters your pipeline. You can optionally relax this strict enforcement by setting disable_type_enforcement = true, in which case the constructor will emit warnings instead of throwing errors.\n\n## 2. Preprocessing Functions\n\nOnce your data is wrapped in a `HealthTable`, you can apply several built-in preprocessing functions to prepare it for modeling. These functions help transform OMOP CDM data into formats that are more interpretable, compact, and ML-friendly.\n\n### Concept Mapping\n\nReplaces concept IDs (e.g., 8507) with readable names (e.g., \"Male\") using an OMOP CDM vocabulary table in DuckDB. If no schema is provided, the function will use the default main schema in the DuckDB connection. It supports both single and multiple columns, and allows custom naming for output columns. 
This greatly enhances data interpretability and allows models to work with semantically meaningful features.\n\n```julia\nconn = DBInterface.connect(DuckDB.DB, \"path_to_file.duckdb\")\n\n# Single column (adds gender_concept_id_mapped)\nht_mapped = map_concepts(ht, :gender_concept_id, conn; schema = \"schema_name\")\n\n# Multiple columns, drop original\nht_mapped2 = map_concepts(ht, [:gender_concept_id, :race_concept_id], conn;\n new_cols=[\"gender\", \"race\"], drop_original=true)\n\n# In-place variant\nmap_concepts!(ht, :gender_concept_id, conn; schema = \"schema_name\")\n```\n\n
\n
\n ![](./map_concepts.png)\n Map Concepts\n
\n
\n\nYou can also perform concept mapping manually without relying on a database connection. This is especially useful for smaller datasets or when you already know the categorical values. For instance, by defining a custom dictionary such as Dict(8507 => \"Male\", 8532 => \"Female\"), you can create a new column by mapping over the existing concept ID column using functions like Base.map or map!. This manual approach allows for flexible labeling and is particularly helpful during exploratory data analysis or early-stage prototyping when a full vocabulary table isn't available or necessary.\n\n### Vocabulary Compression\n\nGroups rare categorical values into an `\"Other\"` or a user-specified label to simplify high-cardinality columns. This is useful for reducing sparsity in features with many rare categories. The function calculates the frequency of each level and retains only those above a threshold, replacing the rest. It ensures that model inputs remain interpretable and avoids overfitting on underrepresented categories.\n\n```julia\nht_compressed = apply_vocabulary_compression(ht_mapped; cols = [:race_concept_id],\n min_freq = 2, other_label = \"Other\")\n```\n
\n
\n ![](./vocab_compression.png)\n Vocabulary Compression\n
\n\n### One-Hot Encoding\n\nConverts a categorical column into binary columns. This prepares your features for ML models that can't handle raw categorical values. The function works on both HealthTable and DataFrame objects and allows you to specify whether to drop the original column. It's especially useful when working with encoded health conditions, demographics, or other categorical features.\n\n\n```julia\nht_ohe = one_hot_encode(ht_compressed; cols = [:gender_concept_id, :race_concept_id])\n```\n\n
\n
\n ![](./one_hot_encoding.png)\n One Hot Encoding\n
\n
\n\nEach of these preprocessing functions is designed to be both composable and schema-aware. This means you can mix and match transformations like one-hot encoding, vocabulary compression, and concept mapping depending on the needs of your workflow. You have the flexibility to work directly with a HealthTable or switch to a regular DataFrame when needed. This modular design ensures that your data processing steps are reproducible and consistent across different JuliaHealth tools, making health data analysis more efficient, reliable, and seamlessly integrated with Julia's modern data ecosystem.\n\n# Contribution Beyond Coding\n\n## 1. Organizing Meetings and Communication\n\nThroughout the project, I had regular weekly meetings with my mentor, Jacob Zelko, where we discussed progress, clarified doubts, and made plans for the upcoming tasks. These sessions helped me better understand design decisions and refine my implementations with expert feedback. In addition to our meetings, I actively communicated via Zulip and Slack, where we discussed code behavior, errors, ideas, and other project-related decisions in detail. This consistent back-and-forth ensured a clear direction and rapid iteration.\n\n## 2. Engaging with the JuliaHealth Ecosystem\n\nBeyond contributing code to HealthBase.jl, I also engaged with the broader JuliaHealth ecosystem. After discussing with my mentor, I opened and contributed to issues in related JuliaHealth repositories identifying potential bugs and suggesting enhancements. These contributions aimed to improve the coherence and usability of JuliaHealth tools as a whole, not just within my assigned project. \n\n# Conclusion and Future Development\n\nContributing to HealthBase.jl during Phase 1 of GSoC has been a rewarding and insightful journey. It gave me the opportunity to dive deep into the structure of OMOP CDM, explore Julia's composable interfaces like `Tables.jl`, and build features that directly support observational health workflows. 
Learning to design with extensibility in mind, especially when working with healthcare data has shaped how I now approach open-source problems.\n\nLooking ahead, here are some of the directions I’d like to explore to further strengthen `HealthBase.jl` and its integration within the JuliaHealth ecosystem:\n\n- **Refine schema handling** \n Improve how internal schema checks reflect the structure of the underlying data source. This includes better alignment with OMOP CDM specifications and improved flexibility when dealing with edge cases or schema variations.\n\n- **Strengthen Tables.jl integration** \n Enhance the robustness of how `HealthTable` interacts with the `Tables.jl` interface - ensuring better compatibility and reducing any overhead when working with downstream packages like `DataFrames.jl`.\n\n- **Add new preprocessing functions** \n Extend the current toolkit by implementing more real-world utilities such as missing value imputation, cohort filtering etc.\n\n- **Address related issues in the ecosystem** \n Collaborate with maintainers to help resolve open issues related to the project in `OMOPCommonDataModel.jl`, especially [Issue #41](https://github.com/JuliaHealth/OMOPCommonDataModel.jl/issues/41) and [Issue #40](https://github.com/JuliaHealth/OMOPCommonDataModel.jl/issues/40)\n\nOverall, this phase has not only improved the package but also helped me grow in terms of design thinking, working with abstractions, and contributing to a larger ecosystem. I'm looking forward to what comes next and to making HealthBase even more useful for the JuliaHealth community and beyond.\n\n\n# Acknowledgements \n\nA big thank you to **Jacob S. Zelko** for being such a kind and thoughtful mentor throughout this project. His clear guidance, encouragement, and helpful feedback made a huge difference at every step. I'm also really thankful to the **JuliaHealth** community for creating such a welcoming and inspiring space to learn, build, and grow. 
It's been a joy to be part of it.\n\n[Jacob S. Zelko](https://jacobzelko.com): aka, [TheCedarPrince](https://github.com/TheCedarPrince)\n\n_Note: This blog post was drafted with the assistance of LLM technologies to support grammar, clarity and structure._\n\n", + "supporting": [ + "gsoc-phase1_files" + ], + "filters": [], + "includes": {} + } +} \ No newline at end of file diff --git a/_freeze/JuliaHealthBlog/posts/indu-plp-part1/plp-part1/execute-results/html.json b/_freeze/JuliaHealthBlog/posts/indu-plp-part1/plp-part1/execute-results/html.json index a1d37b9..fd1611e 100644 --- a/_freeze/JuliaHealthBlog/posts/indu-plp-part1/plp-part1/execute-results/html.json +++ b/_freeze/JuliaHealthBlog/posts/indu-plp-part1/plp-part1/execute-results/html.json @@ -2,9 +2,9 @@ "hash": "beea0755e6da9931b4f2fe1647eb5893", "result": { "engine": "julia", - "markdown": "---\ntitle: \"PLP-Pipeline Series Part 1: From Research Question to Cohort Construction\"\ndescription: \"Kicking off the PLP-Pipeline blog series - how we define research questions and construct cohorts using OMOP CDM and Julia tools.\"\nauthor: \"Kosuri Lakshmi Indu\"\ndate: \"4/12/2025\"\nbibliography: ./references.bib\ncsl: ./../../ieee-with-url.csl\ntoc: true\nengine: julia\nimage: false\ncategories:\n - patient-level prediction\n - omop cdm\n - observational health\n---\n\n\n\n\n# Introduction 👋\n\nHi everyone! I’m **Kosuri Lakshmi Indu**, a third-year undergraduate student in Computer Science and an aspiring GSoC 2025 contributor. My interest in using data science for public health led me to the **JuliaHealth** community and, under the mentorship of Jacob S. Zelko, I began working on a project titled **PLP-Pipeline**. 
This project focuses on building modular, efficient tooling for Patient-Level Prediction (PLP) entirely in Julia, using the OMOP Common Data Model (OMOP CDM).\n\nIn this post, I’ll walk through the first part of a three-part blog series documenting my work on building a Patient-Level Prediction (PLP) pipeline in Julia. Each post focuses on a different stage of the pipeline:\n\n1. **From Research Question to Cohort Construction (this post)**\n\n2. From Raw Clinical Data to Predictive Models\n\n3. Lessons Learned, Key Challenges, and What Comes Next\n\nIn Part 1, we’ll start at the very beginning-formulating the research question, exploring the OMOP CDM, setting up the local database, and defining target and outcome cohorts using Julia tools. Whether you're a health researcher, a GSoC aspirant, or a Julia enthusiast, I hope this gives you a clear and accessible introduction to how observational health research can be made more composable, reproducible, and efficient using Julia.\n\nYou can find my [**PLP-Pipeline Project Link Here**](https://github.com/kosuri-indu/PLP-Pipeline)\n\n[**LinkedIn**](https://www.linkedin.com/in/kosuri-indu/) | [**GitHub**](https://github.com/kosuri-indu/)\n \n# Background\n\n## What is Observational Health?\n\nObservational health research examines real-world patient data such as electronic health records (EHRs), claims, and registries to understand health and disease outside of controlled trial environments. This type of research plays a vital role in informing decisions by clinicians, policymakers, and researchers, especially when addressing population-level health questions and disparities.\n\nA core aspect of observational health is the use of phenotype definitions, which describe a specific set of observable patient characteristics (e.g., diagnosis codes, symptoms, demographics, biomarkers) that define a population of interest. Creating accurate and reproducible phenotype definitions is essential for ensuring research validity. 
However, challenges such as missing data, demographic biases, and inconsistently recorded information can significantly impact the reliability of these definitions.\n\nTo support reproducible research at scale, communities like OHDSI (Observational Health Data Sciences and Informatics) have developed standards such as the OMOP Common Data Model (CDM) and workflows for developing computable phenotype definitions. \n\nIn our work, we utilize observational health data already structured through the OMOP Common Data Model (CDM). We construct patient cohorts based on existing phenotype definitions. These cohorts then serve as the basis for building patient-level prediction models, enabling us to explore and generate insights that can support data-driven clinical decision-making.\n\n## What Is the OMOP CDM?\n\nThe **Observational Medical Outcomes Partnership Common Data Model (OMOP CDM)** is a standardized framework for organizing and analyzing observational healthcare data. The OMOP CDM converts diverse sources of health data into a common format that supports large-scale, systematic analysis.\n\nThe OMOP CDM organizes data into a consistent set of relational tables like `condition_occurrence`, `drug_exposure`, `person`, `visit_occurrence` etc, using standardized vocabularies. These tables are interconnected, allowing for relational analysis across a patient's medical history.\n\nBy transforming diverse healthcare datasets into a common format, the OMOP CDM enables reproducibility, interoperability, and large-scale studies across institutions and populations.\n\n
\n
\n ![](./omopcdm.png)\n\n OMOP Common Data Model\n
\n\n## What is Patient-Level Prediction (PLP)?\n\n**Patient-Level Prediction (PLP)** is a data-driven approach that uses machine learning or statistical models to estimate the risk of specific clinical outcomes for individual patients, based on their historical healthcare data.\n\nThe key goal of PLP is to answer personalized clinical questions like:\n\n> *\"For patients who present with chest pain leading to a hospital visit, can we predict which of these patients will go on to experience a heart attack after their hospital visit?\"*\n\nPLP focuses on using observational patient data such as diagnoses, medications, procedures, and demographics - to predict individual-level risks of future health events. While it may sound similar to precision medicine, there's a key distinction: precision medicine aims to tailor treatment plans based on a patient’s genetics, environment, and lifestyle, whereas PLP is specifically about forecasting outcomes for individual patients using data-driven models. These predictions can support timely and personalized clinical decisions.\n\n## Why PLP in Julia?\n\nWhile established PLP workflows are well-supported in R through OHDSI's suite of packages, our work explores an alternative approach using Julia - a high-performance language that enables building efficient and reproducible pipelines from end to end.\n\nJulia offers several advantages that make it well-suited for observational health research:\n\n- **Composability**: Julia’s modular design supports reusable components, making PLP pipelines easier to maintain and extend.\n \n- **Speed**: With performance comparable to C, Julia efficiently handles large, complex healthcare datasets.\n\n- **Unified Ecosystem**: Tools like `OHDSICohortExpressions.jl`, `DataFrames.jl`, `MLJ.jl` etc. 
integrate seamlessly, enabling cohort definition, data transformation, and modeling within one consistent environment.\n\nAdditionally, Julia features a rich and growing ecosystem with many tools for scientific computing and data science, making it a strong alternative for modern health informatics workflows.\n\n
\n
\n ![](./julia.webp)\n\n Julia Equivalents\n
\n\n# Reference: Foundation from the OHDSI PLP Framework\n\nThroughout the development of this PLP pipeline, I referenced the methodology presented in the following paper:\n\n> Reps, J. M., Schuemie, M. J., Suchard, M. A., Ryan, P. B., Rijnbeek, P. R., & Madigan, D. (2018). Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. *Journal of the American Medical Informatics Association, 25(8), 969–975*. [https://doi.org/10.1093/jamia/ocy032](https://doi.org/10.1093/jamia/ocy032)\n\nThis paper laid the groundwork for my implementation and inspired several core components of the project — from data curation to model evaluation.\n\n## Methodologies from the Paper\n\n1. Standardized Framework for PLP - Outlines a consistent process for building patient-level prediction models across datasets and settings.\n\n2. Defining the Prediction Problem - Emphasizes clear definition of target, outcome, and time-at-risk for valid predictions.\n\n3. Cohort Definition and Data Extraction - Uses standardized OMOP CDM cohorts to ensure reproducibility and consistent data extraction.\n\n4. Feature Construction - Derives meaningful predictors from observational data like conditions and demographics.\n\n5. Model Training and Evaluation - Trains ML models and evaluates them using metrics like AUC and cross-validation.\n\nWe are adapting this framework for our PLP pipeline to ensure a consistent approach.\n\n# Research Question\n\nAs an example, here is one question we could potentially explore within this PLP pipeline:\n\n> **Among patients diagnosed with hypertension, who will go on to develop diabetes?**\n\nThe focus is on identifying patients with hypertension who may progress to diabetes based on their medical history and risk factors.\n\n## Cohort Construction\n\nCohorts are groups of patients defined by specific criteria that are relevant to the research question. 
For this task, two main cohorts need to be defined:\n\n- **Target Cohort**: This refers to the group of patients we want to make predictions for. In our case, it includes patients who have been diagnosed with hypertension. These patients serve as the starting point for our prediction timeline.\n\n- **Outcome Cohort**: This refers to the clinical event we aim to predict. In our case, it includes patients from the target cohort who are subsequently diagnosed with diabetes within a specified time window. This event marks the outcome that our model will learn to forecast.\n\nThese cohort definitions are central to structuring the data pipeline, as they form the foundation for downstream tasks like feature extraction, model training, and evaluation.\n\n# Defining Cohorts using OHDSICohortExpressions.jl\n\nIn the context of this research, I received a 20GB synthetic dataset that contains 1,115,000 fake patients (1,000,000 alive and 115,000 deceased), each with 3 years of medical history. This dataset was provided as a DuckDB database, a lightweight, high-performance analytical database that allows fast querying of large datasets directly from local files without the need for a server. For more details on how to use DuckDB with Julia, refer to the [DuckDB Julia Client Documentation](https://duckdb.org/docs/stable/clients/julia.html).\n\nFor cohort creation, I used OHDSI cohort definitions provided directly by my mentor in the form of two JSON files:\n\n- **Target cohort**: `Hypertension.json`\n- **Outcome cohort**: `Diabetes.json`\n\nTo execute them, I used the [OHDSICohortExpressions](https://github.com/MechanicalRabbit/OHDSICohortExpressions.jl) to convert the JSON definitions into SQL queries, which were then run against the DuckDB database to extract the relevant cohorts.\n\nHere’s the breakdown of the process:\n\n1. Reading the cohort definitions from JSON files.\n\n2. Connecting to the DuckDB database, which stores the synthetic patient data.\n\n3. 
Translating the cohort definitions into SQL using OHDSICohortExpressions.jl.\n\n4. Executing the SQL queries to create the target and outcome cohorts in the database.\n\n### Cohort Definition Code\n\nHere’s how we set up the DuckDB connection and define cohorts using OHDSI JSON definitions:\n\n**File:** `cohort_definition.jl`\n\n```julia\nimport DBInterface: connect, execute\nimport FunSQL: reflect, render\nimport OHDSICohortExpressions: translate\nusing DuckDB, DataFrames\n\n# We use DrWatson.jl to manage project-relative paths using `datadir(...)`\n# This ensures portable and reproducible file access within the project\n\n# Read the cohort definitions from JSON files (Hypertension and Diabetes definitions)\ntarget_json = read(datadir(\"exp_raw\", \"definitions\", \"Hypertension.json\"), String) # Target cohort (Hypertension)\noutcome_json = read(datadir(\"exp_raw\", \"definitions\", \"Diabetes.json\"), String) # Outcome cohort (Diabetes)\n\n# Establish a connection to the DuckDB database\nconnection = connect(DuckDB.DB, datadir(\"exp_raw\", \"synthea_1M_3YR.duckdb\"))\n\n# Function to process a cohort definition (translate the JSON to SQL and execute)\nfunction process_cohort(def_json, def_id, conn)\n catalog = reflect(conn; schema=\"dbt_synthea_dev\", dialect=:duckdb) # Reflect the database schema\n fun_sql = translate(def_json; cohort_definition_id=def_id) # Translate the JSON to SQL query\n sql = render(catalog, fun_sql) # Render the SQL query\n\n # Ensure the cohort table exists before inserting\n execute(conn, \"\"\"\n CREATE TABLE IF NOT EXISTS dbt_synthea_dev.cohort (\n cohort_definition_id INTEGER,\n subject_id INTEGER,\n cohort_start_date DATE,\n cohort_end_date DATE\n );\n \"\"\")\n\n # Execute the SQL query to insert cohort data into the database\n execute(conn, \"\"\"\n INSERT INTO dbt_synthea_dev.cohort\n SELECT * FROM ($sql) AS foo;\n \"\"\")\nend\n\n# Process the target and outcome cohorts\nprocess_cohort(target_json, 1, connection) # Define the 
target cohort (Hypertension)\nprocess_cohort(outcome_json, 2, connection) # Define the outcome cohort (Diabetes)\n\nclose!(connection)\n```\n\nThis code uses FunSQL.jl and OHDSICohortExpressions.jl to translate and render OHDSI ATLAS cohort definitions into executable SQL for DuckDB. The `translate` function from OHDSICohortExpressions.jl converts the JSON cohort definitions (Hypertension and Diabetes) into a FunSQL query representation. Then, `reflect` is used to introspect the DuckDB schema, and `render` from FunSQL.jl turns the abstract query into valid DuckDB SQL. The `process_cohort` function executes this SQL using `execute` to insert the resulting cohort data into the cohort table. This pipeline allows OHDSI cohort logic to be ported directly into a Julia workflow without relying on external OHDSI R tools.\n\n# Wrapping Up\n\nThis post covered the foundations of the PLP pipeline:\n\n- Explored observational health research, OMOP CDM, PLP, and Julia for large-scale clinical data analysis.\n\n- Formulated the research question: predicting diabetes progression in hypertension patients.\n\n- Explained OMOP CDM's role in standardizing clinical data.\n\n- Defined target and outcome cohorts for the study.\n\n- Used Julia to convert cohort definitions into executable SQL for DuckDB querying.\n\nIn the next post, I’ll walk through how we go from raw clinical data to predictive modeling, with Julia code examples that highlight feature extraction, data processing, and model training-bringing the full PLP pipeline to life.\n\n## Acknowledgements\n\nThanks to Jacob Zelko for his mentorship, clarity, and constant feedback throughout the project. I also thank the JuliaHealth community for building an ecosystem where composable science can thrive.\n\n[Jacob S. 
Zelko](https://jacobzelko.com): aka, [TheCedarPrince](https://github.com/TheCedarPrince)\n\n_Note: This blog post was drafted with the assistance of LLM technologies to support grammar, clarity and structure._\n\n", + "markdown": "---\ntitle: \"PLP-Pipeline Series Part 1: From Research Question to Cohort Construction\"\ndescription: \"Kicking off the PLP-Pipeline blog series - how we define research questions and construct cohorts using OMOP CDM and Julia tools.\"\nauthor: \"Kosuri Lakshmi Indu\"\ndate: \"4/12/2025\"\nbibliography: ./references.bib\ncsl: ./../../ieee-with-url.csl\ntoc: true\nengine: julia\nimage: false\ncategories:\n - patient-level prediction\n - omop cdm\n - observational health\n---\n\n\n\n# Introduction 👋\n\nHi everyone! I’m **Kosuri Lakshmi Indu**, a third-year undergraduate student in Computer Science and an aspiring GSoC 2025 contributor. My interest in using data science for public health led me to the **JuliaHealth** community and, under the mentorship of Jacob S. Zelko, I began working on a project titled **PLP-Pipeline**. This project focuses on building modular, efficient tooling for Patient-Level Prediction (PLP) entirely in Julia, using the OMOP Common Data Model (OMOP CDM).\n\nIn this post, I’ll walk through the first part of a three-part blog series documenting my work on building a Patient-Level Prediction (PLP) pipeline in Julia. Each post focuses on a different stage of the pipeline:\n\n1. **From Research Question to Cohort Construction (this post)**\n\n2. From Raw Clinical Data to Predictive Models\n\n3. Lessons Learned, Key Challenges, and What Comes Next\n\nIn Part 1, we’ll start at the very beginning-formulating the research question, exploring the OMOP CDM, setting up the local database, and defining target and outcome cohorts using Julia tools. 
Whether you're a health researcher, a GSoC aspirant, or a Julia enthusiast, I hope this gives you a clear and accessible introduction to how observational health research can be made more composable, reproducible, and efficient using Julia.\n\nYou can find my [**PLP-Pipeline Project Link Here**](https://github.com/kosuri-indu/PLP-Pipeline)\n\n[**LinkedIn**](https://www.linkedin.com/in/kosuri-indu/) | [**GitHub**](https://github.com/kosuri-indu/)\n \n# Background\n\n## What is Observational Health?\n\nObservational health research examines real-world patient data such as electronic health records (EHRs), claims, and registries to understand health and disease outside of controlled trial environments. This type of research plays a vital role in informing decisions by clinicians, policymakers, and researchers, especially when addressing population-level health questions and disparities.\n\nA core aspect of observational health is the use of phenotype definitions, which describe a specific set of observable patient characteristics (e.g., diagnosis codes, symptoms, demographics, biomarkers) that define a population of interest. Creating accurate and reproducible phenotype definitions is essential for ensuring research validity. However, challenges such as missing data, demographic biases, and inconsistently recorded information can significantly impact the reliability of these definitions.\n\nTo support reproducible research at scale, communities like OHDSI (Observational Health Data Sciences and Informatics) have developed standards such as the OMOP Common Data Model (CDM) and workflows for developing computable phenotype definitions. \n\nIn our work, we utilize observational health data already structured through the OMOP Common Data Model (CDM). We construct patient cohorts based on existing phenotype definitions. 
These cohorts then serve as the basis for building patient-level prediction models, enabling us to explore and generate insights that can support data-driven clinical decision-making.\n\n## What Is the OMOP CDM?\n\nThe **Observational Medical Outcomes Partnership Common Data Model (OMOP CDM)** is a standardized framework for organizing and analyzing observational healthcare data. The OMOP CDM converts diverse sources of health data into a common format that supports large-scale, systematic analysis.\n\nThe OMOP CDM organizes data into a consistent set of relational tables like `condition_occurrence`, `drug_exposure`, `person`, `visit_occurrence` etc, using standardized vocabularies. These tables are interconnected, allowing for relational analysis across a patient's medical history.\n\nBy transforming diverse healthcare datasets into a common format, the OMOP CDM enables reproducibility, interoperability, and large-scale studies across institutions and populations.\n\n
\n
\n ![](./omopcdm.png)\n\n OMOP Common Data Model\n
\n\n## What is Patient-Level Prediction (PLP)?\n\n**Patient-Level Prediction (PLP)** is a data-driven approach that uses machine learning or statistical models to estimate the risk of specific clinical outcomes for individual patients, based on their historical healthcare data.\n\nThe key goal of PLP is to answer personalized clinical questions like:\n\n> *\"For patients who present with chest pain leading to a hospital visit, can we predict which of these patients will go on to experience a heart attack after their hospital visit?\"*\n\nPLP focuses on using observational patient data such as diagnoses, medications, procedures, and demographics - to predict individual-level risks of future health events. While it may sound similar to precision medicine, there's a key distinction: precision medicine aims to tailor treatment plans based on a patient’s genetics, environment, and lifestyle, whereas PLP is specifically about forecasting outcomes for individual patients using data-driven models. These predictions can support timely and personalized clinical decisions.\n\n## Why PLP in Julia?\n\nWhile established PLP workflows are well-supported in R through OHDSI's suite of packages, our work explores an alternative approach using Julia - a high-performance language that enables building efficient and reproducible pipelines from end to end.\n\nJulia offers several advantages that make it well-suited for observational health research:\n\n- **Composability**: Julia’s modular design supports reusable components, making PLP pipelines easier to maintain and extend.\n \n- **Speed**: With performance comparable to C, Julia efficiently handles large, complex healthcare datasets.\n\n- **Unified Ecosystem**: Tools like `OHDSICohortExpressions.jl`, `DataFrames.jl`, `MLJ.jl` etc. 
integrate seamlessly, enabling cohort definition, data transformation, and modeling within one consistent environment.\n\nAdditionally, Julia features a rich and growing ecosystem with many tools for scientific computing and data science, making it a strong alternative for modern health informatics workflows.\n\n
\n
\n ![](./julia.webp)\n\n Julia Equivalents\n
\n\n# Reference: Foundation from the OHDSI PLP Framework\n\nThroughout the development of this PLP pipeline, I referenced the methodology presented in the following paper:\n\n> Reps, J. M., Schuemie, M. J., Suchard, M. A., Ryan, P. B., Rijnbeek, P. R., & Madigan, D. (2018). Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. *Journal of the American Medical Informatics Association, 25(8), 969–975*. [https://doi.org/10.1093/jamia/ocy032](https://doi.org/10.1093/jamia/ocy032)\n\nThis paper laid the groundwork for my implementation and inspired several core components of the project — from data curation to model evaluation.\n\n## Methodologies from the Paper\n\n1. Standardized Framework for PLP - Outlines a consistent process for building patient-level prediction models across datasets and settings.\n\n2. Defining the Prediction Problem - Emphasizes clear definition of target, outcome, and time-at-risk for valid predictions.\n\n3. Cohort Definition and Data Extraction - Uses standardized OMOP CDM cohorts to ensure reproducibility and consistent data extraction.\n\n4. Feature Construction - Derives meaningful predictors from observational data like conditions and demographics.\n\n5. Model Training and Evaluation - Trains ML models and evaluates them using metrics like AUC and cross-validation.\n\nWe are adapting this framework for our PLP pipeline to ensure a consistent approach.\n\n# Research Question\n\nAs an example, here is one question we could potentially explore within this PLP pipeline:\n\n> **Among patients diagnosed with hypertension, who will go on to develop diabetes?**\n\nThe focus is on identifying patients with hypertension who may progress to diabetes based on their medical history and risk factors.\n\n## Cohort Construction\n\nCohorts are groups of patients defined by specific criteria that are relevant to the research question. 
For this task, two main cohorts need to be defined:\n\n- **Target Cohort**: This refers to the group of patients we want to make predictions for. In our case, it includes patients who have been diagnosed with hypertension. These patients serve as the starting point for our prediction timeline.\n\n- **Outcome Cohort**: This refers to the clinical event we aim to predict. In our case, it includes patients from the target cohort who are subsequently diagnosed with diabetes within a specified time window. This event marks the outcome that our model will learn to forecast.\n\nThese cohort definitions are central to structuring the data pipeline, as they form the foundation for downstream tasks like feature extraction, model training, and evaluation.\n\n# Defining Cohorts using OHDSICohortExpressions.jl\n\nIn the context of this research, I received a 20GB synthetic dataset that contains 1,115,000 fake patients (1,000,000 alive and 115,000 deceased), each with 3 years of medical history. This dataset was provided as a DuckDB database, a lightweight, high-performance analytical database that allows fast querying of large datasets directly from local files without the need for a server. For more details on how to use DuckDB with Julia, refer to the [DuckDB Julia Client Documentation](https://duckdb.org/docs/stable/clients/julia.html).\n\nFor cohort creation, I used OHDSI cohort definitions provided directly by my mentor in the form of two JSON files:\n\n- **Target cohort**: `Hypertension.json`\n- **Outcome cohort**: `Diabetes.json`\n\nTo execute them, I used [OHDSICohortExpressions.jl](https://github.com/MechanicalRabbit/OHDSICohortExpressions.jl) to convert the JSON definitions into SQL queries, which were then run against the DuckDB database to extract the relevant cohorts.\n\nHere’s the breakdown of the process:\n\n1. Reading the cohort definitions from JSON files.\n\n2. Connecting to the DuckDB database, which stores the synthetic patient data.\n\n3. 
Translating the cohort definitions into SQL using OHDSICohortExpressions.jl.\n\n4. Executing the SQL queries to create the target and outcome cohorts in the database.\n\n### Cohort Definition Code\n\nHere’s how we set up the DuckDB connection and define cohorts using OHDSI JSON definitions:\n\n**File:** `cohort_definition.jl`\n\n```julia\nimport DBInterface: connect, execute, close!\nimport FunSQL: reflect, render\nimport OHDSICohortExpressions: translate\nusing DuckDB, DataFrames, DrWatson\n\n# We use DrWatson.jl to manage project-relative paths using `datadir(...)`\n# This ensures portable and reproducible file access within the project\n\n# Read the cohort definitions from JSON files (Hypertension and Diabetes definitions)\ntarget_json = read(datadir(\"exp_raw\", \"definitions\", \"Hypertension.json\"), String) # Target cohort (Hypertension)\noutcome_json = read(datadir(\"exp_raw\", \"definitions\", \"Diabetes.json\"), String) # Outcome cohort (Diabetes)\n\n# Establish a connection to the DuckDB database\nconnection = connect(DuckDB.DB, datadir(\"exp_raw\", \"synthea_1M_3YR.duckdb\"))\n\n# Function to process a cohort definition (translate the JSON to SQL and execute)\nfunction process_cohort(def_json, def_id, conn)\n catalog = reflect(conn; schema=\"dbt_synthea_dev\", dialect=:duckdb) # Reflect the database schema\n fun_sql = translate(def_json; cohort_definition_id=def_id) # Translate the JSON to SQL query\n sql = render(catalog, fun_sql) # Render the SQL query\n\n # Ensure the cohort table exists before inserting\n execute(conn, \"\"\"\n CREATE TABLE IF NOT EXISTS dbt_synthea_dev.cohort (\n cohort_definition_id INTEGER,\n subject_id INTEGER,\n cohort_start_date DATE,\n cohort_end_date DATE\n );\n \"\"\")\n\n # Execute the SQL query to insert cohort data into the database\n execute(conn, \"\"\"\n INSERT INTO dbt_synthea_dev.cohort\n SELECT * FROM ($sql) AS foo;\n \"\"\")\nend\n\n# Process the target and outcome cohorts\nprocess_cohort(target_json, 1, connection) # Define the 
target cohort (Hypertension)\nprocess_cohort(outcome_json, 2, connection) # Define the outcome cohort (Diabetes)\n\nclose!(connection)\n```\n\nThis code uses FunSQL.jl and OHDSICohortExpressions.jl to translate and render OHDSI ATLAS cohort definitions into executable SQL for DuckDB. The `translate` function from OHDSICohortExpressions.jl converts the JSON cohort definitions (Hypertension and Diabetes) into a FunSQL query representation. Then, `reflect` is used to introspect the DuckDB schema, and `render` from FunSQL.jl turns the abstract query into valid DuckDB SQL. The `process_cohort` function executes this SQL using `execute` to insert the resulting cohort data into the cohort table. This pipeline allows OHDSI cohort logic to be ported directly into a Julia workflow without relying on external OHDSI R tools.\n\n# Wrapping Up\n\nThis post covered the foundations of the PLP pipeline:\n\n- Explored observational health research, OMOP CDM, PLP, and Julia for large-scale clinical data analysis.\n\n- Formulated the research question: predicting diabetes progression in hypertension patients.\n\n- Explained OMOP CDM's role in standardizing clinical data.\n\n- Defined target and outcome cohorts for the study.\n\n- Used Julia to convert cohort definitions into executable SQL for DuckDB querying.\n\nIn the next post, I’ll walk through how we go from raw clinical data to predictive modeling, with Julia code examples that highlight feature extraction, data processing, and model training-bringing the full PLP pipeline to life.\n\n## Acknowledgements\n\nThanks to Jacob Zelko for his mentorship, clarity, and constant feedback throughout the project. I also thank the JuliaHealth community for building an ecosystem where composable science can thrive.\n\n[Jacob S. 
Zelko](https://jacobzelko.com): aka, [TheCedarPrince](https://github.com/TheCedarPrince)\n\n_Note: This blog post was drafted with the assistance of LLM technologies to support grammar, clarity and structure._\n\n", "supporting": [ - "plp-part1_files/figure-html" + "plp-part1_files" ], "filters": [], "includes": {} diff --git a/docs/JuliaHealthBlog/index.html b/docs/JuliaHealthBlog/index.html index ab53ff6..ead5573 100644 --- a/docs/JuliaHealthBlog/index.html +++ b/docs/JuliaHealthBlog/index.html @@ -2,7 +2,7 @@ - + @@ -33,16 +33,15 @@ - - + - + - + + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i diff --git a/docs/JuliaHealthBlog/posts/divyansh-gsoc/gsoc-2024-fellows.html b/docs/JuliaHealthBlog/posts/divyansh-gsoc/gsoc-2024-fellows.html index 983e683..36ca31b 100644 --- a/docs/JuliaHealthBlog/posts/divyansh-gsoc/gsoc-2024-fellows.html +++ b/docs/JuliaHealthBlog/posts/divyansh-gsoc/gsoc-2024-fellows.html @@ -2,7 +2,7 @@ - + @@ -24,9 +24,8 @@ vertical-align: middle; } /* CSS for syntax highlighting */ -html { -webkit-text-size-adjust: 100%; } pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } +pre > code.sourceCode > span { line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -37,7 +36,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } +pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -89,16 +88,15 @@ - - + - + - + + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png 
b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png new file mode 100644 index 0000000..6c9b7c2 Binary files /dev/null and b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/extension_loading.png differ diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.html b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.html new file mode 100644 index 0000000..e0980b9 --- /dev/null +++ b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/gsoc-phase1.html @@ -0,0 +1,874 @@ + + + + + + + + + + + + +GSoC ’25 Phase 1: Enabling OMOP CDM Tables and Preprocessing in HealthBase.jl – JuliaHealth + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

GSoC ’25 Phase 1: Enabling OMOP CDM Tables and Preprocessing in HealthBase.jl

+
+
gsoc
+
omop cdm
+
julia health
+
healthbase
+
tables
+
preprocessing
+
+
+ +
+
+ A summary of my project for Google Summer of Code - 2025 (Phase 1) +
+
+ + +
+ +
+
Author
+
+

Kosuri Lakshmi Indu

+
+
+ +
+
Published
+
+

July 12, 2025

+
+
+ + +
+ + + +
+ + +
+

Introduction 👋

+

Hi everyone! I’m Kosuri Lakshmi Indu, a third-year undergraduate student majoring in Computer Science and a GSoC 2025 contributor. Over the past few months, I’ve had the opportunity to work with the JuliaHealth community, where I got a chance to learn, contribute, and get involved in projects focused on improving how we work with real-world health data.

+

As part of this project, I contributed to HealthBase.jl, a Julia package within the JuliaHealth ecosystem. During Phase 1, I focused on two key features: adding seamless Tables.jl integration for working with OMOP CDM tables, and developing a set of reusable preprocessing functions to make data modeling easier and more consistent.

+

In this blog post, I’ll walk you through everything we accomplished in Phase 1, from the motivation behind these additions to examples of how they work in practice.

+
1. You can find my GSoC’25 Project Link
2. You can check out the project repository here: HealthBase.jl
3. Official Documentation of HealthBase.jl: Documentation
4. If you want, you can connect with me on LinkedIn and GitHub.
+
+

Background

+
+

What is HealthBase.jl?

+

HealthBase.jl is a lightweight foundational package that serves as a shared namespace across the JuliaHealth ecosystem. It’s designed to provide consistent interfaces, common utilities, and templates for building health data workflows in Julia, all while staying minimal and extensible.

+

As part of this project, we expanded HealthBase.jl to support working directly with OMOP CDM data. While the HealthTable interface is defined in HealthBase.jl as a generic, reusable interface for working with health data tables, the actual implementation for the OMOP Common Data Model lives in the OMOPCommonDataModel.jl package.

+

This design allows HealthBase.jl to provide a flexible interface that can be extended for any health data standard - not just OMOP CDM. So if someone wants to make HealthTable work for their packages, they can use the HealthTable interface from HealthBase.jl and define the necessary extensions accordingly.

+
+
+
+

Project Description

+

The project introduced a new type called HealthTable, which wraps OMOP CDM tables with built-in schema validation and metadata attachment. We also made these tables compatible with the wider Julia data ecosystem by implementing the Tables.jl interface, making it easier to plug OMOP CDM datasets into standard tools for modeling and analysis. To support real-world workflows, we added a suite of preprocessing utilities tailored to observational health data, including one-hot encoding, vocabulary compression, and concept mapping. All of these features aim to make it easier for researchers to load, validate, and prepare OMOP CDM datasets in a reproducible and scalable way, while keeping the code clean and modular.

+
+
+

Project Goals

+

The main goal of this GSoC project was to improve how researchers interact with OMOP CDM data in Julia by extending and strengthening the capabilities of HealthBase.jl. This involved building a structured, schema-aware interface for working with health data and providing built-in tools for preprocessing. Specifically, we focused on:

+
1. Introduce a schema-aware table interface for OMOP CDM: Develop a new structure that wraps OMOP CDM tables in a consistent, validated format. This interface should use Tables.jl under the hood and provide column-level metadata, schema enforcement, and compatibility with downstream tabular workflows.
2. Implement reusable preprocessing utilities for health data workflows: Add essential preprocessing functions like:
   • one_hot_encode for converting categorical columns into binary indicators,
   • apply_vocabulary_compression for grouping rare categories under a label,
   • map_concepts and map_concepts! for translating concept IDs to readable names via DuckDB or user-defined mappings.
3. Integrate HealthBase.jl with the JuliaHealth ecosystem: Ensure HealthBase.jl plays a foundational role in JuliaHealth by interoperating with other packages like OMOPCommonDataModel.jl, Tables.jl, DuckDB.jl, DataFrames.jl, etc. This makes it easier to build reproducible, modular workflows within the Julia ecosystem.

These goals lay the foundation for future JuliaHealth tooling, making OMOP CDM data easier to validate, preprocess, and use in reproducible health data science workflows.

+
+
+

Tasks

+
+

1. Core HealthTable Interface with Tables.jl Connection

+

A major part of this project was introducing a new type called HealthTable, which makes it easier to work with OMOP CDM tables in Julia in a reliable and standardized way.

+
+

What is HealthTable?

+

HealthTable is a wrapper around a Julia DataFrame that:

+
• Validates your OMOP CDM data against the official OMOP CDM schema
• Connects your data to Julia’s Tables.jl interface, so you can use it with any table-compatible package (like DataFrames.jl)
• Attaches metadata about each column (e.g., what kind of concept it represents)
• Gives detailed error messages if your data doesn’t follow the expected OMOP CDM format
• Uses PrettyTables.jl under the hood to display the table in a clean and readable format in the REPL or Jupyter notebooks
+
+

How is it defined?

+

The type is defined using Julia’s @kwdef macro to allow keyword-based construction:

+
@kwdef struct HealthTable{T}
    source::T
end

Tables.schema(ht)                     # View schema (column names and types)
Tables.rows(ht)                       # Iterate over rows
Tables.columns(ht)                    # Access columns as named tuples
Tables.materializer(typeof(ht))       # Used to materialize tables

Here, source is the original validated data (usually a DataFrame), and all logic is built around enforcing the schema and providing utilities on top of this source. Once wrapped in a HealthTable, you can interact with it using the complete Tables.jl interface. This makes HealthTable compatible with the entire Julia Tables ecosystem, meaning users can directly use OMOP CDM data with existing tooling, without needing custom adapters.
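To make this concrete, here is a minimal sketch of what that interoperability looks like in practice. It assumes the `HealthTable` constructor shown later in this post, and the column values are made up for illustration:

```julia
using DataFrames, Tables
using HealthBase

# A small OMOP CDM-shaped table, mirroring the example later in this post
df = DataFrame(person_id = [1, 2], gender_concept_id = [8507, 8532])
ht = HealthTable(df; omop_cdm_version = "v5.4.1")

# Generic Tables.jl code works without any OMOP-specific adapter
for row in Tables.rows(ht)
    println(row.person_id, " => ", row.gender_concept_id)
end

# Any Tables.jl sink can materialize it, e.g. back into a DataFrame
df2 = DataFrame(ht)
```

Because any Tables.jl sink accepts `ht` directly, the same pattern applies to CSV writers, query engines, or plotting recipes that consume tables.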

+
+
+

Creating a HealthTable

+
using DataFrames, OMOPCommonDataModel, InlineStrings, Serialization, Dates, FeatureTransforms, DBInterface, DuckDB
using HealthBase

df = DataFrame(
    person_id = [1, 2, 3, 4],
    gender_concept_id = [8507, 8532, 8507, 8532],
    year_of_birth = [1990, 1985, 1992, 1980],
    race_concept_id = [8527, 8516, 8527, 8527],
)

ht = HealthTable(df; omop_cdm_version="v5.4.1")

+
+HealthTable Extension Loading +
+
+
+ HealthTable +
+
+
+

Accessing the Source

+

Each HealthTable instance stores the original validated data in ht.source:

+
typeof(ht.source)                           # DataFrame
metadata(ht.source, "omop_cdm_version")     # "v5.4.1"
colmetadata(ht.source, :gender_concept_id)  # Per-column metadata
+
+

Handling Invalid Tables

+

If the input table does not conform to the expected OMOP CDM schema - for example, if a column has the wrong data type - the HealthTable constructor throws a detailed and user-friendly error. This makes it easier to catch mistakes early and ensures that only well-structured, validated data enters your pipeline. You can optionally relax this strict enforcement by setting disable_type_enforcement = true, in which case the constructor will emit warnings instead of throwing errors.
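As a sketch of both behaviors (the exact error type and message wording are assumptions, not guaranteed by the package):

```julia
using DataFrames
using HealthBase

# year_of_birth stored as strings: the wrong type for the OMOP CDM schema
bad_df = DataFrame(person_id = [1, 2], year_of_birth = ["1990", "1985"])

# Strict mode (default): the constructor throws a descriptive schema error
try
    HealthTable(bad_df; omop_cdm_version = "v5.4.1")
catch err
    @warn "Schema validation failed as expected" err
end

# Relaxed mode: warnings instead of errors, per the keyword described above
ht = HealthTable(bad_df; omop_cdm_version = "v5.4.1",
                 disable_type_enforcement = true)
```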

+
+
+
+

2. Preprocessing Functions

+

Once your data is wrapped in a HealthTable, you can apply several built-in preprocessing functions to prepare it for modeling. These functions help transform OMOP CDM data into formats that are more interpretable, compact, and ML-friendly.

+
+

Concept Mapping

+

Replaces concept IDs (e.g., 8507) with readable names (e.g., “Male”) using an OMOP CDM vocabulary table in DuckDB. If no schema is provided, the function will use the default main schema in the DuckDB connection. It supports both single and multiple columns, and allows custom naming for output columns. This greatly enhances data interpretability and allows models to work with semantically meaningful features.

+
conn = DBInterface.connect(DuckDB.DB, "path_to_file.duckdb")

# Single column (adds gender_concept_id_mapped)
ht_mapped = map_concepts(ht, :gender_concept_id, conn; schema = "schema_name")

# Multiple columns, drop original
ht_mapped2 = map_concepts(ht, [:gender_concept_id, :race_concept_id], conn;
                          new_cols=["gender", "race"], drop_original=true)

# In-place variant
map_concepts!(ht, :gender_concept_id, conn; schema = "schema_name")
+
+ Map Concepts +
+


+

You can also perform concept mapping manually without relying on a database connection. This is especially useful for smaller datasets or when you already know the categorical values. For instance, by defining a custom dictionary such as Dict(8507 => "Male", 8532 => "Female"), you can create a new column by mapping over the existing concept ID column using functions like Base.map or map!. This manual approach allows for flexible labeling and is particularly helpful during exploratory data analysis or early-stage prototyping when a full vocabulary table isn’t available or necessary.
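The manual approach can be sketched as follows; the "Unknown" fallback label is an added assumption for illustration:

```julia
using DataFrames

df = DataFrame(person_id = [1, 2, 3], gender_concept_id = [8507, 8532, 8507])

# The dictionary from the text above
gender_labels = Dict(8507 => "Male", 8532 => "Female")

# Map over the concept ID column, falling back to "Unknown" for unseen IDs
df.gender = map(id -> get(gender_labels, id, "Unknown"), df.gender_concept_id)
# df.gender is now ["Male", "Female", "Male"]
```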

+
+
+

Vocabulary Compression

+

Groups rare categorical values into an "Other" or a user-specified label to simplify high-cardinality columns. This is useful for reducing sparsity in features with many rare categories. The function calculates the frequency of each level and retains only those above a threshold, replacing the rest. It ensures that model inputs remain interpretable and avoids overfitting on underrepresented categories.

+
ht_compressed = apply_vocabulary_compression(ht_mapped; cols = [:race_concept_id],
                    min_freq = 2, other_label = "Other")
+
+ Vocabulary Compression +
+
+
+

One-Hot Encoding

+

Converts a categorical column into binary columns. This prepares your features for ML models that can’t handle raw categorical values. The function works on both HealthTable and DataFrame objects and allows you to specify whether to drop the original column. It’s especially useful when working with encoded health conditions, demographics, or other categorical features.

+
ht_ohe = one_hot_encode(ht_compressed; cols = [:gender_concept_id, :race_concept_id])
+
+
+ One Hot Encoding +
+


+

Each of these preprocessing functions is designed to be both composable and schema-aware. This means you can mix and match transformations like one-hot encoding, vocabulary compression, and concept mapping depending on the needs of your workflow. You have the flexibility to work directly with a HealthTable or switch to a regular DataFrame when needed. This modular design ensures that your data processing steps are reproducible and consistent across different JuliaHealth tools, making health data analysis more efficient, reliable, and seamlessly integrated with Julia’s modern data ecosystem.
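A minimal end-to-end sketch of such a composed pipeline, assuming a hypothetical DuckDB file and that the mapped columns are named by the `new_cols` keyword as in the earlier examples:

```julia
using DataFrames, DBInterface, DuckDB
using HealthBase

conn = DBInterface.connect(DuckDB.DB, "synthetic_omop.duckdb")  # hypothetical file

df = DataFrame(
    person_id = [1, 2, 3, 4],
    gender_concept_id = [8507, 8532, 8507, 8532],
    race_concept_id = [8527, 8516, 8527, 8527],
)
ht = HealthTable(df; omop_cdm_version = "v5.4.1")

# Chain the three utilities from this post into one pipeline
ht = map_concepts(ht, [:gender_concept_id, :race_concept_id], conn;
                  new_cols = ["gender", "race"], drop_original = true)
ht = apply_vocabulary_compression(ht; cols = [:race], min_freq = 2,
                                  other_label = "Other")
ht = one_hot_encode(ht; cols = [:gender, :race])
```

Each step returns a new HealthTable, so intermediate results stay schema-aware throughout.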

+
+
+
+
+

Contribution Beyond Coding

+
+

1. Organizing Meetings and Communication

+

Throughout the project, I had regular weekly meetings with my mentor, Jacob Zelko, where we discussed progress, clarified doubts, and made plans for the upcoming tasks. These sessions helped me better understand design decisions and refine my implementations with expert feedback. In addition to our meetings, I actively communicated via Zulip and Slack, where we discussed code behavior, errors, ideas, and other project-related decisions in detail. This consistent back-and-forth ensured a clear direction and rapid iteration.

+
+
+

2. Engaging with the JuliaHealth Ecosystem

+

Beyond contributing code to HealthBase.jl, I also engaged with the broader JuliaHealth ecosystem. After discussing with my mentor, I opened and contributed to issues in related JuliaHealth repositories, identifying potential bugs and suggesting enhancements. These contributions aimed to improve the coherence and usability of JuliaHealth tools as a whole, not just within my assigned project.

+
+
+
+

Conclusion and Future Development

+

Contributing to HealthBase.jl during Phase 1 of GSoC has been a rewarding and insightful journey. It gave me the opportunity to dive deep into the structure of OMOP CDM, explore Julia’s composable interfaces like Tables.jl, and build features that directly support observational health workflows. Learning to design with extensibility in mind, especially when working with healthcare data, has shaped how I now approach open-source problems.

+

Looking ahead, here are some of the directions I’d like to explore to further strengthen HealthBase.jl and its integration within the JuliaHealth ecosystem:

+
• Refine schema handling: Improve how internal schema checks reflect the structure of the underlying data source. This includes better alignment with OMOP CDM specifications and improved flexibility when dealing with edge cases or schema variations.
• Strengthen Tables.jl integration: Enhance the robustness of how HealthTable interacts with the Tables.jl interface, ensuring better compatibility and reducing any overhead when working with downstream packages like DataFrames.jl.
• Add new preprocessing functions: Extend the current toolkit by implementing more real-world utilities such as missing value imputation, cohort filtering, etc.
• Address related issues in the ecosystem: Collaborate with maintainers to help resolve open issues related to the project in OMOPCommonDataModel.jl, especially Issue #41 and Issue #40.

Overall, this phase has not only improved the package but also helped me grow in terms of design thinking, working with abstractions, and contributing to a larger ecosystem. I’m looking forward to what comes next and to making HealthBase even more useful for the JuliaHealth community and beyond.

+
+
+

Acknowledgements

+

A big thank you to Jacob S. Zelko for being such a kind and thoughtful mentor throughout this project. His clear guidance, encouragement, and helpful feedback made a huge difference at every step. I’m also really thankful to the JuliaHealth community for creating such a welcoming and inspiring space to learn, build, and grow. It’s been a joy to be part of it.

+

Jacob S. Zelko, aka TheCedarPrince

+

Note: This blog post was drafted with the assistance of LLM technologies to support grammar, clarity and structure.

+ + +
+ +

Citation

BibTeX citation:
@online{lakshmi_indu2025,
  author = {Lakshmi Indu, Kosuri},
  title = {GSoC '25 {Phase} 1: {Enabling} {OMOP} {CDM} {Tables} and
    {Preprocessing} in {HealthBase.jl}},
  date = {2025-07-12},
  langid = {en}
}
For attribution, please cite this work as:
Lakshmi Indu, Kosuri. 2025. “GSoC ’25 Phase 1: Enabling OMOP CDM Tables and Preprocessing in HealthBase.jl.” July 12, 2025.
+ + +
+
+ +
+ + + + + \ No newline at end of file diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png new file mode 100644 index 0000000..173a1f3 Binary files /dev/null and b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/healthtable.png differ diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png new file mode 100644 index 0000000..f791c63 Binary files /dev/null and b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/map_concepts.png differ diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png new file mode 100644 index 0000000..7a5ddc2 Binary files /dev/null and b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/one_hot_encoding.png differ diff --git a/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png new file mode 100644 index 0000000..3333aa9 Binary files /dev/null and b/docs/JuliaHealthBlog/posts/indu-gsoc-phase1/vocab_compression.png differ diff --git a/docs/JuliaHealthBlog/posts/indu-plp-part1/plp-part1.html b/docs/JuliaHealthBlog/posts/indu-plp-part1/plp-part1.html index 483f463..72d2bf1 100644 --- a/docs/JuliaHealthBlog/posts/indu-plp-part1/plp-part1.html +++ b/docs/JuliaHealthBlog/posts/indu-plp-part1/plp-part1.html @@ -2,7 +2,7 @@ - + @@ -24,9 +24,8 @@ vertical-align: middle; } /* CSS for syntax highlighting */ -html { -webkit-text-size-adjust: 100%; } pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } +pre > code.sourceCode > span { line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -37,7 +36,7 @@ } @media print { pre > 
code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } +pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -69,16 +68,15 @@ - - + - + - + + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i diff --git a/docs/JuliaHealthBlog/posts/jay-gsoc/gsoc-2024-fellows.html b/docs/JuliaHealthBlog/posts/jay-gsoc/gsoc-2024-fellows.html index 34e4856..db43a7b 100644 --- a/docs/JuliaHealthBlog/posts/jay-gsoc/gsoc-2024-fellows.html +++ b/docs/JuliaHealthBlog/posts/jay-gsoc/gsoc-2024-fellows.html @@ -2,7 +2,7 @@ - + @@ -24,9 +24,8 @@ vertical-align: middle; } /* CSS for syntax highlighting */ -html { -webkit-text-size-adjust: 100%; } pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } +pre > code.sourceCode > span { line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -37,7 +36,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } +pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -89,16 +88,15 @@ - - + - + - + + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i diff --git a/docs/index.html b/docs/index.html index b352be4..3a1c705 100644 --- a/docs/index.html +++ b/docs/index.html @@ -2,7 +2,7 @@ - + @@ -31,16 +31,15 @@ - - + - + - + + }; + var bibliorefs = window.document.querySelectorAll('a[role="doc-biblioref"]'); + for (var i=0; i