Use case: lab report enrichment

I have a number of PDF lab reports from different chemical analysis laboratories, containing similar information in different formats. They have been OCR'd to JSON, which holds the extracted text, its position, confidence scores, etc. I then convert this to RDF, at which point I need to enrich it and calculate a number of additional values: some can be derived directly from the RDF, while others will need to be piped through an external process such as human review, an LLM, or ML models.

As this is a proof of concept, I am currently building most of this logic into SPARQL queries. With more time I would break these into more granular rules, which would then be grouped per lab format.
 
The main things I would be looking for when using rules would be:
- ability to have granular, versioned rules and to record provenance for each execution (time, etc.)
- ability to group rules
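
A rough sketch of how these two requirements could be represented, assuming rule metadata is kept as plain triples alongside the data. All `ex:` terms and IRIs are hypothetical placeholders, not part of any draft; `owl:versionInfo`, `dcterms:isPartOf` and `rdfs:label` are existing vocabulary terms.

```sparql
PREFIX ex:      <http://example.org/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

INSERT DATA {
  # Hypothetical group collecting the rules for one lab's report format
  ex:group-labA a ex:RuleGroup ;
      rdfs:label "Rules for the Lab A report format" .

  # Hypothetical rule entries; here a new version would get a new IRI
  ex:rule-strip-nxx-v1 a ex:Rule ;
      rdfs:label "Strip the NXX token from names and values" ;
      owl:versionInfo "1.0.0" ;
      dcterms:isPartOf ex:group-labA .

  ex:rule-sig-figs-v1 a ex:Rule ;
      rdfs:label "Count significant figures of a result" ;
      owl:versionInfo "1.0.0" ;
      dcterms:isPartOf ex:group-labA .
}
```

With something like this in place, running "all rules for Lab A" becomes a lookup of everything that is `dcterms:isPartOf ex:group-labA`.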

Some current example SPARQL:
```sparql
INSERT {
<cellID> ex:hasChemName ?Original_Chem_Name .
<cellID> ex:hasTextResult ?Text_Result .
...
}
WHERE {
...
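  # Chemical name without a trailing NXX token (N plus two digits)
  BIND(REPLACE(?chem_name_raw, "N\\d{2}$", "") AS ?Original_Chem_Name)
  # Value without a leading NXX token
  BIND(REPLACE(?value_raw, "^N\\d{2}", "") AS ?Text_Result)
  # The leading NXX token itself, or "" when absent
  BIND(
    IF(REGEX(STR(?value_raw), "^N\\d{2}"),
       SUBSTR(STR(?value_raw), 1, 3),
       ""
    ) AS ?NXX
  )
  # Split a leading "< " from the rest of the value
  BIND(REPLACE(?Text_Result, "^< ", "") AS ?Result)
  BIND(IF(STRSTARTS(?Text_Result, "< "), "<", "") AS ?Prefix)
  # Significant figures: length of what remains after the two cleanup steps
  BIND(STRLEN(
         REPLACE(
           REPLACE(STR(?Result), "\\.", ""),  # 1. remove decimal point
           "^0+", ""                          # 2. strip all leading zeros
         )
       ) AS ?Result_Sig_Figs)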
}
```
Other examples which can be derived from the RDF include flagging certain values for human review, based on thresholds and other context in the table.
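
As an illustration of such a flagging rule, here is a minimal sketch covering two cases: a low OCR confidence score, and a result above a per-chemical limit. The predicates `ex:hasOcrConfidence`, `ex:hasResult`, `ex:forChemName`, `ex:reviewAbove`, `ex:needsReview` and `ex:reviewReason`, as well as the 0.80 cut-off, are assumptions made for the example; only `ex:hasChemName` appears in my current queries.

```sparql
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT {
  ?cell ex:needsReview true ;
        ex:reviewReason ?reason .
}
WHERE {
  {
    # Case 1: OCR confidence below an assumed cut-off
    ?cell ex:hasOcrConfidence ?confidence .
    FILTER(xsd:decimal(?confidence) < 0.80)
    BIND("OCR confidence below 0.80" AS ?reason)
  }
  UNION
  {
    # Case 2: result above a limit stored for that chemical
    ?cell ex:hasChemName ?name ;
          ex:hasResult ?result .
    ?limit ex:forChemName ?name ;
           ex:reviewAbove ?max .
    FILTER(xsd:decimal(?result) > xsd:decimal(?max))
    BIND("Result above review limit" AS ?reason)
  }
}
```

Keeping the limits as data (`?limit`) rather than hard-coding them in the query would let thresholds change without touching the rule itself.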

Example lab report from which the data is extracted:
<img width="1677" height="1026" alt="Image" src="https://github.com/user-attachments/assets/fdd94cd7-7605-4459-8aea-934307a872de" />

This all seems within the scope of the current draft to me. The versioning can be additional triples which I manage, and the provenance can be built into the rule logic itself.
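
A minimal sketch of what "provenance built into the rule logic" could look like, using the prefix rule as the example. The rule IRI, `ex:hasPrefix` and `ex:appliedRule` are hypothetical; the `prov:` terms come from PROV-O, and `NOW()`, `STRUUID()` and `IRI()` are standard SPARQL 1.1 functions.

```sparql
PREFIX ex:   <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

INSERT {
  # The derived value itself
  ?cell ex:hasPrefix ?Prefix .

  # Provenance: one activity node per processed cell
  ?activity a prov:Activity ;
      prov:used ?cell ;
      ex:appliedRule ex:rule-prefix-v1 ;   # hypothetical link to the rule
      prov:endedAtTime ?now .

  # Versioning as an additional triple that I manage myself
  ex:rule-prefix-v1 owl:versionInfo "1.0.0" .
}
WHERE {
  ?cell ex:hasTextResult ?Text_Result .
  BIND(IF(STRSTARTS(?Text_Result, "< "), "<", "") AS ?Prefix)
  BIND(NOW() AS ?now)
  # Mint a fresh activity IRI for each solution
  BIND(IRI(CONCAT(STR(?cell), "/run/", STRUUID())) AS ?activity)
}
```

Since `NOW()` returns the same value throughout a single query execution, all activities from one run share a timestamp, which makes it easy to select everything a given run produced.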

Use case: lab report enrichment #584
