Use case: lab report enrichment #584

@recalcitrantsupplant

I have a number of PDF lab reports from different chemical analysis laboratories, which contain similar information in different formats. These have been OCR'd to JSON containing the extracted text, its position, confidence scores, etc. I then convert this to RDF, at which point I need to enrich it by calculating a number of additional values: some can be derived directly from the RDF, while others need to be piped through an external process such as human review, an LLM, or ML models.

As this is a proof of concept, I am currently building most of this logic into SPARQL queries. With more time I would break these into more granular rules, which would then be grouped by lab format.

The main things I would be looking for when using rules would be:

  • ability to have granular versioned rules and record provenance for execution (time etc.)
  • ability to group rules
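To make the two requirements above concrete, here is a minimal Python sketch of what I have in mind, not a proposal for how the spec should model it. All names (`Rule`, `RuleGroup`, `ExecutionRecord`, `run_group`) are hypothetical, and `execute` stands in for whatever actually sends a SPARQL update to a store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One granular rule: an identifier, a version, and its SPARQL text."""
    rule_id: str
    version: str
    sparql: str


@dataclass
class RuleGroup:
    """Rules grouped together, e.g. for one lab's report format."""
    name: str
    rules: list[Rule] = field(default_factory=list)


@dataclass(frozen=True)
class ExecutionRecord:
    """Provenance for one rule execution."""
    rule_id: str
    version: str
    group: str
    started_at: datetime


def run_group(group: RuleGroup, execute: Callable[[str], None]) -> list[ExecutionRecord]:
    """Run each rule in order and record which version ran, and when (UTC)."""
    records = []
    for rule in group.rules:
        started = datetime.now(timezone.utc)
        execute(rule.sparql)  # hypothetical hook: the caller supplies the store call
        records.append(ExecutionRecord(rule.rule_id, rule.version, group.name, started))
    return records
```

In practice the records would probably be written back as triples next to the data they describe rather than kept in memory.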

Some current example SPARQL:

INSERT {
<cellID> ex:hasChemName ?Original_Chem_Name .
<cellID> ex:hasTextResult ?Text_Result .
...
}
WHERE {
...
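  # Strip a trailing NXX code (N followed by two digits) from the chemical name
  BIND(REPLACE(?chem_name_raw, "N\\d{2}$", "") AS ?Original_Chem_Name)
  # Strip a leading NXX code from the raw value
  BIND(REPLACE(?value_raw, "^N\\d{2}", "") AS ?Text_Result)
  # Capture the leading NXX code itself, or "" if there is none
  BIND(
    IF(REGEX(STR(?value_raw), "^N\\d{2}"),
       SUBSTR(STR(?value_raw), 1, 3),  # NXX
       ""
    ) AS ?NXX
  )
  # Split a leading "< " qualifier from the result
  BIND(REPLACE(?Text_Result, "^< ", "") AS ?Result)
  BIND(IF(STRSTARTS(?Text_Result, "< "), "<", "") AS ?Prefix)
  # Count significant figures as the number of characters left after cleanup
  BIND(STRLEN(
         REPLACE(
           REPLACE(STR(?Result), "\\.", ""),  # 1. remove decimal point
           "^0+", ""                          # 2. strip all leading zeros
         )
       ) AS ?Result_Sig_Figs)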
}
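For checking the string logic without a triple store, the BIND expressions above can be mirrored in Python roughly as follows. This is only a sketch: `parse_cell` and the sample inputs in the usage lines are made up for illustration, and Python's `re` is not guaranteed to behave identically to SPARQL's regex dialect in every edge case.

```python
import re


def parse_cell(chem_name_raw: str, value_raw: str) -> dict:
    """Mirror the BIND expressions of the query for a single table cell."""
    # Trailing NXX code on the name, leading NXX code on the value
    original_chem_name = re.sub(r"N\d{2}$", "", chem_name_raw)
    text_result = re.sub(r"^N\d{2}", "", value_raw)
    nxx = value_raw[:3] if re.match(r"N\d{2}", value_raw) else ""

    # Split a leading "< " qualifier from the result
    result = re.sub(r"^< ", "", text_result)
    prefix = "<" if text_result.startswith("< ") else ""

    # Significant figures: drop decimal points, then leading zeros, then count
    digits = re.sub(r"^0+", "", result.replace(".", ""))

    return {
        "Original_Chem_Name": original_chem_name,
        "Text_Result": text_result,
        "NXX": nxx,
        "Result": result,
        "Prefix": prefix,
        "Result_Sig_Figs": len(digits),
    }


# Hypothetical inputs, not taken from a real report
print(parse_cell("BenzeneN01", "N01< 0.050"))
print(parse_cell("Toluene", "12.30"))
```

One thing this makes easy to see: the significant-figure count is purely a character count, so a non-numeric result (for example a two-letter text code) would still get a count equal to its length. That may be worth a separate rule.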

Other examples which can be derived from the RDF include flagging certain values for human review, based on thresholds and other context in the table.
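As a rough illustration of the review-flagging case, a rule of this kind might look like the Python sketch below. The function name, the specific checks, and the default thresholds are all assumptions for the sake of the example; the real rules would depend on the lab format and the surrounding table context.

```python
from typing import Optional


def review_reasons(
    result: str,
    ocr_confidence: float,
    min_confidence: float = 0.90,         # assumed threshold, not from a real rule
    upper_limit: Optional[float] = None,  # assumed per-analyte limit, if any
) -> list[str]:
    """Return the reasons a value should go to human review (empty if none)."""
    reasons = []
    if ocr_confidence < min_confidence:
        reasons.append("low OCR confidence")
    try:
        value = float(result)
    except ValueError:
        # OCR errors such as a letter O in place of a zero end up here
        reasons.append("result is not numeric")
    else:
        if upper_limit is not None and value > upper_limit:
            reasons.append("result above limit")
    return reasons
```

Returning the reasons rather than a bare boolean means the flag itself can carry an explanation into the review step.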

Example lab report the data is extracted from:
[Image: example lab report]

This all seems within the scope of the current draft to me. The versioning can be additional triples which I manage, and the provenance can be built into the rule logic itself.

Labels: Rules (For SHACL 1.2 Rules spec.), UCR (Use Cases and Requirements)
