Description
I have a number of PDF lab reports from different chemical analysis laboratories, containing similar information in different formats. These have been OCR'd to JSON with the extracted text, its position, confidence scores, etc. I'm then converting this JSON to RDF, at which point I need to enrich it and calculate a number of additional values: some can be derived directly from the RDF, while others will need to be piped through an external process such as human review, an LLM, or ML models.
As this is a proof of concept, I am currently building most of this logic into SPARQL queries. With more time I would break these into more granular rules, which would then be grouped for a particular lab's format.
The main things I would be looking for when using rules would be:
- ability to have granular, versioned rules and to record provenance for each execution (time etc.)
- ability to group rules
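To make these two requirements concrete, here is a minimal Python sketch of how I picture them, outside of any rules engine. All names (`Rule`, `RuleGroup`, `ExecutionRecord`, `run_group`) and the rule IDs and version strings are hypothetical and only for illustration; `execute` stands in for whatever actually sends the SPARQL update to a store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One granular rule: an identifier, a version, and its SPARQL text."""
    rule_id: str
    version: str
    sparql: str


@dataclass
class RuleGroup:
    """A named collection of rules, e.g. one group per lab report format."""
    name: str
    rules: list[Rule] = field(default_factory=list)


@dataclass(frozen=True)
class ExecutionRecord:
    """Provenance for a single rule execution."""
    rule_id: str
    version: str
    group: str
    started_at: datetime
    ended_at: datetime


def run_group(group: RuleGroup, execute: Callable[[str], None]) -> list[ExecutionRecord]:
    """Run every rule in the group in order and return one record per rule."""
    records = []
    for rule in group.rules:
        started = datetime.now(timezone.utc)
        execute(rule.sparql)  # hypothetical hook: pass the update to the triple store
        ended = datetime.now(timezone.utc)
        records.append(
            ExecutionRecord(rule.rule_id, rule.version, group.name, started, ended)
        )
    return records
```

In practice the records would be written back as triples alongside the data rather than kept as Python objects.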
Some current example SPARQL:
INSERT {
  <cellID> ex:hasChemName ?Original_Chem_Name .
  <cellID> ex:hasTextResult ?Text_Result .
  ...
}
WHERE {
  ...
  BIND(REPLACE(?chem_name_raw, "N\\d{2}$", "") AS ?Original_Chem_Name)
  BIND(REPLACE(?value_raw, "^N\\d{2}", "") AS ?Text_Result)
  BIND(
    IF(REGEX(STR(?value_raw), "^N\\d{2}"),
       SUBSTR(STR(?value_raw), 1, 3),  # NXX
       ""
    ) AS ?NXX
  )
  BIND(REPLACE(?Text_Result, "^< ", "") AS ?Result)
  BIND(IF(STRSTARTS(?Text_Result, "< "), "<", "") AS ?Prefix)
  BIND(STRLEN(
    REPLACE(
      REPLACE(STR(?Result), "\\.", ""),  # 1. remove decimal point
      "^0+", ""                          # 2. strip all leading zeros
    )
  ) AS ?Result_Sig_Figs)
}
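For checking the string handling without a triple store, the `BIND` steps above can be mirrored in plain Python. This is only a sketch of the same logic, not part of the pipeline: `parse_cell` is a hypothetical name, and the sample inputs in the usage lines are made up rather than taken from a real report.

```python
import re


def parse_cell(chem_name_raw: str, value_raw: str) -> dict:
    """Mirror the BIND expressions of the SPARQL query step by step."""
    # Strip a trailing NXX code from the chemical name.
    original_chem_name = re.sub(r"N\d{2}$", "", chem_name_raw)
    # Strip a leading NXX code from the value.
    text_result = re.sub(r"^N\d{2}", "", value_raw)
    # Keep the NXX code itself (first three characters) if present.
    nxx = value_raw[:3] if re.match(r"N\d{2}", value_raw) else ""
    # Separate the "< " prefix from the remaining result text.
    result = re.sub(r"^< ", "", text_result)
    prefix = "<" if text_result.startswith("< ") else ""
    # Count digits after removing the decimal point and leading zeros.
    digits = re.sub(r"^0+", "", result.replace(".", ""))
    return {
        "Original_Chem_Name": original_chem_name,
        "Text_Result": text_result,
        "NXX": nxx,
        "Result": result,
        "Prefix": prefix,
        "Result_Sig_Figs": len(digits),
    }


# Made-up inputs for illustration only.
flagged = parse_cell("BenzeneN01", "N01< 0.050")
plain = parse_cell("Toluene", "12.30")
```

Like the SPARQL version, this counts every remaining character, so it treats trailing zeros of an integer as significant and does not special-case signs or exponents.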
Other examples which can be derived from the RDF include flagging certain values for human review, based on thresholds and other context in the table.
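A rough Python sketch of what such a flagging rule could look like is below. The function name `needs_review`, the choice of OCR confidence as the "other context", and the default cut-off of 0.9 are all assumptions for illustration; the real thresholds and context would come from the lab format.

```python
from decimal import Decimal, InvalidOperation


def needs_review(
    result: str,
    threshold: Decimal,
    ocr_confidence: float,
    min_confidence: float = 0.9,  # assumed cut-off, not taken from real data
) -> bool:
    """Return True if a cell should be sent to a human reviewer."""
    # Low OCR confidence: the extracted text itself may be wrong.
    if ocr_confidence < min_confidence:
        return True
    # Text that does not parse as a finite number cannot be checked automatically.
    try:
        value = Decimal(result)
    except InvalidOperation:
        return True
    if not value.is_finite():
        return True
    # Otherwise flag only values above the threshold.
    return value > threshold
```

For a "<"-prefixed result, `result` is the reporting limit, so this flags cells whose reporting limit is itself above the threshold.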
Example lab report the data is extracted from:
This all seems within the scope of the current draft to me. The versioning can be additional triples which I manage, and the provenance can be built into the rule logic itself.