Dockerizable source code for the baseline system for the Chemotherapy Treatment Timelines shared task.
This is research code which depends on other research code, none of which is shrink-wrapped. Run at your own risk, and do not use it in any kind of clinical decision-making context.
While operational, there are known issues in the code's dependencies which are still being resolved.
## Core dependencies
There are three main separate software packages that this code uses:
cTAKES contains several tools for text engineering and information extraction with a focus on clinical text; it makes heavy use of [Apache UIMA](https://uima.apache.org).
Within cTAKES, the main module which drives this code is the cTAKES [Python Bridge to Java](https://github.com/apache/ctakes/tree/main/ctakes-pbj).
While cTAKES is written in Java, the Python Bridge to Java (*ctakes-pbj*) allows Python code to process text artifacts the same way one can with Java code in cTAKES. *ctakes-pbj* accomplishes this by passing text artifacts and their extracted information between the relevant Java and Python processes using [DKPro cassis](https://github.com/dkpro/dkpro-cassis) for serialization, [Apache ActiveMQ](https://activemq.apache.org) for message brokering, and [stomp.py](https://github.com/jasonrbriggs/stomp.py) for Python-side receipt from and transmission to ActiveMQ.
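The handoff pattern can be pictured with a toy in-memory stand-in. This is a minimal sketch only: the real system serializes full UIMA CAS objects with DKPro cassis and brokers them through ActiveMQ via STOMP, and the JSON field names below are invented for illustration.

```python
import json
import queue

# Stand-in for an ActiveMQ queue; in the real system the two sides are
# separate Java and Python processes connected through a broker.
broker = queue.Queue()

# "Java" side: serialize a text artifact plus its annotations and send it.
artifact = {
    "text": "Started cisplatin on June 3rd.",
    "annotations": [{"type": "ChemoMention", "begin": 8, "end": 17}],
}
broker.put(json.dumps(artifact))

# "Python" side: receive, deserialize, and process the artifact.
received = json.loads(broker.get())
print(received["annotations"][0]["type"])  # ChemoMention
```

The point of the pattern is that neither side needs the other's runtime in-process; they only need to agree on the serialized representation of the artifact.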
Timenorm provides methods for identifying and normalizing date and time expressions. We use a customized version (included as a Maven module) where we change a heuristic for approximate dates to better address the needs of the timelines project.
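To give a flavor of what normalization produces, here is a toy sketch that resolves a few relative expressions against a document creation time (DCT). The function below is invented for illustration; Timenorm's grammar-based normalization is far more general.

```python
from datetime import date, timedelta

def normalize_relative(expr: str, dct: date) -> str:
    """Toy normalizer: resolve a handful of relative date expressions
    against the document creation time and return an ISO 8601 string."""
    expr = expr.lower().strip()
    if expr == "today":
        return dct.isoformat()
    if expr == "yesterday":
        return (dct - timedelta(days=1)).isoformat()
    if expr.endswith("weeks ago"):
        n = int(expr.split()[0])
        return (dct - timedelta(weeks=n)).isoformat()
    raise ValueError(f"unsupported expression: {expr}")

dct = date(2010, 6, 15)
print(normalize_relative("yesterday", dct))    # 2010-06-14
print(normalize_relative("2 weeks ago", dct))  # 2010-06-01
```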
We used Huggingface Transformers for training the TLINK model, and use their [Pipelines interface](https://huggingface.co/docs/transformers/main_classes/pipelines) for loading the model for inference. We use the [Huggingface Hub](https://huggingface.co/HealthNLP) for model storage.
## Recommended Hardware
A CUDA-capable GPU with at least 500 MB of VRAM is preferred for running the TLINK classifier, but with sufficient RAM the model can be run on CPU. Outside of Docker, nothing needs to be done to effect this change. If, however, you want to run the Docker image on a machine with no GPU (or to disable GPU use), comment out the following lines in `docker-compose.yml`:
```
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
This also means you would not need the NVIDIA container toolkit.
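For example, assuming the service uses the standard Compose GPU reservation layout (the surrounding keys here are an assumption, not copied from the repository), the disabled section would look like:

```
# deploy:
#   resources:
#     reservations:
#       devices:
#         - driver: nvidia
#           count: 1
#           capabilities: [gpu]
```

With these lines commented out, Compose starts the service without requesting an NVIDIA device, so no GPU runtime is needed.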
## Classifiers
Our classifiers (currently only using TLINK) are accessible at https://huggingface.co/HealthNLP. By default the code downloads and loads the TLINK classifier from the Huggingface repository.
## High-level system description
Each document is annotated with paragraphs, sentences, and tokens by cTAKES. The cTAKES dictionary module searches over the tokens for spans which match chemotherapy mentions in the gold data annotations (in this regard we are using gold entities for chemos, although *not* for time expressions). Then a cTAKES SVM-based tagger finds token spans which correspond to temporal expressions, and we use Timenorm to normalize them to ISO format. Next, we create instances of chemotherapy and normalized date pairs and pass them to a PubMedBERT-based classifier which identifies the temporal relationship between the paired mentions. Finally, the code outputs a file with all the classified instances organized by patient and filename, with unique identifiers for each chemotherapy mention and temporal expression.
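The pairing step can be sketched as follows. This is a minimal illustration with invented records and a hypothetical character-distance window; the real system's candidate-pair criteria live in the cTAKES pipeline.

```python
from itertools import product

# Invented minimal records: chemo mentions as (text, char_offset) and time
# expressions as (text, char_offset, normalized ISO value). In the real
# system these come from cTAKES and Timenorm respectively.
chemos = [("cisplatin", 120), ("carboplatin", 480)]
timexes = [("June 3rd", 100, "2010-06-03"), ("last week", 470, "2010-W22")]

def make_instances(chemos, timexes, window=200):
    """Pair each chemotherapy mention with each nearby time expression.
    The distance window is a hypothetical stand-in for the real system's
    pairing criteria."""
    instances = []
    for (chemo, c_off), (timex, t_off, iso) in product(chemos, timexes):
        if abs(c_off - t_off) <= window:
            instances.append({"chemo": chemo, "timex": timex, "normalized": iso})
    return instances

for inst in make_instances(chemos, timexes):
    print(inst)
```

Each resulting instance is what the classifier sees: one chemotherapy mention, one normalized time expression, and (in the real system) the surrounding text.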
## Overview of Docker dependencies
There are three mounted directories:
- *Input*: The collection of notes in a shared task cancer type cohort
- *Processing*: Timeline information extraction over each note within cTAKES, aggregation of results by patient identifier
- *Output*: Aggregated unsummarized timelines information in a `tsv` file
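A minimal sketch of producing such a `tsv` follows; the column names here are invented for illustration and are not the shared task's actual schema.

```python
import csv
import io

# Hypothetical rows keyed by patient and source note; the real output's
# columns are defined by the shared task, not by this sketch.
rows = [
    {"patient_id": "patient01", "note_name": "note_a.txt",
     "chemo_text": "cisplatin", "tlink": "CONTAINS",
     "normalized_timex": "2010-06-03"},
]

buf = io.StringIO()  # stand-in for the output file in the mounted directory
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```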
## Build a Docker image
Under the project root directory (you may need to use `sudo`) run:
```
docker compose build --no-cache
```

Then run the system with:

```
docker compose up
```
## Critical operation instruction
Due to a current issue in the interprocess communication, the process will finish writing but not close itself. So when you see `Writing results for ...` followed by `Finished writing ...`, close the process via `ctrl+c`. This is the case whether you run the system inside or outside of a Docker image.
## Running the system outside of a Docker image
This is for the most part how we actually ran the system during development, and it can be resorted to in the event of issues with creating or running a Docker image. Use the following steps for setup:
- Make sure you have Java JDK 8 (we use OpenJDK) and the latest version of Maven installed, and that Java 8 is set as your default system Java
- Create a Python 3.9 conda environment with `conda create -n timelines python=3.9`
- Change directory into `timelines` under the project root
- Create an ActiveMQ broker named `mybroker` in your current directory via:
- Finally, create the executable Java JARs with Maven:
```
mvn -U clean package
```
If you run into issues with Torch, you might want to look into finding the Torch setup [most appropriate for your configuration](https://pytorch.org/get-started/locally/) and install it via `conda`.
Finally, assuming everything compiled and your *input* folder is populated, you can run the system via:
The `org.apache.ctakes.core.pipeline.PiperFileRunner` class is the entry point. `-a mybroker` points to the ActiveMQ broker for the process (you can see how to set one up in the Dockerfile).