Commit 8410952: "Wording and typo fixes"
1 parent 2e42864


README.md

Lines changed: 18 additions & 17 deletions
@@ -6,7 +6,7 @@ Dockerizable source code for the baseline system for the [Chemotherapy Treatment
 
 This is research code which depends on other research code, none of which is shrink-wrapped. Run at your own risk and do not use in any kind of clinical decision-making context.
 
-While operational there are known bugs in the code's dependencies which are still being resolved.
+While operational, there are known issues in the code's dependencies which are still being resolved.
 
 ## Core dependencies
 
@@ -18,15 +18,15 @@ There are three main separate software packages that this code uses:
 
 cTAKES contains several tools for text engineering and information extraction with a focus on clinical text; it makes heavy use of [Apache UIMA](https://uima.apache.org).
 Within cTAKES, the main module which drives this code is the cTAKES [Python Bridge to Java](https://github.com/apache/ctakes/tree/main/ctakes-pbj).
-While cTAKES is written in Java, the Python Bridge to Java (*ctakes-pbj*) allows for use of python code to process text artifacts the same way one can do with Java within cTAKES. *ctakes-pbj* accomplishes this by passing text artifacts and their annotated information between the relevant Java and Python processes using [DKPro cassis](https://github.com/dkpro/dkpro-cassis) for serialization, [Apache ActiveMQ](https://activemq.apache.org) for message brokering, and [stomp.py](https://github.com/jasonrbriggs/stomp.py) for Python-side receipt from and transmission to ActiveMQ.
+While cTAKES is written in Java, the Python Bridge to Java (*ctakes-pbj*) allows use of Python code to process text artifacts the same way one can with Java code in cTAKES. *ctakes-pbj* accomplishes this by passing text artifacts and their extracted information between the relevant Java and Python processes using [DKPro cassis](https://github.com/dkpro/dkpro-cassis) for serialization, [Apache ActiveMQ](https://activemq.apache.org) for message brokering, and [stomp.py](https://github.com/jasonrbriggs/stomp.py) for Python-side receipt from and transmission to ActiveMQ.
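As a conceptual sketch of the hand-off this paragraph describes: the Java side serializes an annotated text artifact, a broker carries the message, and the Python side deserializes it. Here a stdlib queue stands in for ActiveMQ and plain JSON stands in for the CAS serialization; every name and annotation below is invented for illustration, not taken from the real system.

```python
import json
import queue

# Stand-in for an ActiveMQ destination; the real system uses stomp.py to
# send to and receive from the broker.
broker = queue.Queue()

def java_side_send(text: str, annotations: list) -> None:
    """Pretend Java side: serialize the text artifact plus its annotations."""
    broker.put(json.dumps({"text": text, "annotations": annotations}))

def python_side_receive() -> dict:
    """Pretend Python side: consume and deserialize one artifact."""
    return json.loads(broker.get())

java_side_send("cisplatin started last week",
               [{"type": "chemo", "begin": 0, "end": 9}])
artifact = python_side_receive()
print(artifact["annotations"][0]["type"])  # chemo
```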
 
-Timenorm provides methods for identifying normalizing date and time expressions. We use a customized version (included as a maven module) where we change a heuristic for approximate dates.
+Timenorm provides methods for identifying and normalizing date and time expressions. We use a customized version (included as a Maven module) where we change a heuristic for approximate dates to better address the needs of the timelines project.
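To make the normalization step concrete, here is a toy illustration of the kind of mapping Timenorm performs. This is not Timenorm's API (Timenorm is Scala/Java); the function name and the handful of rules below are invented, with relative expressions anchored to a document creation time (DCT).

```python
from datetime import date, timedelta

def normalize(expression: str, dct: date) -> str:
    """Toy stand-in for Timenorm: map a few relative expressions to ISO 8601."""
    if expression == "today":
        return dct.isoformat()
    if expression == "yesterday":
        return (dct - timedelta(days=1)).isoformat()
    if expression == "last week":
        # One plausible heuristic for an approximate date: the ISO week before DCT.
        prior = dct - timedelta(weeks=1)
        year, week, _ = prior.isocalendar()
        return f"{year}-W{week:02d}"
    raise ValueError(f"unhandled expression: {expression}")

print(normalize("yesterday", date(2013, 5, 17)))  # 2013-05-16
print(normalize("last week", date(2013, 5, 17)))  # 2013-W19
```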
 
-We used Huggingface Transformers for training the TLINK model, and use their [Pipelines interface](https://huggingface.co/docs/transformers/main_classes/pipelines) for loading the model for inference.
+We used Huggingface Transformers for training the TLINK model, and we use their [Pipelines interface](https://huggingface.co/docs/transformers/main_classes/pipelines) to load the model for inference. We use the [Huggingface Hub](https://huggingface.co/HealthNLP) for model storage.
 
 ## Recommended Hardware
 
-A CUDA capable GPU is preferable for running the TLINK classifier, but with sufficient memory the model can be run on CPU. Outside of the Docker nothing needs to be done to effect this change, if however you want to run the Docker on a machine with no GPU ( or to disable GPU use ) then comment out the following lines in `docker-compose.yml`:
+A CUDA-capable GPU with at least 500 MB of VRAM is preferred for running the TLINK classifier, but with sufficient system RAM the model can be run on CPU. Outside of Docker nothing needs to be done to effect this change; if, however, you want to run the Docker image on a machine with no GPU (or to disable GPU use), comment out the following lines in `docker-compose.yml`:
 ```
 deploy:
   resources:
@@ -36,14 +36,15 @@ A CUDA capable GPU is preferable for running the TLINK classifier, but with suff
          count: 1
          capabilities: [gpu]
 ```
+This also means you would not need the NVIDIA Container Toolkit.
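The commented-out lines above are the tail of a Compose GPU device reservation. For orientation only, a typical full block is sketched below; the lines between `resources:` and `count: 1` are not visible in this diff, so everything other than the four lines shown above is an assumption based on standard Compose syntax, not a copy of this repository's `docker-compose.yml`:

```
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia   # assumed: these middle lines are not shown in the diff
          count: 1
          capabilities: [gpu]
```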
 
 ## Classifiers
 
-Our classifiers (currently only using TLINK) are accessible at https://huggingface.co/HealthNLP. By default the code downloads and loads the TLINK classifier from the Huggingface page.
+Our classifiers (currently only TLINK) are accessible at https://huggingface.co/HealthNLP. By default the code downloads and loads the TLINK classifier from the Huggingface repository.
 
 ## High-level system description
 
-Each document is annotated with paragraphs, sentences, and tokens by cTAKES. The cTAKES dictionary module searches over the tokens for spans which match chemotherapy mentions in the annotated gold (in this regard we are using gold entities for chemos, although *not* for time expressions). Then a cTAKES SVM-based tagger finds token spans which correspond to temporal expressions, and we use Timenorm to normalize them to ISO format. Finally, we create instances of chemotherapy and temporal expression pairs, and pass them to a PubMedBert-based classifier which identifies the temporal relationship between them as events. Finally the code outputs a file with all the classified instances organized by patient and filename, with unique identifiers for each chemotherapy mention and temporal expression.
+Each document is annotated with paragraphs, sentences, and tokens by cTAKES. The cTAKES dictionary module searches over the tokens for spans which match chemotherapy mentions in the gold data annotations (in this regard we are using gold entities for chemos, although *not* for time expressions). A cTAKES SVM-based tagger then finds token spans which correspond to temporal expressions, and we use Timenorm to normalize them to ISO format. We then create instances of chemotherapy and normalized date pairs, and pass them to a PubMedBert-based classifier which identifies the temporal relationship between the paired mentions. Finally, the code outputs a file with all the classified instances organized by patient and filename, with unique identifiers for each chemotherapy mention and temporal expression.
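The pairing step described above can be sketched as a simple cross product. This is a hedged illustration: the real system's candidate windowing and instance formatting may differ, and the mention strings below are invented.

```python
from itertools import product

# Invented example mentions: chemotherapy spans and normalized time
# expressions recovered from one document.
chemo_mentions = ["cisplatin", "carboplatin"]
normed_timexes = ["2013-05-16", "2013-W19"]

# One classification instance per (chemo, timex) pair; a PubMedBert-based
# classifier then labels the temporal relation for each pair.
instances = list(product(chemo_mentions, normed_timexes))
print(len(instances))  # 4
```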
 
 
 ## Overview of Docker dependencies
@@ -56,13 +57,13 @@ Each document is annotated with paragraphs, sentences, and tokens by cTAKES. Th
 
 There are three mounted directories:
 
-- *Input*: The collection of notes in a shared task cancer type cohort
+- *Input*: The collection of notes in a shared-task cancer-type cohort
 - *Processing*: Timeline information extraction over each note within cTAKES, aggregation of results by patient identifier
-- *Output*: Aggregated unsummarized timelines information in a `tsv` file
+- *Output*: Aggregated, unsummarized timelines information in a `tsv` file
 
 ## Build a Docker image
 
-Under the project root directory ( you may need to use `sudo` ):
+Under the project root directory, run (you may need to use `sudo`):
 
 ```
 docker compose build --no-cache
@@ -77,12 +78,12 @@ docker compose up
 ```
 ## Critical operation instruction
 
-Due to a current bug in the inter process communication, the process will finish writing but not close itself. So when you see `Writing results for ...` followed by `Finished writing...` close the process via `ctrl+c`
+Due to a current issue in the inter-process communication, the process will finish writing but not close itself. So when you see `Writing results for ...` followed by `Finished writing...`, close the process via `ctrl+c`. This is the case whether you run the system inside or outside of a Docker image.
 
 ## Running the system outside of a Docker image
 
 This is for the most part how we actually ran the system during development, and it can be resorted to in the event of issues with creating or running a Docker image. Use the following steps for setup:
-- Make sure you have Java JDK 8 and the latest version of maven installed (we use OpenJDK) and is set as your default Java
+- Make sure you have Java JDK 8 (we use OpenJDK) and the latest version of Maven installed, and that Java 8 is set as your default system Java
 - Create a conda 3.9 environment with `conda create -n timelines python=3.9`
 - Change directory into `timelines` under the project root
 - Create an ActiveMQ broker named `mybroker` in your current directory via:
@@ -91,15 +92,15 @@ curl -LO https://archive.apache.org/dist/activemq/activemq-artemis/2.19.1/apache
 unzip apache-artemis-2.19.1-bin.zip && \
 apache-artemis-2.19.1/bin/artemis create mybroker --user deepphe --password deepphe --allow-anonymous
 ```
-- (temporary fix until we fix the PBJ and timelines dependencies issue) Install the Python dependencies via:
+- (temporary fix until we resolve the PBJ and timelines dependencies issue) Install the system's Python dependencies via:
 ```
 pip install stomp.py dkpro-cassis transformers[torch] pandas tomli setuptools
 ```
 - Finally, create the executable Java jars with Maven:
 ```
 mvn -U clean package
 ```
-If you run into issues with Torch, you might want to look into finding the Torch setup most appropriate for your configuration and install it via `conda`.
+If you run into issues with Torch, you might want to find the Torch setup [most appropriate for your configuration](https://pytorch.org/get-started/locally/) and install it via `conda`.
 
 Finally, assuming everything compiled and your *input* folder is populated, you can run the system via:
 ```
@@ -136,13 +137,13 @@ The file will have the columns:
 DCT patient_id chemo_text chemo_annotation_id normed_timex timex_annotation_id tlink note_name tlink_inst
 ```
 And each row corresponds to a TLINK classification instance from a given file. In each row:
-- The `DCT` cell will hold the document creation time/date of the file which is the source of the instance
+- The `DCT` cell will hold the document creation date of the file which is the source of the instance
 - The `patient_id` cell will hold the patient identifier of the file which is the source of the instance
 - The `chemo_text` cell will hold the raw text of the chemotherapy mention in the instance as it appears in the note
 - `chemo_annotation_id` assigns the chemotherapy mention in the previous cell a unique identifier (at the token rather than the type level)
 - `normed_timex` will hold the normalized version of the time expression in the tlink instance
 - `timex_annotation_id` assigns the time expression in the previous cell a unique identifier (at the token rather than the type level)
-- `note_name` holds the name of the corresponding file (technically redundant if your files correspond to specification)
+- `note_name` holds the name of the corresponding file
 - `tlink_inst` holds the full chemotherapy timex pairing instance that was fed to the classifier (mostly for debugging purposes)
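As a sketch of consuming this output, the `tsv` can be read with the stdlib `csv` module. The header below copies the documented columns; the single data row (including the `contains` tlink label) is invented for illustration.

```python
import csv
import io

# Header copied from the column list above; the one data row is made up.
sample = (
    "DCT\tpatient_id\tchemo_text\tchemo_annotation_id\tnormed_timex\t"
    "timex_annotation_id\ttlink\tnote_name\ttlink_inst\n"
    "2013-05-17\tpatient03\tcisplatin\tchemo_1\t2013-05-16\ttimex_1\t"
    "contains\tpatient03_note_2\t...\n"
)

with io.StringIO(sample) as handle:
    rows = list(csv.DictReader(handle, delimiter="\t"))

print(rows[0]["normed_timex"])  # 2013-05-16
```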
 
 
@@ -165,7 +166,7 @@ java -cp instance-generator/target/instance-generator-5.0.0-SNAPSHOT-jar-with-de
 -l org/apache/ctakes/dictionary/lookup/fast/bsv/Unified_Gold_Dev.xml \
 --pipPbj yes \
 ```
-The `org.apache.ctakes.core.pipeline.PiperFileRunner` class is the entry point. `-a mybroker` points to the ActiveMQ broker for the process (you can see how to set one up in the Dockerfile).
+The `org.apache.ctakes.core.pipeline.PiperFileRunner` class is the entry point. `-a mybroker` points to the ActiveMQ broker for the process (you can see how to set one up in the Dockerfile).
 
 ## The piper file
 