
Commit 2e42864 ("fixed newline formatting", parent e445ac5)

1 file changed: +19 / -39 lines

README.md

There are three main separate software packages that this code uses:
- [Huggingface Transformers](https://huggingface.co/docs/transformers/index)

cTAKES contains several tools for text engineering and information extraction with a focus on clinical text; it makes heavy use of [Apache UIMA](https://uima.apache.org).
Within cTAKES, the main module driving this code is the [Python Bridge to Java](https://github.com/apache/ctakes/tree/main/ctakes-pbj).
While cTAKES is written in Java, the Python Bridge to Java (*ctakes-pbj*) allows Python code to process text artifacts the same way one can with Java within cTAKES. *ctakes-pbj* accomplishes this by passing text artifacts and their annotated information between the relevant Java and Python processes, using [DKPro cassis](https://github.com/dkpro/dkpro-cassis) for serialization, [Apache ActiveMQ](https://activemq.apache.org) for message brokering, and [stomp.py](https://github.com/jasonrbriggs/stomp.py) for Python-side receipt from and transmission to ActiveMQ.

Timenorm provides methods for identifying and normalizing date and time expressions. We use a customized version (included as a Maven module) in which we change a heuristic for approximate dates.
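Timenorm itself does the heavy lifting (in Scala); as a toy illustration of what "normalizing to ISO format" means, here is a sketch in Python. The formats handled are purely illustrative, and real clinical time expressions need far more machinery:

```python
from datetime import datetime

def normalize_simple_date(text: str) -> str:
    """Normalize a few common date formats to ISO 8601 (YYYY-MM-DD).

    A toy stand-in for Timenorm: expressions like "three weeks after
    admission" require real temporal reasoning, not just parsing.
    """
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date expression: {text!r}")

print(normalize_simple_date("March 5, 2010"))  # 2010-03-05
```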

We used Huggingface Transformers for training the TLINK model, and use their [Pipelines interface](https://huggingface.co/docs/transformers/main_classes/pipelines) for loading the model for inference.

## Recommended Hardware

A CUDA-capable GPU is preferable for running the TLINK classifier, but with sufficient memory the model can be run on CPU. Outside of Docker nothing needs to be done to effect this change; if, however, you want to run the Docker image on a machine with no GPU (or to disable GPU use), then comment out the following lines in `docker-compose.yml`:
```
deploy:
  resources:
    ...
```

## Classifiers

Our classifiers (currently only using TLINK) are accessible at https://huggingface.co/HealthNLP. By default the code downloads and loads the TLINK classifier from the Huggingface page.

## High-level system description

Each document is annotated with paragraphs, sentences, and tokens by cTAKES. The cTAKES dictionary module searches over the tokens for spans which match chemotherapy mentions in the annotated gold (in this regard we are using gold entities for chemos, although *not* for time expressions). Then a cTAKES SVM-based tagger finds token spans which correspond to temporal expressions, and we use Timenorm to normalize them to ISO format. Next, we create instances of chemotherapy and temporal expression pairs and pass them to a PubMedBERT-based classifier which identifies the temporal relationship between them as events. Finally, the code outputs a file with all the classified instances organized by patient and filename, with unique identifiers for each chemotherapy mention and temporal expression.
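The pairing step can be sketched as follows. This is a minimal illustration, not the project's code: the `<e>`/`<t>` marker format and the function names are assumptions.

```python
from itertools import product

def mark_spans(sentence, spans):
    """Insert markers right-to-left so earlier character offsets stay valid."""
    out = sentence
    for (start, end), tag in sorted(spans, key=lambda p: p[0][0], reverse=True):
        out = out[:start] + f"<{tag}>" + out[start:end] + f"</{tag}>" + out[end:]
    return out

def make_tlink_instances(sentence, chemo_spans, timex_spans):
    """One classifier instance per (chemotherapy mention, time expression) pair."""
    return [mark_spans(sentence, [(chemo, "e"), (timex, "t")])
            for chemo, timex in product(chemo_spans, timex_spans)]

sent = "Started cisplatin on March 5, 2010"
print(make_tlink_instances(sent, [(8, 17)], [(21, 34)])[0])
# Started <e>cisplatin</e> on <t>March 5, 2010</t>
```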


## Overview of Docker dependencies
- [Docker Compose](https://docs.docker.com/compose/install/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

## Data directories

There are three mounted directories:

- *Input*: The collection of notes in a shared task cancer type cohort
- *Processing*: Timeline information extraction over each note within cTAKES, aggregation of results by patient identifier
- *Output*: Aggregated, unsummarized timeline information in a `tsv` file
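These three directories are bind-mounted into the container; in `docker-compose.yml` terms the mapping might look roughly like the following. The service name, host paths, and container mount points here are illustrative, not the repository's actual configuration:

```yaml
services:
  timelines:
    volumes:
      - ./input:/input            # notes for the cohort
      - ./processing:/processing  # per-note cTAKES intermediate results
      - ./output:/output          # aggregated, unsummarized timeline tsv
```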
## Build a Docker image

Under the project root directory (you may need to use `sudo`):

```
docker compose up
```
## Critical operation instruction

Due to a current bug in the inter-process communication, the process will finish writing but not close itself. So when you see `Writing results for ...` followed by `Finished writing...`, close the process via `Ctrl+C`.
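Until that bug is fixed, a small wrapper can watch the output and deliver the interrupt for you. This is a sketch under the assumptions that the `Finished writing` marker appears on the wrapped process's stdout and that you are on a POSIX system:

```python
import signal
import subprocess
import sys

def run_until_finished(cmd):
    """Run `cmd`, echo its output, and send SIGINT (the Ctrl+C signal)
    once the 'Finished writing' marker appears, because the process
    currently does not exit on its own."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        for line in proc.stdout:
            sys.stdout.write(line)
            if "Finished writing" in line:
                proc.send_signal(signal.SIGINT)
                break
    finally:
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode
```

For example, `run_until_finished(["docker", "compose", "up"])`; whether the SIGINT actually reaches the containerized process depends on your compose setup.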
## Running the system outside of a Docker image

This is for the most part actually how we ran the system during development, and it can be resorted to in the event of issues with creating or running a Docker image. Use the following steps for setup:
- Make sure you have Java JDK 8 (we use OpenJDK) and the latest version of Maven installed, and that JDK 8 is set as your default Java
- Create a Python 3.9 conda environment with `conda create -n timelines python=3.9`
- Change directory into `timelines` under the project root
- Create an ActiveMQ broker named `mybroker` in your current directory via:
```
curl -LO https://archive.apache.org/dist/activemq/activemq-artemis/2.19.1/apache-artemis-2.19.1-bin.zip && \
unzip apache-artemis-2.19.1-bin.zip && \
apache-artemis-2.19.1/bin/artemis create mybroker --user deepphe --password deepphe --allow-anonymous
```
- (temporary fix until we fix the PBJ and timelines dependencies issue) Install the Python dependencies via:
```
pip install stomp.py dkpro-cassis transformers[torch] pandas tomli setuptools
```
- Finally, create the executable Java jars with Maven:
```
mvn -U clean package
```
If you run into issues with Torch, you may need to find the Torch setup most appropriate for your configuration and install it via `conda`.

Finally, assuming everything compiled and your *input* folder is populated, you can run the system via:
```
java -cp instance-generator/target/instance-generator-5.0.0-SNAPSHOT-jar-with-de... \
-l org/apache/ctakes/dictionary/lookup/fast/bsv/Unified_Gold_Dev.xml \
--pipPbj yes \
```
Assuming successful processing, your `tsv` file should be in the *output* folder of the project root directory. Finally, stop ActiveMQ via:
```
mybroker/bin/artemis stop
```
…folder will take the form of a collection of notes comprising all the patients of…
```
<patient identifier>_<four digit year>_<two digit month>_<two digit date>
```
Where the year, month, and date correspond to the creation time of the file. All the files in the shared task dataset follow this schema, so for our data there is nothing you need to do.
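As an illustration of the schema, here is a hypothetical helper (not part of the repository) that splits such a file name back into its parts:

```python
import re
from datetime import date

# Note file names look like <patient identifier>_<YYYY>_<MM>_<DD>.
NOTE_NAME = re.compile(r"^(?P<patient>.+)_(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})$")

def parse_note_name(stem: str):
    """Return (patient identifier, creation date) for a note file name."""
    m = NOTE_NAME.match(stem)
    if m is None:
        raise ValueError(f"file name does not follow the schema: {stem!r}")
    return m["patient"], date(int(m["y"]), int(m["m"]), int(m["d"]))

print(parse_note_name("patient03_2010_03_05"))
# ('patient03', datetime.date(2010, 3, 5))
```

Because the patient identifier is matched greedily, identifiers that themselves contain underscores are still handled.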

Assuming successful processing, the output file will be a tab-separated value (`tsv`) file in the `output` folder.
The file will have the columns:
And each row corresponds to a TLINK classification instance from a given file.
- `tlink_inst` holds the full chemotherapy-timex pairing instance that was fed to the classifier (mostly for debugging purposes)
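Since the output is plain `tsv`, downstream grouping by patient is straightforward. A sketch with hypothetical column names (only `tlink_inst` is named above; check the actual header row of your output file):

```python
import csv
import io
from collections import defaultdict

# Column names other than `tlink_inst` are hypothetical placeholders.
SAMPLE = (
    "patient_id\tnote_name\ttlink_inst\n"
    "pat1\tpat1_2010_03_05\tstarted <e>cisplatin</e> on <t>2010-03-05</t>\n"
    "pat1\tpat1_2010_04_02\tresumed <e>cisplatin</e> on <t>2010-04-02</t>\n"
)

def rows_by_patient(tsv_text):
    """Group classified TLINK instances by patient identifier."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        grouped[row["patient_id"]].append(row)
    return dict(grouped)

print(len(rows_by_patient(SAMPLE)["pat1"]))  # 2
```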


## Architecture

We use two Maven modules: one for the Java and Python annotators relevant to processing the clinical notes, and the other containing the customized version of Timenorm. There are not many files, and their classpaths are not especially important for understanding, but more so for customization.
```
set TimelinesSecondStep=timelines.timelines_pipeline
add PythonRunner Command="-m $TimelinesSecondStep -rq JavaToPy -o $OutputDirectory"
```
This starts the Python annotator and has it wait on the ActiveMQ Artemis receive queue for incoming CASes.
```
set minimumSpan=2
set exclusionTags=""
...
load DefaultTokenizerPipeline
add ContextDependentTokenizerAnnotator
load DictionarySubPipe
```
`minimumSpan` and `exclusionTags` are both configuration parameters for the dictionary lookup module; we don't exclude any parts of speech for lookup and want only to retrieve spans of at least two characters. The `DefaultTokenizerPipeline` annotates each CAS for paragraphs, sentences, and tokens. The `ContextDependentTokenizerAnnotator` depends on annotated base tokens and identifies basic numerical expressions for dates and times. The `DictionarySubPipe` module loads the dictionary configuration XML provided with the `-l` tag in the execution of the main Jar file.
```
add BackwardsTimeAnnotator classifierJarPath=/org/apache/ctakes/temporal/models/timeannotator/model.jar
add DCTAnnotator
```

Sends the CASes which have been processed by the Java annotators to the Python annotator.

## Core Python processing annotator

The core Python logic is in the file:
```
timelines/instance-generator/src/user/resources/org/apache/ctakes/timelines/timelines_py/src/timelines/timelines_delegator.py
```
Like the Java annotators, the Python annotator implements a `process` method, which is the core driver of the annotator for processing each note's contents. The raw output for the whole cancer type cohort is collected and written to TSV on disk in the `collection_process_complete` method.
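Schematically, such an annotator pairs a per-note `process` with a final `collection_process_complete` flush. The following is a minimal sketch, not the actual class: the real annotator receives CAS objects via *ctakes-pbj* and writes its TSV to the output directory.

```python
import csv
import io

class TimelinesDelegator:
    """Sketch of a PBJ-style annotator: `process` handles one note,
    `collection_process_complete` flushes all collected rows at the end."""

    def __init__(self):
        self.rows = []

    def process(self, note_name, note_text):
        # The real annotator extracts chemo/timex instances from a CAS here;
        # we just record a placeholder row per note.
        self.rows.append({"note": note_name, "n_chars": len(note_text)})

    def collection_process_complete(self):
        # The real annotator writes a tsv file to disk; we return the text.
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["note", "n_chars"], delimiter="\t")
        writer.writeheader()
        writer.writerows(self.rows)
        return buf.getvalue()

ann = TimelinesDelegator()
ann.process("pat1_2010_03_05", "Started cisplatin.")
ann.process("pat1_2010_04_02", "Resumed cisplatin.")
print(ann.collection_process_complete().splitlines()[0])  # -> "note\tn_chars"
```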

## Questions and technical issues