* move examples to examples directory
* move models to models module
* create args and data modules
* add train command to cli
* update submodules tests
* wip train system refactor
* continue train_system refactor, other misc refactoring
* data processing, train system, logging
* model saving, logging, display, tests
* update ruff version
* simplify build process with newer uv version
* fix lint/format errors
* fix pre-commit error
* run pre-commit on all files in `make check`
* fix a few type issues
* simplify train system invocation from cli
* add logfile path to display
* move new tests to test directory
* move notebooks to examples
* add todos for evaluation and prediction
* fix some more type issues
* support python 12 and 13
* consistent naming for args classes
* remove todos
* remove redundant set call
* move old code to `cnlpt.legacy`, rework docs
* remove some refactoring comments
* restore legacy code
* fix REST api data preprocessing
* fix `cnlpt train` help message
* set default python version to 3.13 and test newer python versions in CI
* update README.md
* minor version bumps in lockfile
* set min torch version to 2.6
* drop python 3.13 support due to windows issue
* minor args refactoring
* fix pin memory warning
* only disable mps for tests in CI
* refactor metrics, include transformers logging in logfile
* display name of best checkpoint
* update example READMEs
* support averaging multiple selection metrics
* add some data tests
* add prediction and evaluation to CnlpTrainSystem
* close logfile in train system test
* shutdown logging after train system test, cache HF models in CI
* use close instead of shutdown
* file handler is on root logger, not train system logger
* oops
* add tokens to CnlpPredictions
* json serialization for CnlpPredictions
* ensure "None" label has id 0 for relations tasks
* overwrite predictions file if `overwrite_output_dir` is true
* add data.analysis module
* fix metric averaging
* extract relations in analysis dataframe
* fix chemprot preprocessing
* fix broken import
* fix a couple training display issues
* temporarily remove results and error analysis stuff from chemprot readme
* add a --version arg for docker builds
README.md (38 additions & 38 deletions)
@@ -11,23 +11,28 @@ Primary use cases include
 This library is _not_ intended to serve as a place for clinical NLP applications to live. If you build something cool that uses transformer models that take advantage of our model definitions, the best practice is probably to rely on it as a library rather than treating it as your workspace. This library is also not intended as a deployment-ready tool for _scalable_ clinical NLP. There is a lot of interest in developing methods and tools that are smaller and can process millions of records, and this library can potentially be used for research along those line. But it will probably never be extremely optimized or shrink-wrapped for applications. However, there should be plenty of examples and useful code for people who are interested in that type of deployment.

 ## Install
-> [!WARNING]
-macOS support is currently experimental. We recommend using python3.10 for macOS installations.

-> [!NOTE]
-When installing the library's dependencies, `pip` will probably install
-PyTorch with CUDA 10.2 support by default. If you would like to run the
-library in CPU-only mode or with a newer version of CUDA, [install PyTorch
-to your desired specifications](https://pytorch.org/get-started/locally/)
-in your virtual environment first before installing `cnlp-transformers`.
+> [!IMPORTANT]
+> When installing the library's dependencies, PyTorch will probably be installed
+> with CUDA 12.6 support by default on linux, and without CUDA support on other platforms.
+> If you would like to run the library in CPU-only mode or with a specific version of CUDA,
+> [install PyTorch to your desired specifications](https://pytorch.org/get-started/locally/)
+> in your virtual environment first before installing `cnlp-transformers`.
+> [See here](https://docs.astral.sh/uv/guides/integration/pytorch/#the-uv-pip-interface) if
+> using uv.

 ### Static installation

 If you are installing just to fine-tune or run the REST APIs,
-you can install without cloning:
+you can install without cloning using [uv](https://docs.astral.sh/uv/):
+
+```sh
+uv pip install cnlp-transformers
+```
+
+Or with pip:

 ```sh
-# Note: if needed, install PyTorch first (see above)
 pip install cnlp-transformers
 ```

@@ -110,18 +115,18 @@ We provided the following step-by-step examples how to finetune in clinical NLP

 ### Fine-tuning options

-Run ```python -m cnlpt.train_system -h``` to see all the available options. In addition to inherited Huggingface Transformers options, there are options to do the following:
+Run `cnlpt train -h` to see all the available options. In addition to inherited Huggingface Transformers options, there are options to do the following:

-* Select different models:```--model hier``` uses a hierarchical transformer layer on top of a specified encoder model. We recommend using a very small encoder:```--encoder microsoft/xtremedistil-l6-h256-uncased``` so that the full model fits into memory.
+* Select different models:`--model hier` uses a hierarchical transformer layer on top of a specified encoder model. We recommend using a very small encoder:`--encoder microsoft/xtremedistil-l6-h256-uncased` so that the full model fits into memory.
 * Run simple baselines (use ``--model cnn|lstm --tokenizer_name roberta-base``-- since there is no HF model then you must specify the tokenizer explicitly)
-* Use a different layer's CLS token for the classification (e.g., ```--layer 10```)
-* Probabilistically freeze weights of the encoder (leaving classifier weights all unfrozen) (```--freeze``` alone freezes all encoder weights, ```--freeze <float>``` when given a parameter between 0 and 1, freezes that percentage of encoder weights)
-* Classify based on a token embedding instead of the CLS embedding (```--token``` -- applies to the event/entity classification setting only, and requires the input to have xml-style tags (`<e>`, `</e>`) around the tokens representing the event/entity)
-* Use class-weighted loss function (```--class_weights```)
+* Use a different layer's CLS token for the classification (e.g., `--layer 10`)
+* Probabilistically freeze weights of the encoder (leaving classifier weights all unfrozen) (`--freeze` alone freezes all encoder weights, `--freeze <float>` when given a parameter between 0 and 1, freezes that percentage of encoder weights)
+* Classify based on a token embedding instead of the CLS embedding (`--token` -- applies to the event/entity classification setting only, and requires the input to have xml-style tags (`<e>`, `</e>`) around the tokens representing the event/entity)
+* Use class-weighted loss function (`--class_weights`)

 ## Running REST APIs

-There are existing REST APIs in the ```src/cnlpt/api``` folder for a few important clinical NLP tasks:
+There are existing REST APIs in the `src/cnlpt/api` folder for a few important clinical NLP tasks:

 1. Negation detection
 2. Time expression tagging (spans + time classes)
@@ -133,7 +138,7 @@ There are existing REST APIs in the ```src/cnlpt/api``` folder for a few importa
 To demo the negation API:

 1. Install the `cnlp-transformers` package.
-2. Run `cnlpt_negation_rest [-p PORT]`.
+2. Run `cnlpt rest --model-type negation [-p PORT]`.
 3. Open a python console and run the following commands:

 #### Setup variables for negation
@@ -167,7 +172,7 @@ The model correctly classifies both nausea and anosmia as negated.
 To demo the temporal API:

 1. Install the `cnlp-transformers` package.
-2. Run `cnlpt_temporal_rest [-p PORT]`
+2. Run `cnlpt rest --model-type temporal [-p PORT]`
 3. Open a python console and run the following commands to test:

 #### Setup variables for temporal
@@ -217,20 +222,14 @@ should return:

 This output indicates the token spans of events and timexes, and relations between events and timexes, where the suffixes are indices into the respective arrays (e.g., TIMEX-0 in a relation refers to the 0th time expression found, which begins at token 6 and ends at token 9 -- ["March 3, 2010"])

-To run only the time expression or event taggers, change the run command to:
-
-```uvicorn cnlpt.api.timex_rest:app --host 0.0.0.0``` or
-then run the same process commands as above (including the same URL). You will get similar json output, but only one of the dictionary elements (timexes or events) will be populated.
-
 ## Citing cnlp_transformers
+
 Please use the following bibtex to cite cnlp_transformers if you use it in a publication:
-```
+
+```latex
 @misc{cnlp_transformers,
 author = {CNLPT},
-title = {Clinical {NLP} {Transformers} (cnlp\_transformers)},
+title = {Clinical {NLP} {Transformers} (cnlp\_transformers)},
 year = {2021},
 publisher = {GitHub},
 journal = {GitHub repository},
@@ -239,14 +238,15 @@ Please use the following bibtex to cite cnlp_transformers if you use it in a pub
 ```

 ## Publications using cnlp_transformers
+
 Please send us any citations that used this library!

-1. Chen S, Guevara M, Ramirez N, Murray A, Warner JL, Aerts HJWL, et al. Natural Language Processing to Automatically Extract the Presence and Severity of Esophagitis in Notes of Patients Undergoing Radiotherapy. JCO Clin Cancer Inform. 2023 Jul;(7):e2300048.
-2. Li Y, Miller T, Bethard S, Savova G. Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information [Internet]. arXiv.org. 2024 [cited 2025 May 22]. Available from: https://arxiv.org/abs/2410.12774v1
-3. Wang L, Li Y, Miller T, Bethard S, Savova G. Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) [Internet]. Toronto, Canada: Association for Computational Linguistics; 2023 [cited 2025 May 22]. p. 15746–61. Available from: https://aclanthology.org/2023.acl-long.877/
-4. Miller T, Bethard S, Dligach D, Savova G. End-to-end clinical temporal information extraction with multi-head attention. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:313–9.
-5. Yoon W, Ren B, Thomas S, Kim C, Savova G, Hall MH, et al. Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction [Internet]. arXiv; 2025 [cited 2025 May 22]. Available from: http://arxiv.org/abs/2502.10388
-6. Wang L, Zipursky AR, Geva A, McMurry AJ, Mandl KD, Miller TA. A computable case definition for patients with SARS-CoV2 testing that occurred outside the hospital. JAMIA Open. 2023 Oct 1;6(3):ooad047.
-7. Bitterman DS, Goldner E, Finan S, Harris D, Durbin EB, Hochheiser H, et al. An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts. Int J Radiat Oncol Biol Phys. 2023 Sep 1;117(1):262–73.
-8. McMurry AJ, Gottlieb DI, Miller TA, Jones JR, Atreja A, Crago J, et al. Cumulus: A federated EHR-based learning system powered by FHIR and AI. medRxiv. 2024 Feb 6;2024.02.02.24301940.
-9. LCD benchmark: long clinical document benchmark on mortality prediction for language models | Journal of the American Medical Informatics Association | Oxford Academic [Internet]. [cited 2025 Jan 23]. Available from: https://academic.oup.com/jamia/article-abstract/32/2/285/7909835?redirectedFrom=fulltext
+1. Chen S, Guevara M, Ramirez N, Murray A, Warner JL, Aerts HJWL, et al. Natural Language Processing to Automatically Extract the Presence and Severity of Esophagitis in Notes of Patients Undergoing Radiotherapy. JCO Clin Cancer Inform. 2023 Jul;(7):e2300048.
+2. Li Y, Miller T, Bethard S, Savova G. Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information [Internet]. arXiv.org. 2024 [cited 2025 May 22]. Available from: <https://arxiv.org/abs/2410.12774v1>
+3. Wang L, Li Y, Miller T, Bethard S, Savova G. Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) [Internet]. Toronto, Canada: Association for Computational Linguistics; 2023 [cited 2025 May 22]. p. 15746–61. Available from: <https://aclanthology.org/2023.acl-long.877/>
+4. Miller T, Bethard S, Dligach D, Savova G. End-to-end clinical temporal information extraction with multi-head attention. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:313–9.
+5. Yoon W, Ren B, Thomas S, Kim C, Savova G, Hall MH, et al. Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction [Internet]. arXiv; 2025 [cited 2025 May 22]. Available from: <http://arxiv.org/abs/2502.10388>
+6. Wang L, Zipursky AR, Geva A, McMurry AJ, Mandl KD, Miller TA. A computable case definition for patients with SARS-CoV2 testing that occurred outside the hospital. JAMIA Open. 2023 Oct 1;6(3):ooad047.
+7. Bitterman DS, Goldner E, Finan S, Harris D, Durbin EB, Hochheiser H, et al. An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts. Int J Radiat Oncol Biol Phys. 2023 Sep 1;117(1):262–73.
+8. McMurry AJ, Gottlieb DI, Miller TA, Jones JR, Atreja A, Crago J, et al. Cumulus: A federated EHR-based learning system powered by FHIR and AI. medRxiv. 2024 Feb 6;2024.02.02.24301940.
+9. LCD benchmark: long clinical document benchmark on mortality prediction for language models | Journal of the American Medical Informatics Association | Oxford Academic [Internet]. [cited 2025 Jan 23]. Available from: <https://academic.oup.com/jamia/article-abstract/32/2/285/7909835?redirectedFrom=fulltext>
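
Note on the `--freeze` option described in the README diff: the text says a bare `--freeze` freezes all encoder weights and `--freeze <float>` freezes roughly that fraction. The sketch below illustrates one way such probabilistic freezing could work. It is not taken from the cnlp-transformers source: `Param` and `freeze_encoder` are hypothetical names, `Param` is a plain stand-in rather than a PyTorch parameter, and drawing once per parameter tensor (rather than per individual weight) is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a trainable encoder tensor (not a PyTorch object)."""

    name: str
    requires_grad: bool = True


def freeze_encoder(params, freeze=1.0, seed=0):
    """Freeze each parameter independently with probability `freeze`.

    freeze=1.0 mirrors a bare `--freeze` (everything frozen); a value between
    0 and 1 freezes roughly that fraction. Returns the names that were frozen.
    """
    if not 0.0 <= freeze <= 1.0:
        raise ValueError("freeze must be between 0 and 1")
    rng = random.Random(seed)  # seeded so repeated runs pick the same tensors
    frozen = []
    for p in params:
        # rng.random() lies in [0, 1), so 1.0 always freezes and 0.0 never does.
        if rng.random() < freeze:
            p.requires_grad = False
            frozen.append(p.name)
    return frozen
```

Classifier parameters would simply be left out of the list passed in, which matches the README's statement that classifier weights stay unfrozen.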
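
Note on the temporal API output described in the README diff: relation arguments carry a suffix that indexes into the corresponding span array, so `TIMEX-0` points at the 0th time expression. The sketch below shows how a client might resolve such a label. The exact JSON layout is not shown in this diff, so the list-of-`(begin, end)` shape, the `EVENT` prefix, and the function name `resolve_argument` are assumptions made for illustration only.

```python
def resolve_argument(label, timexes, events):
    """Map a label such as 'TIMEX-0' to its (begin, end) token span.

    `timexes` and `events` are assumed to be lists of (begin, end) token
    offsets, in the order the API returned them.
    """
    kind, _, index = label.rpartition("-")
    arrays = {"TIMEX": timexes, "EVENT": events}  # 'EVENT' prefix is assumed
    if kind not in arrays or not index.isdigit():
        raise ValueError(f"unrecognized argument label: {label!r}")
    return arrays[kind][int(index)]


# Hypothetical spans; only the TIMEX span (tokens 6 to 9) comes from the README text.
timexes = [(6, 9)]
events = [(2, 2)]
span = resolve_argument("TIMEX-0", timexes, events)
```

With the spans above, `span` is `(6, 9)`, the token range the README gives for "March 3, 2010".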