
Commit a5aa7d0

Unify model loading and REST APIs and improve CLI (#241)
* unify model loading
* rework args
* rework cli
* consolidate rest apis
* update tests
* update examples
* remove unused kwargs
* typo
* fix import
* fix inconsistencies in readme and CLI docs
* log hyperlink to interactive docs on REST startup
* temporarily disable tlinkx check in temporal test
* allow disjoint labels in generated testing datasets
* fix issue loading config model type
* convert encoder_layer arg when loading legacy models
* re-enable test
* update readme
* fix typo in example notebook
* save tokenizer with model and fall back to encoder tokenizer when loading for REST apps
1 parent 5c93d38 commit a5aa7d0


73 files changed: +4055 −4426 lines

README.md

Lines changed: 33 additions & 85 deletions
@@ -103,7 +103,7 @@ To use the library for fine-tuning, you'll need to take the following steps:
 
 Instance labels should be formatted the same way as in the csv/tsv example above; see specifically the formats for tagging and relations. The 'metadata' field can either be included in the train/dev/test files or as a separate metadata.json file.
 
-2. Run train_system.py with a ```--task_name``` from your data files and the ```--data-dir``` argument from Step 1. If no ```--task_name``` is provided, all tasks will be trained.
+2. Run train_system.py with a `--model-type` (one of `cnn`, `lstm`, `hier`, or `proj`) and a `--data-dir` (the path to the folder you created in step 1). Optionally specify one or more `--task` names to train on; by default, all tasks will be trained.
 
 ### Step-by-step finetuning examples
 
@@ -115,113 +115,61 @@ We provide the following step-by-step examples of how to fine-tune for clinical NLP tasks:
 
 ### Fine-tuning options
 
-Run `cnlpt train -h` to see all the available options. In addition to inherited Huggingface Transformers options, there are options to do the following:
+Run `cnlpt train --help` to see all the available options. In addition to inherited Huggingface Transformers options, there are options to do the following:
 
 * Select different models: `--model hier` uses a hierarchical transformer layer on top of a specified encoder model. We recommend using a very small encoder, e.g. `--encoder microsoft/xtremedistil-l6-h256-uncased`, so that the full model fits into memory.
-* Run simple baselines (use ``--model cnn|lstm --tokenizer_name roberta-base`` -- since there is no HF model then you must specify the tokenizer explicitly)
+* Run simple baselines (use `--model cnn|lstm --tokenizer roberta-base` -- since there is no underlying HF model, you must specify the tokenizer explicitly)
 * Use a different layer's CLS token for the classification (e.g., `--layer 10`)
 * Probabilistically freeze weights of the encoder, leaving all classifier weights unfrozen (`--freeze` alone freezes all encoder weights; `--freeze <float>`, given a value between 0 and 1, freezes that fraction of encoder weights)
 * Classify based on a token embedding instead of the CLS embedding (`--token` -- applies to the event/entity classification setting only, and requires the input to have xml-style tags (`<e>`, `</e>`) around the tokens representing the event/entity)
 * Use a class-weighted loss function (`--class_weights`)
 
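To make the probabilistic `--freeze <float>` option above concrete, here is a minimal sketch of the general idea, assuming PyTorch-style parameters with a `requires_grad` flag. This is illustrative only, not cnlp-transformers' actual implementation, and the function name is hypothetical:

```python
import random

def probabilistic_freeze(named_params, freeze_prob: float, seed: int = 0) -> int:
    """Freeze roughly `freeze_prob` of the given parameters; leave the rest trainable.

    `named_params` is an iterable of (name, param) pairs where each param has a
    `requires_grad` attribute (as in PyTorch). Returns the number frozen.
    Sketch of the idea behind `--freeze <float>`, not the library's actual code.
    """
    rng = random.Random(seed)  # seeded for reproducible freezing
    frozen = 0
    for _name, param in named_params:
        if rng.random() < freeze_prob:
            param.requires_grad = False  # excluded from gradient updates
            frozen += 1
    return frozen
```

In the real CLI, the classifier head stays trainable and only encoder parameters are candidates for freezing.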
 
 ## Running REST APIs
 
-There are existing REST APIs in the `src/cnlpt/api` folder for a few important clinical NLP tasks:
+This library supports serving a REST API for your model with a single `/process` endpoint to process text and generate predictions, via the `cnlpt rest` command.
 
-1. Negation detection
-2. Time expression tagging (spans + time classes)
-3. Event detection (spans + document creation time relation)
-4. End-to-end temporal relation extraction (event spans+DTR+timex spans+time classes+narrative container [CONTAINS] relation extraction)
+Run `cnlpt rest --help` to see available options. The only required option is `--model`, which must be either a HuggingFace repository or a local directory containing your model. By default, the model will be served at [http://localhost:8000](http://localhost:8000).
 
-### Negation API
+For example, to run our negation detection model from HuggingFace:
 
-To demo the negation API:
-
-1. Install the `cnlp-transformers` package.
-2. Run `cnlpt rest --model-type negation [-p PORT]`.
-3. Open a python console and run the following commands:
-
-#### Setup variables for negation
-
-```ipython
->>> import requests
->>> process_url = 'http://hostname:8000/negation/process' ## Replace hostname with your host name
-```
-
-#### Prepare the document
-
-```ipython
->>> sent = 'The patient has a sore knee and headache but denies nausea and has no anosmia.'
->>> ents = [[18, 27], [32, 40], [52, 58], [70, 77]]
->>> doc = {'doc_text':sent, 'entities':ents}
+```bash
+cnlpt rest --model mlml-chip/negation_pubmedbert_sharpseed
 ```
 
-#### Process the document
-
-```ipython
->>> r = requests.post(process_url, json=doc)
->>> r.json()
-```
-
-Output: `{'statuses': [-1, -1, 1, 1]}`
-
-The model correctly classifies both nausea and anosmia as negated.
-
-### Temporal API (End-to-end temporal information extraction)
-
-To demo the temporal API:
-
-1. Install the `cnlp-transformers` package.
-2. Run `cnlpt rest --model-type temporal [-p PORT]`
-3. Open a python console and run the following commands to test:
-
-#### Setup variables for temporal
+Once the application is running, you can either interact with it via the web interface at [http://localhost:8000/docs](http://localhost:8000/docs), or manually send requests to the `/process` endpoint:
 
 ```ipython
 >>> import requests
 >>> from pprint import pprint
->>> process_url = 'http://hostname:8000/temporal/process_sentence' ## Replace hostname with your host name
+>>> sent = "The patient has a sore knee and headache but denies nausea and has no anosmia."
+>>> ents = [(18, 27), (32, 40), (52, 58), (70, 77)]
+>>> doc = {"text": sent, "entity_spans": ents}
+>>> resp = requests.post("http://localhost:8000/process", json=doc)
+>>> pprint(resp.json())
+[{'Negation': {'prediction': '-1',
+               'probs': {'-1': 0.9997619986534119, '1': 0.0002379878715146333}},
+  'text': 'The patient has a <e>sore knee</e> and headache but denies nausea '
+          'and has no anosmia.'},
+ {'Negation': {'prediction': '-1',
+               'probs': {'-1': 0.9995606541633606, '1': 0.0004393413255456835}},
+  'text': 'The patient has a sore knee and <e>headache</e> but denies nausea '
+          'and has no anosmia.'},
+ {'Negation': {'prediction': '1',
+               'probs': {'-1': 0.007858583703637123, '1': 0.9921413660049438}},
+  'text': 'The patient has a sore knee and headache but denies <e>nausea</e> '
+          'and has no anosmia.'},
+ {'Negation': {'prediction': '1',
+               'probs': {'-1': 0.0071166763082146645, '1': 0.9928833246231079}},
+  'text': 'The patient has a sore knee and headache but denies nausea and has '
+          'no <e>anosmia</e>.'}]
 ```
 
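The `entity_spans` in the request above are character offsets into the text. If you are building requests programmatically, one simple way to derive spans from known mention strings is a left-to-right search (a sketch; in practice the spans can come from any upstream entity recognizer):

```python
def char_spans(text: str, mentions: list[str]) -> list[tuple[int, int]]:
    """Find (start, end) character offsets of each mention in `text`.

    Searches left to right, so repeated mentions map to successive offsets.
    Raises ValueError if a mention is not found after the previous one.
    """
    spans = []
    cursor = 0
    for mention in mentions:
        start = text.index(mention, cursor)  # search begins after the last match
        end = start + len(mention)
        spans.append((start, end))
        cursor = end
    return spans

sent = "The patient has a sore knee and headache but denies nausea and has no anosmia."
spans = char_spans(sent, ["sore knee", "headache", "nausea", "anosmia"])
# spans == [(18, 27), (32, 40), (52, 58), (70, 77)]
```

These are exactly the offsets used in the example request above.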
-#### Prepare and process the document
+You can also serve multiple models at once by providing a router prefix for each model, e.g.:
 
-```ipython
->>> sent = 'The patient was diagnosed with adenocarcinoma March 3, 2010 and will be returning for chemotherapy next week.'
->>> r = requests.post(process_url, json={'sentence':sent})
->>> pprint(r.json())
+```bash
+cnlpt rest --model /negation=mlml-chip/negation_pubmedbert_sharpseed --model /temporal=mlml-chip/thyme2_colon_e2e
 ```
 
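When serving multiple models this way, each prefix presumably gets its own copy of the endpoint (e.g. `/negation/process`). A tiny helper for building the request URLs might look like the following; note that the prefixed URL layout is an assumption based on the CLI example above, and `process_url` is a hypothetical name:

```python
def process_url(base: str, prefix: str = "") -> str:
    """Build the /process URL for a model served under an optional router prefix.

    With no prefix (single-model serving) this is just `<base>/process`;
    with a prefix like "/negation" it becomes `<base>/negation/process`.
    The prefixed layout is an assumption based on the CLI example above.
    """
    return f"{base}{prefix}/process"

# process_url("http://localhost:8000") == "http://localhost:8000/process"
# process_url("http://localhost:8000", "/negation") == "http://localhost:8000/negation/process"
```

You would then post documents with e.g. `requests.post(process_url(base, "/negation"), json=doc)`.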
-should return:
-
-```json
-{
-  "events": [
-    [
-      {"begin": 3, "dtr": "BEFORE", "end": 3},
-      {"begin": 5, "dtr": "BEFORE", "end": 5},
-      {"begin": 13, "dtr": "AFTER", "end": 13},
-      {"begin": 15, "dtr": "AFTER", "end": 15}
-    ]
-  ],
-  "relations": [
-    [
-      {"arg1": "TIMEX-0", "arg2": "EVENT-0", "category": "CONTAINS"},
-      {"arg1": "EVENT-2", "arg2": "EVENT-3", "category": "CONTAINS"},
-      {"arg1": "TIMEX-1", "arg2": "EVENT-2", "category": "CONTAINS"},
-      {"arg1": "TIMEX-1", "arg2": "EVENT-3", "category": "CONTAINS"}
-    ]
-  ],
-  "timexes": [
-    [
-      {"begin": 6, "end": 9, "timeClass": "DATE"},
-      {"begin": 16, "end": 17, "timeClass": "DATE"}
-    ]
-  ]
-}
-```
-
-This output indicates the token spans of events and timexes, and relations between events and timexes, where the suffixes are indices into the respective arrays (e.g., TIMEX-0 in a relation refers to the 0th time expression found, which begins at token 6 and ends at token 9 -- ["March 3, 2010"])
-
 ## Citing cnlp_transformers
 
 Please use the following bibtex to cite cnlp_transformers if you use it in a publication:

docker/model_download.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 from transformers.models.auto.tokenization_auto import AutoTokenizer
 
 from cnlpt.legacy.train_system import is_hub_model
-from cnlpt.models import CnlpModelForClassification, HierarchicalModel
+from cnlpt.modeling import CnlpModelForClassification, HierarchicalModel
 
 
 def pre_initialize_cnlpt_model(model_name, cuda=True, batch_size=8):

examples/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+*/dataset/
+*/train_output/

examples/chemprot/README.md

Lines changed: 11 additions & 13 deletions
@@ -1,24 +1,22 @@
 # Fine-tuning for tagging: End-to-end example
 
-1. Preprocess the data with `uv run examples/chemprot/prepare_chemprot_dataset.py data/chemprot`
+1. Preprocess the data with `uv run examples/chemprot/prepare_chemprot_dataset.py`
 
-2. Fine-tune with something like:
+2. Fine-tune for NER with something like:
 
 ```bash
-cnlpt train \
-  --task_name chemical_ner gene_ner \
-  --data_dir data/chemprot \
-  --encoder_name allenai/scibert_scivocab_uncased \
-  --do_train \
-  --do_eval \
-  --cache_dir cache/ \
-  --output_dir temp/ \
+uv run cnlpt train \
+  --model_type proj \
+  --encoder allenai/scibert_scivocab_uncased \
+  --data_dir ./dataset \
+  --task chemical_ner --task gene_ner \
+  --output_dir ./train_output \
   --overwrite_output_dir \
-  --num_train_epochs 50 \
+  --do_train --do_eval \
+  --num_train_epochs 3 \
   --learning_rate 2e-5 \
   --lr_scheduler_type constant \
-  --report_to none \
-  --save_strategy no \
+  --save_strategy best \
   --gradient_accumulation_steps 1 \
   --eval_accumulation_steps 10 \
   --weight_decay 0.2
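For tagging tasks like `chemical_ner` and `gene_ner`, the preprocessed TSV columns hold token-level BIO labels. As a generic sketch of how character-level entity spans become BIO tags (illustrative only, not the example's actual preprocessing code):

```python
def bio_tags(tokens: list[str], token_starts: list[int],
             spans: list[tuple[int, int]], label: str) -> list[str]:
    """Assign B-/I-/O tags to tokens given entity character spans.

    `token_starts` gives each token's starting character offset in the
    original text. A token is tagged if it lies entirely within a span;
    the first such token gets B-, the rest get I-. Illustrative sketch
    of the BIO scheme, not the example's actual code.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        inside = False
        for i, tok_start in enumerate(token_starts):
            tok_end = tok_start + len(tokens[i])
            if tok_start >= start and tok_end <= end:
                tags[i] = f"I-{label}" if inside else f"B-{label}"
                inside = True
    return tags

text = "Aspirin inhibits prostaglandin synthesis"
tokens = text.split()
starts = [0, 8, 17, 31]  # each token's starting character offset
# "prostaglandin synthesis" spans characters 17-40:
# bio_tags(tokens, starts, [(17, 40)], "chemical") == ["O", "O", "B-chemical", "I-chemical"]
```

The real dataset stores one space-separated tag sequence per instance, one column per task.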

examples/chemprot/preprocess_chemprot.py

Lines changed: 13 additions & 15 deletions
@@ -1,19 +1,24 @@
 import bisect
 import itertools
-import os
 import re
 from dataclasses import dataclass
-from sys import argv
-from typing import Any, Union
+from pathlib import Path
+from typing import Any
 
 import polars as pl
 from datasets import load_dataset
 from datasets.dataset_dict import Dataset, DatasetDict
+from datasets.utils import disable_progress_bars, enable_progress_bars
 from rich.console import Console
 
 
 def load_chemprot_dataset(cache_dir="./cache") -> DatasetDict:
-    return load_dataset("bigbio/chemprot", "chemprot_full_source", cache_dir=cache_dir)
+    disable_progress_bars()
+    dataset = load_dataset(
+        "bigbio/chemprot", "chemprot_full_source", cache_dir=cache_dir
+    )
+    enable_progress_bars()
+    return dataset
 
 
 def clean_text(text: str):
@@ -156,25 +161,18 @@ def preprocess_data(split: Dataset):
     )
 
 
-def main(out_dir: Union[str, os.PathLike]):
+if __name__ == "__main__":
     console = Console()
-
-    if not os.path.isdir(out_dir):
-        os.mkdir(out_dir)
+    out_dir = Path(__file__).parent / "dataset"
+    out_dir.mkdir(exist_ok=True)
 
     with console.status("Loading dataset...") as st:
         dataset = load_chemprot_dataset()
         for split in ("train", "test", "validation"):
             st.update(f"Preprocessing {split} data...")
             preprocessed = preprocess_data(dataset[split])
-            preprocessed.write_csv(
-                os.path.join(out_dir, f"{split}.tsv"), separator="\t"
-            )
+            preprocessed.write_csv(out_dir / f"{split}.tsv", separator="\t")
 
     console.print(
        f"[green i]Preprocessed chemprot data saved to [repr.filename]{out_dir}[/]."
    )
-
-
-if __name__ == "__main__":
-    main(argv[1])
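The script above imports `bisect`, the standard tool for mapping entity character offsets onto token indices during this kind of preprocessing. A generic sketch of that technique (not the script's actual code; the names here are illustrative):

```python
import bisect

def char_to_token(token_starts: list[int], char_offset: int) -> int:
    """Return the index of the token containing `char_offset`.

    `token_starts` must be the sorted list of each token's starting
    character offset. bisect_right finds the first start strictly greater
    than the offset; the containing token is the one just before it.
    """
    return bisect.bisect_right(token_starts, char_offset) - 1

text = "Aspirin inhibits COX-1 irreversibly"
tokens = text.split()
starts = []
pos = 0
for tok in tokens:
    pos = text.index(tok, pos)  # locate each token in the original text
    starts.append(pos)
    pos += len(tok)

# "COX-1" starts at character 17, which falls in token index 2:
# char_to_token(starts, 17) == 2
```

This turns an O(n) scan per entity into an O(log n) lookup, which matters when a corpus has many entity annotations per document.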

examples/uci_drug/README.md

Lines changed: 55 additions & 63 deletions
@@ -1,74 +1,66 @@
-### Fine-tuning for classification: End-to-end example
+# Drug Review Sentiment Classification
 
-1. Download data from [Drug Reviews (Druglib.com) Data Set](https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com) to `data` folder and extract. Pay attention to their terms:
-    1. only use the data for research purposes
-    2. don't use the data for any commerical purposes
-    3. don't distribute the data to anyone else
-    4. cite us
+## Jupyter notebook example
 
-2. Run ```python examples/uci_drug/transform_uci_drug.py <raw dir> <processed dir>``` to preprocess the data from the extract directory into a new directory. This will create {train,dev,test}.tsv in the processed directory specified, where the sentiment ratings have been collapsed into 3 categories.
+See the [example notebook](./uci_drug.ipynb) for a step-by-step walkthrough of
+how to use CNLPT to train a model for sentiment classification of drug reviews.
 
-3. Fine-tune with something like:
+## CLI example
 
-```bash
-cnlpt train \
-  --data_dir <processed dir> \
-  --task_name sentiment \
-  --encoder_name roberta-base \
-  --do_train \
-  --do_eval \
-  --cache_dir cache/ \
-  --output_dir temp/ \
-  --overwrite_output_dir \
-  --evals_per_epoch 5 \
-  --num_train_epochs 1 \
-  --learning_rate 1e-5 \
-  --report_to none \
-  --metric_for_best_model eval_sentiment.avg_micro_f1 \
-  --load_best_model_at_end \
-  --save_strategy best
-```
-
-On our hardware, that command results in eval performance like the following:
-```sentiment = {'acc': 0.7041800643086816, 'f1': [0.7916666666666666, 0.7228915662650603, 0.19444444444444442], 'acc_and_f1': [0.7479233654876741, 0.7135358152868709, 0.449312254376563], 'recall': [0.8216216216216217, 0.8695652173913043, 0.12280701754385964], 'precision': [0.7638190954773869, 0.6185567010309279, 0.4666666666666667]}```
-
-#### Error Analysis for Classification
-
-If you run the above command with the `--error_analysis` flag, you can obtain the `dev` instances for which the model made an erroneous
-prediction, organized by their original index in `dev` split, in the `eval_predictions...tsv` file in the `--output_dir` argument.
-For us the first line of this file (after the header) is:
-
-```
-text	sentiment
-2	Benefits: <cr> helped aleviate whip lash symptoms <cr> Side effects: <cr> none that i noticed <cr> Overall comments: <cr> i took the medications for the prescribed time and symptoms improved, however, I still have some symptoms which are being treated through physical therapy since the accident was only in December	Ground: Medium Predicted: High
-
-```
-
-The number at the beginning of the line, 2, is the index of the instance in the `dev` split. The `text` column contains the text of the erroneous instances and the following columns are the tasks provided to the model, in this case, just `sentiment`. `Ground: Medium Predicted: High` indicates that the provided ground truth label for the instance sentiment is `Medium` but the model predicted `High`.
-
-#### Human Readable Predictions for Classification
+If you prefer, you can instead use the CLI to train the model:
 
-Similarly if you run the above command with `--do_predict` you can obtain human readable predictions for the `test` split, in the `test_predictions...tsv` file. For us the first line of this file (after the header) is:
-
-```
-0	Benefits: <cr> The antibiotic may have destroyed bacteria causing my sinus infection. But it may also have been caused by a virus, so its hard to say. <cr> Side effects: <cr> Some back pain, some nauseau. <cr> Overall comments: <cr> Took the antibiotics for 14 days. Sinus infection was gone after the 6th day.	Low
-
-```
-
-##### Prediction Probability Outputs for Classification
-
-(Currently only supported for classification tasks), if you run the above command with the `--output_prob` flag, you can see the model's softmax-obtained probability for the predicted classification label. The first error analysis sample from `dev` would now looks like:
-
-```
-text	sentiment
-2	Benefits: <cr> helped aleviate whip lash symptoms <cr> Side effects: <cr> none that i noticed <cr> Overall comments: <cr> i took the medications for the prescribed time and symptoms improved, however, I still have some symptoms which are being treated through physical therapy since the accident was only in December	Ground: Medium Predicted: High , Probability 0.613825
+### Download and preprocess the data
 
+Use the [`prepare_data.py`](./prepare_data.py) script to download the data and convert it to CNLPT's data format:
 
+```bash
+uv run prepare_data.py
 ```
 
-And the first prediction sample from `test` now looks like:
+> [!TIP] About the dataset:
+> This script downloads the
+> [*Drug Reviews (Druglib.com)* dataset](https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com).
+> Please be aware of the terms of use:
+>
+> > Important Notes:
+> >
+> > When using this dataset, you agree that you
+> >
+> > 1) only use the data for research purposes
+> > 2) don't use the data for any commerical purposes
+> > 3) don't distribute the data to anyone else
+> > 4) cite UCI data lab and the source
+>
+> Here is the dataset's BibTeX citation:
+>
+> ```bibtex
+> @misc{drug_reviews_(druglib.com)_461,
+>   author = {Kallumadi, Surya and Gräßer, Felix},
+>   title = {{Drug Reviews (Druglib.com)}},
+>   year = {2018},
+>   howpublished = {UCI Machine Learning Repository},
+>   note = {{DOI}: https://doi.org/10.24432/C55G6J}
+> }
+> ```
+
+### Train a model
+
+The following example fine-tunes
+[the RoBERTa base model](https://huggingface.co/FacebookAI/roberta-base)
+with an added projection layer for classification:
 
-```
-text	sentiment
-0	Benefits: <cr> The antibiotic may have destroyed bacteria causing my sinus infection. But it may also have been caused by a virus, so its hard to say. <cr> Side effects: <cr> Some back pain, some nauseau. <cr> Overall comments: <cr> Took the antibiotics for 14 days. Sinus infection was gone after the 6th day.	Low , Probability 0.370522
+```bash
+uv run cnlpt train \
+  --model_type proj \
+  --encoder roberta-base \
+  --data_dir ./dataset \
+  --task sentiment \
+  --output_dir ./train_output \
+  --overwrite_output_dir \
+  --do_train --do_eval --do_predict \
+  --evals_per_epoch 2 \
+  --learning_rate 1e-5 \
+  --metric_for_best_model 'sentiment.macro_f1' \
+  --load_best_model_at_end \
+  --save_strategy best
 ```
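The old README noted that the dataset's 10-point ratings are collapsed into three sentiment classes (Low/Medium/High). The exact cutoffs used by the example's preprocessing are not shown here, so the thresholds below are illustrative assumptions only:

```python
def collapse_rating(rating: int) -> str:
    """Map a 1-10 drug-review rating to a 3-way sentiment label.

    The 1-4 / 5-7 / 8-10 cutoffs are illustrative assumptions; the
    example's own preprocessing script defines the actual binning.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1..10, got {rating}")
    if rating <= 4:
        return "Low"
    if rating <= 7:
        return "Medium"
    return "High"
```

Whatever binning is used, the resulting label becomes the value of the `sentiment` column in the generated {train,dev,test}.tsv files.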
