| 1 | +# MedCAT v2 Migration Guide |
| 2 | + |
| 3 | +Welcome to [MedCAT v2](https://docs.cogstack.org/projects/nlp/en/latest/)! |
| 4 | + |
| 5 | +This guide is for users upgrading from **v1.x** to **v2.x** of MedCAT. |
| 6 | +It covers what has changed, the steps needed to upgrade, and what to expect from the new version. |
| 7 | +For most single-threaded inference users, things will continue to work as before. |
| 8 | +However, the APIs for training (both supervised and unsupervised) have been **refactored**. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## Why v2? |
| 13 | + |
| 14 | +MedCAT v2 is a refactor designed to: |
| 15 | +- Increase modularity |
| 16 | + - The core library is much more lightweight and only includes essential components |
| 17 | + - Additional features (many of which were always included in v1) now need to be specified explicitly at install time |
| 18 | + - `spacy` for tokenizing |
| 19 | + - `deid` for transformer-based NER / de-identification |
| 20 | + - `meta-cat` for meta annotations (both LSTM and BERT) |
| 21 | + - `rel-cat` for relation extraction |
| 22 | + - The above means that `pip install "medcat>=2.0"` will **not** include everything that came with v1 |
| 23 | + - And **models built / saved in v1 cannot be loaded** with this install |
| 24 | + - Install options are covered in more detail in the "How to install v2" section below |
| 25 | + - This comes with a number of clear advantages |
| 26 | + - Smaller installs |
| 27 | + - You don't need to install components you're not going to use |
| 28 | + - Better separation / grouping of dependencies |
| 29 | + - Each separate feature defines its own dependencies |
| 30 | +- Lower internal coupling with `spacy` |
| 31 | + - This allows us to use other tokenizers, at least for the built-in NER and Linker |
| 32 | + - There's now registration available for other tokenizers |
| 33 | + - There's even an example of a regular-expression-based tokenizer built into the library |
| 34 | + - This serves more as a sample than as an actual alternative |
| 35 | +- Increase extensibility and flexibility |
| 36 | + - It's now a lot easier to create new components |
| 37 | + - Core components (NER, Linker) |
| 38 | + - Addons (MetaCAT, RelCAT) |
| 39 | +- Improve maintainability of code and models |
| 40 | +- Prepare for future use cases and integrations |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## Who should read this? |
| 45 | + |
| 46 | +If you're: |
| 47 | +- Using MedCAT v1 (almost everything prior to **August 2025**) |
| 48 | +- Loading or training models saved before that date |
| 49 | +- Calling internal APIs (beyond basic `cat.get_entities`) |
| 50 | + |
| 51 | +...then this guide is for you. |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## How to install v2 |
| 56 | + |
| 57 | +How you upgrade to the latest MedCAT version depends on which features you need. |
| 58 | +If you want an identical experience to v1, you should be able to simply: |
| 59 | + |
| 60 | +```bash |
| 61 | +pip install -U "medcat[spacy,meta-cat,rel-cat,deid]>=2.0" |
| 62 | +``` |
| 63 | + |
| 64 | +However, you may want to skip some of the additional features if you do not need them. |
| 65 | +The table below lists the optional features and what each one is used for. |
| 66 | +| Feature Group | Install Name | Description | |
| 67 | +| ------------------- | ------------ | -------------------------------------------------------------------------- | |
| 68 | +| `spaCy` Tokenizer | `spacy` | Enables `spacy`-based tokenization, as used in MedCAT v1 | |
| 69 | +| MetaCAT Annotations | `meta-cat` | Supports meta-annotations like temporality, presence, and relevance | |
| 70 | +| Transformer NER | `deid` | Enables transformer-based NER, primarily used for de-identification models | |
| 71 | +| Relation Extraction | `rel-cat` | Adds support for extracting relations between entities | |
| 72 | +| Dictionary NER | `dict-ner` | Example dictionary NER module (experimental and rarely needed) | |
| 73 | + |
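As a rough illustration of how the extras in the table combine into an install command, here is a minimal sketch. The helper `build_install_command` and the `KNOWN_EXTRAS` tuple are hypothetical names made up for this example and are not part of MedCAT; only the extra names themselves come from the table above.

```python
# Extra names copied from the table above.
# The helper below is hypothetical and not part of MedCAT.
KNOWN_EXTRAS = ("spacy", "meta-cat", "rel-cat", "deid", "dict-ner")


def build_install_command(extras=(), min_version="2.0"):
    """Return a pip command that installs MedCAT with the given extras."""
    requested = set(extras)
    unknown = sorted(requested - set(KNOWN_EXTRAS))
    if unknown:
        raise ValueError(f"Unknown MedCAT extras: {unknown}")
    # Follow the order of KNOWN_EXTRAS so the output is deterministic.
    chosen = [name for name in KNOWN_EXTRAS if name in requested]
    spec = f"medcat[{','.join(chosen)}]" if chosen else "medcat"
    # The quotes stop the shell from treating ">=" as a redirect.
    return f'pip install -U "{spec}>={min_version}"'


print(build_install_command(["spacy", "meta-cat"]))
# prints: pip install -U "medcat[spacy,meta-cat]>=2.0"
```

Calling it with no extras yields the bare core install, which is the lightweight option described earlier.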
| 74 | +## Summary of Changes |
| 75 | + |
| 76 | +See the full list of breaking changes [here](breaking_changes.md). |
| 77 | +What follows is only a brief summary. |
| 78 | + |
| 79 | +### What hasn’t changed |
| 80 | +- Core single-threaded inference APIs (`cat.get_entities`, `cat.__call__`) |
| 81 | +- Model loading: `CAT.load_model_pack` still works very similarly |
| 82 | +- Your existing v1 models are still usable |
| 83 | + - They will be converted on the fly when loaded |
| 84 | + |
| 85 | +### What _has_ changed |
| 86 | +- Training goes through a new class-based API |
| 87 | + - Instead of `cat.train` you can use `cat.trainer.train_unsupervised` |
| 88 | + - Instead of `cat.train_supervised_raw` you can use `cat.trainer.train_supervised_raw` |
| 89 | +- The save method has been renamed |
| 90 | + - From `cat.create_model_pack` to `cat.save_model_pack` |
| 91 | +- The internal representation of concepts / names is more structured |
| 92 | + - There's the `cdb.cui2info` and `cdb.name2info` maps |
| 93 | + - More details in the breaking changes overview |
| 94 | +- Models are saved in a new format |
| 95 | + - The idea was to simplify the (potential) addition of other serialisation options |
| 96 | + - Most of the model handling is still the same |
| 97 | + - There's a `.zip` to move around if/when needed |
| 98 | + - The model pack unpacks into its components |
| 99 | +- Model components are saved differently |
| 100 | + - This mostly affects MetaCAT and RelCAT models |
| 101 | + - Components are saved in the `saved_components` folder within the model folder |
| 102 | + - E.g. `saved_components/addon_meta_cat.Presence` for MetaCAT and `addon_rel_cat.rel_cat` for RelCAT |
| 103 | + |
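For code that has to run against both versions during a transition, the renames listed above can be hidden behind a small shim. This is only a sketch: the function names `train_unsupervised_compat` and `save_compat` are hypothetical, and it assumes that a v1 `CAT` object has no `trainer` attribute and that each method accepts the data iterator (or target folder) as its first positional argument. Check the breaking changes document before relying on either assumption.

```python
def train_unsupervised_compat(cat, texts):
    """Run unsupervised training on either a v1 or a v2 CAT-like object."""
    trainer = getattr(cat, "trainer", None)
    if trainer is not None:
        # v2: training goes through the trainer object.
        return trainer.train_unsupervised(texts)
    # v1 (assumed to lack `trainer`): training is a method on CAT itself.
    return cat.train(texts)


def save_compat(cat, target_folder):
    """Save a model pack using whichever method name the object provides."""
    if hasattr(cat, "save_model_pack"):
        # v2 name.
        return cat.save_model_pack(target_folder)
    # v1 name.
    return cat.create_model_pack(target_folder)
```

Once everything runs on v2, the shim can be dropped in favour of calling `cat.trainer.train_unsupervised` and `cat.save_model_pack` directly.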
| 104 | +## ⚠️ Loading v1 models |
| 105 | + |
| 106 | +MedCAT v2 supports loading v1 models. |
| 107 | +There is no need to retrain them. |
| 108 | +However, loading will: |
| 109 | +- be significantly slower due to on-the-fly conversion |
| 110 | +- show a warning message about this slowdown |
| 111 | + |
| 112 | +To avoid paying this cost on every load, we recommend re-saving v1 models in the v2 format using `cat.save_model_pack`. |
| 113 | + |
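The one-off conversion described above could look roughly like the sketch below. The helper `resave_as_v2` and its `load` parameter are hypothetical, the import path `medcat.cat` is an assumption, and the paths in the usage comment are placeholders. `CAT.load_model_pack` and `cat.save_model_pack` are the methods named in this guide, but their full signatures are not shown here, so only a single positional argument is passed to each.

```python
def resave_as_v2(v1_pack_path, target_folder, load=None):
    """Load a model pack (converting it if it is v1) and save it in v2 format."""
    if load is None:
        # Imported lazily so the sketch can be read without MedCAT installed.
        # The module path is an assumption.
        from medcat.cat import CAT
        load = CAT.load_model_pack
    # Slow for v1 packs: the conversion happens on the fly and logs a warning.
    cat = load(v1_pack_path)
    # Later loads of the re-saved pack skip the conversion.
    return cat.save_model_pack(target_folder)


# Usage (paths are placeholders):
# resave_as_v2("models/my_v1_model_pack.zip", "models/v2")
```

The `load` parameter exists only so the helper can be exercised without a real model; in normal use it can be left at its default.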
| 114 | + |
| 115 | +## Updated Tutorials |
| 116 | + |
| 117 | +The tutorials have been completely redone for v2. |
| 118 | +They do not go into as much detail as the v1 tutorials did. |
| 119 | +But they should cover most use cases. |
| 120 | +The v2 tutorials are available [here](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-v2-tutorials). |
| 121 | + |
| 122 | +## Updated `working_with_cogstack` scripts |
| 123 | + |
| 124 | +The `working_with_cogstack` scripts have also been upgraded to support v2. |
| 125 | +The changes are currently in [this PR](https://github.com/CogStack/working_with_cogstack/pull/20). |
| 126 | +They have not yet been merged into the `main` branch but will be in the near future. |
| 127 | +At that point, there will probably be a separate branch to keep track of v1-specific scripts. |
| 128 | + |
| 129 | +## MedCATtrainer |
| 130 | + |
| 131 | +MedCATtrainer has been modified to work with v2 in [this PR](https://github.com/CogStack/MedCATtrainer/pull/253). |
| 132 | +However, as of writing, this change has not yet been merged or released. |
| 133 | +Support for v2 will most likely ship as **v3** on the trainer side. |
| 134 | + |
| 135 | +## Feedback welcome! |
| 136 | + |
| 137 | +We’d love your input / feedback! |
| 138 | +Please report any issues you encounter, along with any feature requests. |
| 139 | +That includes (but is not limited to): |
| 140 | +- Inability to use / run / load old models |
| 141 | +- Missing or unclear documentation |
| 142 | +- Unexpected errors or regressions |
| 143 | +- Confusing logs or error messages |
| 144 | +- Any other usability feedback |
| 145 | + |
| 146 | +Create a [GitHub issue](https://github.com/CogStack/cogstack-nlp/issues/new) or start a thread on [Discourse](https://discourse.cogstack.org/). |
| 147 | + |
| 148 | +## FAQ |
| 149 | + |
| 150 | +**Q: Do I need to retrain my model?** |
| 151 | + |
| 152 | +A: v1 models still work, but loading them is slower. We recommend re-saving after loading. |
| 153 | + |
| 154 | +**Q: Why is model loading slower than before?** |
| 155 | + |
| 156 | +A: v1 models are converted at load time to the new internal format. Once re-saved, load speed will be similar to before. |
| 157 | + |
| 158 | +**Q: Does inference break in v2?** |
| 159 | + |
| 160 | +A: Using `cat.get_entities` should work identically, but multiprocessing is somewhat different. See [breaking changes](breaking_changes.md) for details. |
| 161 | + |
| 162 | +**Q: What extras do I need for a converted NER+L model (no MetaCAT)?** |
| 163 | + |
| 164 | +A: You just need `spacy`. So `pip install "medcat[spacy]>=2.0"` should be sufficient. |
| 165 | + |
| 166 | +**Q: What extras do I need for a converted DeID model?** |
| 167 | + |
| 168 | +A: You need `spacy` (for base tokenization) as well as `deid`. So `pip install "medcat[spacy,deid]>=2.0"` should be sufficient. |
| 169 | + |
| 170 | +**Q: What extras do I need for a converted NER+L model with MetaCAT?** |
| 171 | + |
| 172 | +A: You need `spacy` (for base tokenization) as well as `meta-cat`. So `pip install "medcat[spacy,meta-cat]>=2.0"` should be sufficient. |
| 173 | + |
| 174 | +**Q: What extras do I need for a converted NER+L model with RelCAT?** |
| 175 | + |
| 176 | +A: You need `spacy` (for base tokenization) as well as `rel-cat`. So `pip install "medcat[spacy,rel-cat]>=2.0"` should be sufficient. |
| 177 | + |
| 178 | +**Q: How do I train in v2?** |
| 179 | + |
| 180 | +A: Training now uses a dedicated `medcat.trainer.Trainer` class. See tutorials and/or [breaking changes](breaking_changes.md) for details. |
| 181 | + |
| 182 | +**Q: Are v1 `working_with_cogstack` scripts still supported?** |
| 183 | + |
| 184 | +A: No. Many will break due to internal changes. Please refer to the new scripts in the [relevant branch](https://github.com/CogStack/working_with_cogstack/pull/20). |
| 185 | + |
| 186 | + |
| 187 | +**Q: Does MedCATtrainer work out of the box for v2?** |
| 188 | + |
| 189 | +A: No. While the [changes have been ported](https://github.com/CogStack/MedCATtrainer/pull/253), there is currently no release containing them, so a running MedCATtrainer deployment is unlikely to support v2 yet. A release is expected soon. |
| 190 | + |
| 191 | + |
| 192 | +**Q: Does `medcat-service` work for serving a model?** |
| 193 | + |
| 194 | +A: The [service](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-service) has been fully ported to v2. |
| 195 | + |
| 196 | + |
| 197 | +**Q: Does the demo app work with v2?** |
| 198 | + |
| 199 | +A: The [demo web app](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-demo-app) has been fully ported to v2. |