Commit 19b696b

CU-8698x63kt: Add v2 migration guide (#66)
* CU-8698x63kt: Add v2 migration guide * CU-8698x63kt: Add v2 migration guide link to main README
1 parent 5a35e08 commit 19b696b

2 files changed

+201
-1
lines changed

medcat-v2/README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,8 @@
11
# Medical <img src="https://github.com/CogStack/cogstack-nlp/blob/main/media/cat-logo.png?raw=true" width=45> oncept Annotation Tool (version 2)
22

33
**There's a number of breaking changes in MedCAT v2 compared to v1.**
4-
Details are outlined [here](docs/breaking_changes.md).
4+
When moving from v1 to v2, please refer to the [migration guide](docs/migration_guide_v2.md).
5+
Details on breaking changes are outlined [here](docs/breaking_changes.md).
56

67
[![Build Status](https://github.com/CogStack/cogstack-nlp/actions/workflows/medcat-v2_main.yml/badge.svg?branch=main)](https://github.com/CogStack/cogstack-nlp/actions/workflows/medcat-v2_main.yml/badge.svg?branch=main)
78
[![Documentation Status](https://readthedocs.org/projects/cogstack-nlp/badge/?version=latest)](https://readthedocs.org/projects/cogstack-nlp/badge/?version=latest)
Lines changed: 199 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,199 @@
1+
# MedCAT v2 Migration Guide
2+
3+
Welcome to [MedCAT v2](https://docs.cogstack.org/projects/nlp/en/latest/)!
4+
5+
This guide is for users upgrading from **v1.x** to **v2.x** of MedCAT.
6+
It covers what’s changed, what steps you need to take to upgrade, and what to expect from the new version.
7+
For most single-threaded inference users, things will continue to work as before.
8+
However, the APIs for training (both supervised and unsupervised) have been **refactored** somewhat.
9+
10+
---
11+
12+
## Why v2?
13+
14+
MedCAT v2 is a refactor designed to:
15+
- Increase modularity
16+
  - The core library is a lot more lightweight and only includes essential components
17+
  - Additional features (many of which were always provided in v1) now need to be explicitly specified at install time
18+
- `spacy` for tokenizing
19+
    - `deid` for transformer-based NER / de-identification
20+
- `meta-cat` for meta annotations (both LSTM and BERT)
21+
- `rel-cat` for relation extraction
22+
  - The above means that `pip install "medcat>=2.0"` will **not** include everything that came with v1
23+
    - And **models built / saved in v1 cannot be loaded** in this bare install (without the extras)
24+
- There will be more details on installs in the next section(s)
25+
- This comes with a number of clear advantages
26+
- Smaller installs
27+
- You don't need to install components you're not going to use
28+
- Better separation / grouping of dependencies
29+
      - Each separate feature defines its own dependencies
30+
- Lower internal coupling with `spacy`
31+
      - This allows us to use other tokenizers, at least for the built-in NER and Linker
32+
- There's now registration available for other tokenizers
33+
      - There's even an example of a regular-expression-based tokenizer built into the library
34+
- This serves more as a sample rather than an actual alternative
35+
- Increase extensibility and flexibility
36+
- It's now a lot easier to create new components
37+
- Core components (NER, Linker)
38+
- Addons (MetaCAT, RelCAT)
39+
- Improve maintainability of code and models
40+
- Prepare for future use cases and integrations
41+
42+
---
43+
44+
## Who should read this?
45+
46+
If you're:
47+
- Using MedCAT v1 (almost everything prior to **August 2025**)
48+
- Loading or training models saved before that date
49+
- Calling internal APIs (beyond basic `cat.get_entities`)
50+
51+
...then this guide is for you.
52+
53+
---
54+
55+
## How to install v2
56+
57+
Upgrading to the latest MedCAT version depends a little bit on which features you want / need.
58+
If you want an identical experience to v1, you should be able to simply:
59+
60+
```bash
61+
pip install -U "medcat[spacy,meta-cat,rel-cat,deid]>=2.0"
62+
```
63+
64+
However, you may want to avoid installing some of the additional features if you do not need them.
65+
Here's a list of the additional features you can opt for and what they're used for.
66+
| Feature Group | Install Name | Description |
67+
| ------------------- | ------------ | -------------------------------------------------------------------------- |
68+
| `spaCy` Tokenizer | `spacy` | Enables `spacy`-based tokenization, as used in MedCAT v1 |
69+
| MetaCAT Annotations | `meta-cat` | Supports meta-annotations like temporality, presence, and relevance |
70+
| Transformer NER | `deid` | Enables transformer-based NER, primarily used for de-identification models |
71+
| Relation Extraction | `rel-cat` | Adds support for extracting relations between entities |
72+
| Dictionary NER | `dict-ner` | Example dictionary NER module (experimental and rarely needed) |
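
As a minimal sketch of how the table translates into an install command, the helper below assembles a requirement string from the install names listed above. `build_requirement` and `KNOWN_EXTRAS` are hypothetical names introduced for illustration and are not part of MedCAT.

```python
# Install names taken from the "Install Name" column of the table above.
KNOWN_EXTRAS = ("spacy", "meta-cat", "deid", "rel-cat", "dict-ner")


def build_requirement(extras, min_version="2.0"):
    """Return a pip requirement string such as 'medcat[spacy,deid]>=2.0'."""
    unknown = [e for e in extras if e not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"Unknown extras: {unknown}")
    # Keep the caller's order but drop duplicates.
    unique = list(dict.fromkeys(extras))
    suffix = f"[{','.join(unique)}]" if unique else ""
    return f"medcat{suffix}>={min_version}"


if __name__ == "__main__":
    requirement = build_requirement(["spacy", "deid"])
    # Quote the requirement in a shell, since ">" would otherwise redirect output.
    print(f'pip install -U "{requirement}"')
```

Quoting matters here: an unquoted `medcat>=2.0` on a shell command line is parsed as a redirect to a file named `=2.0`.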
73+
74+
## Summary of Changes
75+
76+
See the full list of breaking changes [here](breaking_changes.md).
77+
This is just a small summary.
78+
79+
### What hasn’t changed
80+
- Core single-threaded inference APIs (`cat.get_entities`, `cat.__call__`)
81+
- Model loading: `CAT.load_model_pack` still works very similarly
82+
- Your existing v1 models are still usable
83+
- They will be converted on the fly when loaded
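
A small sketch of what unchanged inference code might look like. It assumes the output of `cat.get_entities` keeps the v1 layout (an `"entities"` mapping whose values carry a `"cui"` field) and that `CAT` is importable from `medcat.cat`; both are assumptions here, so check them against the breaking changes list. `extract_cuis` is a hypothetical helper, not a MedCAT function.

```python
def extract_cuis(cat, text):
    """Return the CUIs that `cat` finds in `text`, in listing order."""
    output = cat.get_entities(text)
    # Assumed layout (as in v1): {"entities": {entity_id: {"cui": ..., ...}}}
    return [entity["cui"] for entity in output["entities"].values()]


# Usage with a real model pack (requires MedCAT and a model; path is a placeholder):
#   from medcat.cat import CAT  # assumed import path
#   cat = CAT.load_model_pack("<path to model pack>")
#   print(extract_cuis(cat, "Patient was diagnosed with diabetes."))
```

Because the helper only relies on `get_entities`, the same code should run against a v1 or v2 installation if the output layout assumption holds.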
84+
85+
### What _has_ changed
86+
- Training goes through a new class-based API
87+
- Instead of `cat.train` you can use `cat.trainer.train_unsupervised`
88+
- Instead of `cat.train_supervised_raw` you can use `cat.trainer.train_supervised_raw`
89+
- The save method has been renamed
90+
- Renamed from `cat.create_model_pack` to `cat.save_model_pack`
91+
- Internal structure of concepts / names is more structured
92+
- There's the `cdb.cui2info` and `cdb.name2info` maps
93+
- More details in the breaking changes overview
94+
- Models are saved in a new format
95+
- The idea was to simplify the (potential) addition of other serialisation options
96+
- Most of the model handling is still the same
97+
- There's a `.zip` to move around if/when needed
98+
- The model pack unpacks into its components
99+
- Model components are saved differently
100+
- This mostly affects MetaCAT and RelCAT models
101+
- Components are saved in the `saved_components` folder within the model folder
102+
    - E.g. `saved_components/addon_meta_cat.Presence` for MetaCAT and `addon_rel_cat.rel_cat` for RelCAT
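
To locate v1 calls that need updating, something along these lines could be run over existing scripts. It is a rough sketch: it assumes the model object is held in a variable literally named `cat`, and it only knows the three renames listed above. `V1_TO_V2` and `find_v1_calls` are hypothetical names, not part of MedCAT.

```python
import re

# v1 call -> v2 replacement, as listed in "What has changed" above.
V1_TO_V2 = {
    "cat.train_supervised_raw": "cat.trainer.train_supervised_raw",
    "cat.train": "cat.trainer.train_unsupervised",
    "cat.create_model_pack": "cat.save_model_pack",
}


def find_v1_calls(source):
    """Return (line number, v1 call, suggested v2 call) for each match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for old, new in V1_TO_V2.items():
            # Require "(" right after the name so that `cat.train` does not
            # match `cat.trainer...` or `cat.train_supervised_raw`.
            pattern = r"(?<![\w.])" + re.escape(old) + r"\s*\("
            if re.search(pattern, line):
                hits.append((lineno, old, new))
    return hits


if __name__ == "__main__":
    example = "cat.train(docs)\ncat.create_model_pack('out')\n"
    for lineno, old, new in find_v1_calls(example):
        print(f"line {lineno}: replace {old} with {new}")
```

The scan is purely textual, so it will miss calls made through differently named variables and should be treated as a starting point for a manual review.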
103+
104+
## ⚠️ Loading v1 models
105+
106+
MedCAT v2 supports loading v1 models.
107+
There is no need to retrain them.
108+
However, loading will:
109+
- be significantly slower due to on-the-fly conversion
110+
- show a warning message about this slowdown
111+
112+
We recommend re-saving v1 models using `cat.save_model_pack` in v2 format to mitigate this.
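
The re-save step could be sketched as follows. Only `CAT.load_model_pack` and `cat.save_model_pack` are named in this guide; the `medcat.cat` import path and the assumption that `save_model_pack` accepts a target folder as its first argument should be verified against the API docs. `resave_v1_model` and its `cat_cls` parameter are hypothetical and exist only to keep the sketch easy to try out.

```python
def resave_v1_model(v1_pack_path, target_folder, cat_cls=None):
    """Load a v1 model pack (converted on the fly) and save it in v2 format."""
    if cat_cls is None:
        # Assumed import path; requires MedCAT v2 with the relevant extras.
        from medcat.cat import CAT as cat_cls
    cat = cat_cls.load_model_pack(v1_pack_path)  # slow: v1 -> v2 conversion
    # Assumption: the first argument is the folder to write the new pack into.
    return cat.save_model_pack(target_folder)


# Usage (paths are placeholders):
#   resave_v1_model("<path to v1 model pack>.zip", "<output folder>")
```

Subsequent loads should then point at the newly saved pack rather than the original v1 file.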
113+
114+
115+
## Updated Tutorials
116+
117+
All v2 tutorials have been completely redone.
118+
They do not go into as much detail as the v1 tutorials did.
119+
But they should hopefully cover most of the use cases.
120+
The v2 tutorials are available [here](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-v2-tutorials).
121+
122+
## Updated `working_with_cogstack` scripts
123+
124+
The `working_with_cogstack` scripts have also been upgraded to support v2.
125+
The changes are currently in [this PR](https://github.com/CogStack/working_with_cogstack/pull/20).
126+
They have not yet been merged into the `main` branch but will be in the near future.
127+
At that point, there will probably be a separate branch to keep track of v1-specific scripts.
128+
129+
## MedCATtrainer
130+
131+
MedCATtrainer has been modified to work with v2 in [this PR](https://github.com/CogStack/MedCATtrainer/pull/253).
132+
However, at the time of writing, this change has not yet been merged or released.
133+
The v2-compatible version will most likely be released as **v3** on the trainer side.
134+
135+
## Feedback welcome!
136+
137+
We’d love your input / feedback!
138+
Please report any issues or feature requests you encounter.
139+
That includes (but is not limited to):
140+
- Inability to use / run / load old models
141+
- Missing or unclear documentation
142+
- Unexpected errors or regressions
143+
- Confusing logs or error messages
144+
- Any other usability feedback
145+
146+
Create a [GitHub issue](https://github.com/CogStack/cogstack-nlp/issues/new) or start a thread on [Discourse](https://discourse.cogstack.org/).
147+
148+
## FAQ
149+
150+
**Q: Do I need to retrain my model?**
151+
152+
A: v1 models still work, but loading them is slower. We recommend re-saving after loading.
153+
154+
**Q: Why is model loading slower than before?**
155+
156+
A: v1 models are converted at load time to the new internal format. Once re-saved, load speed will be similar to before.
157+
158+
**Q: Does inference break in v2?**
159+
160+
A: Using `cat.get_entities` should be identical, but multiprocessing is somewhat different; see [breaking changes](breaking_changes.md) for details.
161+
162+
**Q: What extras do I need for a converted NER+L model (no MetaCAT)?**
163+
164+
A: You just need `spacy`. So `pip install "medcat[spacy]>=2.0"` should be sufficient.
165+
166+
**Q: What extras do I need for a converted DeID model?**
167+
168+
A: You need `spacy` (for base tokenization) as well as `deid`. So `pip install "medcat[spacy,deid]>=2.0"` should be sufficient.
169+
170+
**Q: What extras do I need for a converted NER+L model with MetaCAT?**
171+
172+
A: You need `spacy` (for base tokenization) as well as `meta-cat`. So `pip install "medcat[spacy,meta-cat]>=2.0"` should be sufficient.
173+
174+
**Q: What extras do I need for a converted NER+L model with RelCAT?**
175+
176+
A: You need `spacy` (for base tokenization) as well as `rel-cat`. So `pip install "medcat[spacy,rel-cat]>=2.0"` should be sufficient.
177+
178+
**Q: How do I train in v2?**
179+
180+
A: Training now uses a dedicated `medcat.trainer.Trainer` class. See tutorials and/or [breaking changes](breaking_changes.md) for details.
181+
182+
**Q: Are v1 `working_with_cogstack` scripts still supported?**
183+
184+
A: No. Many will break due to internal changes. Please refer to the new scripts in the [relevant branch](https://github.com/CogStack/working_with_cogstack/pull/20).
185+
186+
187+
**Q: Does MedCATtrainer work out of the box for v2?**
188+
189+
A: No. While the [changes have been ported](https://github.com/CogStack/MedCATtrainer/pull/253), there is currently no release containing them, so existing deployments are unlikely to support v2 yet. A release is expected soon.
190+
191+
192+
**Q: Does `medcat-service` work for serving a model?**
193+
194+
A: The [service](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-service) has been fully ported to v2.
195+
196+
197+
**Q: Does the demo app work with v2?**
198+
199+
A: The [demo web app](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-demo-app) has been fully ported to v2.
