Commit 0982f94

fix: preserve the original texts order during bulk processing
ci: use the GHA cache during CMS docker build
1 parent 5e6b52c commit 0982f94

7 files changed: 115 additions & 26 deletions

.github/workflows/docker.yaml

Lines changed: 2 additions & 10 deletions
@@ -50,14 +50,6 @@ jobs:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}
 
-      - name: Cache Docker layers
-        uses: actions/cache@v4
-        with:
-          path: /tmp/.buildx-cache
-          key: ${{ runner.os }}-buildx-${{ github.sha }}
-          restore-keys: |
-            ${{ runner.os }}-buildx-
-
       - name: Build and push CMS
         uses: docker/build-push-action@v6
         id: build_and_push_cms
@@ -68,8 +60,8 @@ jobs:
           push: true
           tags: |
             ${{ env.REGISTRY }}/${{ env.DOCKER_IMAGE_NAME }}:dev
-          cache-from: type=local,src=/tmp/.buildx-cache
-          cache-to: type=local,dest=/tmp/.buildx-cache,mode=max
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
 
       - name: Attest image artifacts
         uses: actions/attest-build-provenance@v2

app/api/routers/training_operations.py

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ async def train_eval_info(request: Request,
 
 
 @router.post("/evaluate",
-             tags=[Tags.Evaluating.name],
+             tags=[Tags.Training.name],
              response_class=JSONResponse,
              dependencies=[Depends(cms_globals.props.current_active_user)],
              description="Evaluate the model being served with a trainer export")

app/model_services/medcat_model.py

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ def batch_annotate(self, texts: List[str]) -> List[List[Annotation]]:
             nproc=max(int(cpu_count() / 2), 1),
             addl_info=["cui2icd10", "cui2ontologies", "cui2snomed", "cui2athena_ids"]
         )
+        docs = dict(sorted(docs.items(), key=lambda x: x[0]))
         annotations_list = []
         for _, doc in docs.items():
             annotations_list.append([Annotation.parse_obj(record) for record in self.get_records_from_doc(doc)])
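
The added line sorts `docs` by key before the loop builds `annotations_list`. A minimal sketch of why that matters, assuming (as the sort implies) that the multiprocessing call returns a dict keyed by each text's integer position and that entries can be inserted in completion order. `fake_multiprocessing_result` is a hypothetical stand-in and not part of MedCAT:

```python
from typing import Dict, List


def fake_multiprocessing_result(texts: List[str]) -> Dict[int, str]:
    """Hypothetical stand-in for a worker pool: results are keyed by input
    index but inserted in completion order rather than input order."""
    completion_order = [2, 0, 1]  # pretend the third text finished first
    return {i: texts[i].upper() for i in completion_order if i < len(texts)}


def restore_input_order(docs: Dict[int, str]) -> Dict[int, str]:
    # Same expression as in the commit: sort by key so that iterating the
    # dict follows the order in which the texts were submitted.
    return dict(sorted(docs.items(), key=lambda x: x[0]))


texts = ["Spinal stenosis", "Intracerebral hemorrhage", "Cerebellum"]
docs = fake_multiprocessing_result(texts)
unsorted_values = list(docs.values())
sorted_values = list(restore_input_order(docs).values())
```

The sketch assumes integer keys. If the keys were strings, the sort would be lexicographic ("10" before "2") and would need a numeric key function instead.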

tests/integration/conftest.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
 import os
 
 os.environ["PYTHONPATH"] = os.path.join(os.path.dirname(__file__), "..", "..")
-print(os.environ["PYTHONPATH"])

tests/integration/features/serving.feature

Lines changed: 58 additions & 6 deletions
@@ -1,6 +1,7 @@
 Feature:
   CogStack ModelServe APIs
 
+  @status
   Scenario Outline: Get general information about server healthiness, readiness and the running model
     Given CMS app is up and running
     When I send a GET request to <endpoint>
@@ -12,92 +13,103 @@ Feature:
       | /readyz | medcat_umls | 200 |
       | /info | "model_type":"medcat_umls" | 200 |
 
+  @ner
   Scenario: Extract entities from free texts
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /process | Spinal stenosis | text/plain |
     Then the response should contain annotations
 
+  @ner
   Scenario: Extract entities from JSON Lines
     Given CMS app is up and running
     When I send a POST request with the following jsonlines content
       | endpoint | data | content_type |
       | /process_jsonl | {"name": "doc1", "text": "Spinal stenosis"}\n{"name": "doc2", "text": "Spinal stenosis"} | application/x-ndjson |
     Then the response should contain json lines
 
-  @skip
+  @ner
   Scenario: Extract entities from bulk texts
     Given CMS app is up and running
     When I send a POST request with the following content
-      | endpoint | data | content_type |
-      | /process_bulk | ["Spinal stenosis", "Spinal stenosis"] | application/json |
+      | endpoint | data | content_type |
+      | /process_bulk | ["Spinal stenosis", "Intracerebral hemorrhage", "Cerebellum"] | application/json |
     Then the response should contain bulk annotations
 
-  @skip
+  @ner
   Scenario: Extract entities from a file with bulk texts
     Given CMS app is up and running
     When I send a POST request with the following content where data as a file
-      | endpoint | data | content_type |
-      | /process_bulk_file | ["Spinal stenosis", "Spinal stenosis"] | multipart/form-data |
+      | endpoint | data | content_type |
+      | /process_bulk_file | ["Spinal stenosis", "Intracerebral hemorrhage", "Cerebellum"] | multipart/form-data |
     Then the response should contain bulk annotations
 
+  @redaction
   Scenario: Extract and redact entities from free texts
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact | Spinal stenosis | text/plain |
     Then the response should contain text [spinal stenosis]
 
+  @redaction
   Scenario: Extract and redact entities from free texts with a mask
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact?mask=*** | Spinal stenosis | text/plain |
     Then the response should contain text ***
 
+  @redaction
   Scenario: Extract and redact entities from free texts with a hash
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact?mask=any&hash=true | Spinal stenosis | text/plain |
     Then the response should contain text 4c86af83314100034ad83fae3227e595fc54cb864c69ea912cd5290b8d0f41a4
 
+  @redaction
   Scenario: Warn when no entities are detected for redaction
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact?warn_on_no_redaction=true | abcdefgh | text/plain |
     Then the response should contain text warning: no entities were detected for redaction.
 
+  @redaction
   Scenario: Extract and redact entities if not filtered out
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact?concepts_to_keep=C0037944 | Spinal stenosis | text/plain |
     Then the response should contain text spinal stenosis
 
+  @redaction
   Scenario: Extract and redact entities with encryption
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /redact_with_encryption | {"text": "Spinal stenosis", "public_key_pem": "-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA3ITkTP8Tm/5FygcwY2EQ7LgVsuCF0OH7psUqvlXnOPNCfX86CobHBiSFjG9o5ZeajPtTXaf1thUodgpJZVZSqpVTXwGKo8r0COMO87IcwYigkZZgG/WmZgoZART+AA0+JvjFGxflJAxSv7puGlf82E+u5Wz2psLBSDO5qrnmaDZTvPh5eX84cocahVVI7X09/kI+sZiKauM69yoy1bdx16YIIeNm0M9qqS3tTrjouQiJfZ8jUKSZ44Na/81LMVw5O46+5GvwD+OsR43kQ0TexMwgtHxQQsiXLWHCDNy2ZzkzukDYRwA3V2lwVjtQN0WjxHg24BTBDBM+v7iQ7cbweQIDAQAB\n-----END PUBLIC KEY-----"} | application/json |
     Then the response should contain encrypted labels
 
+  @preview
   Scenario: Extract and preview entities
     Given CMS app is up and running
     When I send a POST request with the following content
       | endpoint | data | content_type |
       | /preview | Spinal stenosis | text/plain |
     Then the response should contain a preview page
 
+  @preview
   Scenario: Preview trainer export
     Given CMS app is up and running
     When I send a POST request with the following trainer export
       | endpoint | file_name | content_type |
       | /preview_trainer_export?project_id=14&document_id=3204 | trainer_export.json | multipart/form-data |
     Then the response should contain a preview page
 
+  @train
   Scenario: Train supervised
     Given CMS app is up and running
     When I send a POST request with the following trainer export
@@ -109,6 +121,7 @@ Feature:
     When I send a GET request to /train_eval_metrics with that ID
     Then the response should contain the supervised evaluation metrics
 
+  @train
   Scenario: Train unsupervised
     Given CMS app is up and running
     When I send a POST request with the following training data
@@ -120,6 +133,7 @@ Feature:
     When I send a GET request to /train_eval_metrics with that ID
     Then the response should contain the unsupervised evaluation metrics
 
+  @train
   Scenario: Evaluate served model
     Given CMS app is up and running
     When I send a POST request with the following trainer export
@@ -128,3 +142,41 @@ Feature:
     Then the response should contain the evaluation ID
     When I send a GET request to /train_eval_info with that ID
     Then the response should contain the evaluation information
+
+  @misc
+  Scenario: Sanity check the model with a trainer export
+    Given CMS app is up and running
+    When I send a POST request with the following trainer export
+      | endpoint | file_name | content_type |
+      | /sanity-check | trainer_export.json | multipart/form-data |
+    Then the response should contain evaluation metrics per concept
+
+  @misc
+  Scenario Outline: Calculate Inter Annotator Agreement (IAA) scores between two annotation projects
+    Given CMS app is up and running
+    When I send a POST request with the following trainer export
+      | endpoint | file_name | content_type |
+      | /iaa-scores?annotator_a_project_id=14&annotator_b_project_id=15&scope=<scope> | trainer_export.json,another_trainer_export.json | application/json |
+    Then the response should contain IAA scores
+
+    Examples:
+      | scope |
+      | per_concept |
+      | per_document |
+      | per_span |
+
+  @misc
+  Scenario: Concatenate multiple trainer export files into a single file
+    Given CMS app is up and running
+    When I send a POST request with the following trainer export
+      | endpoint | file_name | content_type |
+      | /concat_trainer_exports | trainer_export.json,another_trainer_export.json | multipart/form-data |
+    Then the response should contain a concatenated trainer export
+
+  @misc
+  Scenario: Get annotation stats of trainer export files
+    Given CMS app is up and running
+    When I send a POST request with the following trainer export
+      | endpoint | file_name | content_type |
+      | /annotation-stats | trainer_export.json,another_trainer_export.json | multipart/form-data |
+    Then the response should contain annotation stats
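
The bulk scenarios in this file switch from two identical texts to three distinct ones. A small sketch of why distinct inputs are needed to catch an ordering regression, assuming only that each bulk result carries a "text" field (as the step assertions do). `first_order_mismatch` is a hypothetical helper, not part of the test suite:

```python
from typing import Dict, List


def first_order_mismatch(sent: List[str], results: List[Dict]) -> int:
    """Return the index of the first result whose "text" differs from the
    text sent at that position, or -1 if order and length both match."""
    for i, (text, result) in enumerate(zip(sent, results)):
        if result["text"] != text:
            return i
    if len(sent) != len(results):
        return min(len(sent), len(results))
    return -1


# With identical inputs, a swapped response is indistinguishable from a correct one.
identical = ["Spinal stenosis", "Spinal stenosis"]
swapped_identical = [{"text": identical[1]}, {"text": identical[0]}]
identical_check = first_order_mismatch(identical, swapped_identical)

# With distinct inputs, the same kind of reordering is caught at the first position.
distinct = ["Spinal stenosis", "Intracerebral hemorrhage", "Cerebellum"]
shuffled_distinct = [{"text": distinct[2]}, {"text": distinct[0]}, {"text": distinct[1]}]
distinct_check = first_order_mismatch(distinct, shuffled_distinct)
```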

tests/integration/features/serving_stream.feature

Lines changed: 2 additions & 0 deletions
@@ -1,13 +1,15 @@
 Feature:
   CogStack ModelServe Stream APIs
 
+  @ner-stream
   Scenario: Stream entities extracted from free texts
     Given CMS stream app is up and running
     When I send an async POST request with the following jsonlines content
       | endpoint | data | content_type |
       | /stream/process | {"name": "doc1", "text": "Spinal stenosis"}\n{"name": "doc2", "text": "Spinal stenosis"} | application/x-ndjson |
     Then the response should contain annotation stream
 
+  @ner-chat
   Scenario: Interactively extract entities from free texts
     Given CMS stream app is up and running
     When I send a piece of text to the WS endpoint

tests/integration/test_steps.py

Lines changed: 51 additions & 8 deletions
@@ -147,17 +147,22 @@ def check_response_bulk(context):
     assert context["response"].headers["Content-Type"] == "application/json"
     bulk_results = context["response"].json()
     assert isinstance(bulk_results, list)
-    assert len(bulk_results) == 2
+    assert len(bulk_results) == 3
     assert bulk_results[0]["text"] == "Spinal stenosis"
     assert bulk_results[0]["annotations"][0]["start"] == 0
     assert bulk_results[0]["annotations"][0]["end"] == 15
     assert bulk_results[0]["annotations"][0]["label_name"].lower() == "spinal stenosis"
     assert isinstance(bulk_results[0]["annotations"][0]["label_id"], str)
-    assert bulk_results[1]["text"] == "Spinal stenosis"
+    assert bulk_results[1]["text"] == "Intracerebral hemorrhage"
     assert bulk_results[1]["annotations"][0]["start"] == 0
-    assert bulk_results[1]["annotations"][0]["end"] == 15
-    assert bulk_results[1]["annotations"][0]["label_name"].lower() == "spinal stenosis"
+    assert bulk_results[1]["annotations"][0]["end"] == 24
+    assert bulk_results[1]["annotations"][0]["label_name"].lower() == "cerebral hemorrhage"
     assert isinstance(bulk_results[1]["annotations"][0]["label_id"], str)
+    assert bulk_results[2]["text"] == "Cerebellum"
+    assert bulk_results[2]["annotations"][0]["start"] == 0
+    assert bulk_results[2]["annotations"][0]["end"] == 10
+    assert bulk_results[2]["annotations"][0]["label_name"].lower() == "cerebellum"
+    assert isinstance(bulk_results[2]["annotations"][0]["label_id"], str)
     context["response"].close()
 
 @then(parsers.parse("the response should contain text {redaction}"))
@@ -174,10 +179,16 @@ def check_response_previewed(context):
 
 @when(data_table("I send a POST request with the following trainer export", fixture="request", orient="dict"))
 def send_post_training_request_file(context, request):
-    trainer_export_path = os.path.join(os.path.dirname(__file__), "..", "resources", "fixture", request[0]["file_name"])
-    with open(trainer_export_path, "rb") as f:
-        context["response"] = requests.post(f"{context['base_url']}{request[0]['endpoint']}",
-                                            files=[("trainer_export", f)])
+    trainer_export_names = request[0]["file_name"].split(",")
+
+    files = []
+    for trainer_export_name in trainer_export_names:
+        trainer_export_path = os.path.join(os.path.dirname(__file__), "..", "resources", "fixture", trainer_export_name)
+        file = open(trainer_export_path, "rb")
+        files.append(("trainer_export", file))
+
+    context["response"] = requests.post(f"{context['base_url']}{request[0]['endpoint']}", files=files)
+    [file.close() for _, file in files]
 
 @when(data_table("I send a POST request with the following training data", fixture="request", orient="dict"))
 def send_post_training_request_file(context, request):
@@ -245,6 +256,38 @@ def check_response_training_id(context):
     assert "encryption" in response_json["encryptions"][0]
     context["response"].close()
 
+@then("the response should contain evaluation metrics per concept")
+def check_response_sanity_check(context):
+    assert context["response"].status_code == 200
+    assert context["response"].headers["Content-Type"] == "text/csv; charset=utf-8"
+    response_lines = context["response"].content.decode("utf-8").splitlines()
+    assert len(response_lines) > 1
+    assert "concept,name,precision,recall,f1" == response_lines[0]
+    context["response"].close()
+
+@then("the response should contain IAA scores")
+def check_response_iaa(context):
+    assert context["response"].status_code == 200
+    assert context["response"].headers["Content-Type"] == "text/csv; charset=utf-8"
+    response_lines = context["response"].content.decode("utf-8").splitlines()
+    assert "iaa_percentage,cohens_kappa,iaa_percentage_meta,cohens_kappa_meta" in response_lines[0]
+    context["response"].close()
+
+@then("the response should contain a concatenated trainer export")
+def check_response_concatenated_trainer_export(context):
+    assert context["response"].status_code == 200
+    assert context["response"].headers["Content-Type"] == "application/json; charset=utf-8"
+    assert len(context["response"].text) == 36918
+
+@then("the response should contain annotation stats")
+def check_response_annotation_stats(context):
+    assert context["response"].status_code == 200
+    assert context["response"].headers["Content-Type"] == "text/csv; charset=utf-8"
+    response_lines = context["response"].content.decode("utf-8").splitlines()
+    assert len(response_lines) > 1
+    assert "concept,anno_count,anno_unique_counts,anno_ignorance_counts" == response_lines[0]
+    context["response"].close()
+
 @when(data_table("I send an async POST request with the following jsonlines content", fixture="request", orient="dict"))
 @async_to_sync
 async def send_async_post_request(context_stream, request):
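
In the reworked `send_post_training_request_file` step, the fixture files are opened in a loop and closed only after `requests.post` returns, so a failing request would leave the handles open. One possible alternative, sketched here rather than taken from the commit, is `contextlib.ExitStack`, which closes every registered file when the block exits. `open_upload_files` is a hypothetical helper, and temporary files stand in for the real fixtures:

```python
import contextlib
import os
import tempfile
from typing import BinaryIO, List, Tuple


def open_upload_files(stack: contextlib.ExitStack, paths: List[str],
                      field_name: str = "trainer_export") -> List[Tuple[str, BinaryIO]]:
    """Open every path in binary mode and register it on the stack, so all
    handles are closed when the stack exits, even if a later open() or the
    request itself raises."""
    return [(field_name, stack.enter_context(open(path, "rb"))) for path in paths]


# Demonstration with temporary files instead of the real fixture directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name in ("trainer_export.json", "another_trainer_export.json"):
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(b"{}")
        paths.append(path)

    with contextlib.ExitStack() as stack:
        files = open_upload_files(stack, paths)
        # In the real step, requests.post(url, files=files) would go here.
        handles = [handle for _, handle in files]
        all_open_inside = all(not h.closed for h in handles)
    all_closed_after = all(h.closed for h in handles)
```

The resulting `files` list has the same `(field_name, file_object)` shape the step already passes to `requests.post`.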
