
Commit 30f968b

Implements backend API, integrates with frontend (#4)
This PR adds the following:

- implements the entire meta2onto API (sorry, probably should've broken it up...)
- updates endpoint paths and field references in the frontend to match the backend
- adds a facility for the database to load the latest dump on startup, similar to ECCO
  - downloads the latest database dump if it's not already present when `run_stack.sh` is invoked
  - note that the dump is currently around 850 MB, and I expect that to grow
  - if the dump is missing or doesn't match the remote's exact file size in bytes (I presume this is enough to tell that it's different), you'll be prompted with a message to download it that includes its actual remote size (roughly as sketched below)
- adds `memcached` for caching expensive responses

Regarding the frontend, here are a few significant changes:

- the Cart component now directly downloads the response from `/api/cart/download` rather than getting a link and fetching that
- the Autocomplete listing now shows the ontology ID alongside the result (feel free to remove it if you like; that was mostly for me); it also passes the entity's name to the search page rather than its ID, to match the mockup
  - I vaguely recall them saying that they wanted to select an ontology ID from the list, but perhaps that's not the case? Right now, the search will match multiple ontology entities as long as their names are the same.
- the initial results are currently capped at 50, but some joins that exclude results can winnow that down to ~47; I'm looking into making it return exactly as many results as you ask for without blowing up the size early in the query
- Search and Cart were updated to reflect the field names returned by the backend

## Caveats

Note that the backend API structure is very much a work in progress. For example, I'd like to combine the GEO metadata series table with the regular Series table, decide whether it's going to be flat or nested, normalize how queries and pagination are done (there are some custom endpoints, for example), etc.

There's also currently no formal type definition for the API. I intend to add one, along with Swagger or ReDoc to surface that information better, as well as an OpenAPI schema endpoint for programmatic use.

The data model is also kind of all over the place, since I didn't know how (or, in many cases, whether) the data I was provided would be used in the app. I have a lot of cleanup to do on both the model and the interface, but I imagine it'll be best to address that in a future PR.

The way searches are conducted is currently complicated, but I figure I should get the PR out first; I'll explain it in a subsequent comment on this PR.

## Trying it Out

Since I was only given a subset of ontology terms for testing, only a small subset of entities in the ontology will actually match any results. (They're all disease entities from the MONDO ontology, and only a small selection of them at that.)

Here are a few entries that actually return responses, pulled from the `api_searchterms` table joined with `api_ontologyterms`:

| Ontology ID   | Query Phrase                     |
|---------------|----------------------------------|
| MONDO:0000270 | lower respiratory tract disorder |
| MONDO:0000637 | musculoskeletal system cancer    |
| MONDO:0001416 | female reproductive organ cancer |

---------

Co-authored-by: Vincent Rubinetti <[email protected]>
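For context, the dump-freshness check described above boils down to comparing the local file's size in bytes against the remote's. Here's a minimal Python sketch of that idea; the actual logic lives in `run_stack.sh`, and the URL and path below are placeholders rather than the project's real ones:

```python
# Rough sketch of the "missing or wrong size" check; DUMP_URL and LOCAL_DUMP
# are placeholders, not the actual values used by run_stack.sh.
import os
import urllib.request

DUMP_URL = "https://example.org/db-exports/latest.dump"  # placeholder
LOCAL_DUMP = "db-exports/latest.dump"                    # placeholder

def dump_needs_download() -> bool:
    """True if the local dump is absent or its byte size differs from the remote's."""
    head = urllib.request.Request(DUMP_URL, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        remote_size = int(resp.headers["Content-Length"])
    if not os.path.exists(LOCAL_DUMP):
        print(f"Dump not found; remote dump is {remote_size} bytes.")
        return True
    if os.path.getsize(LOCAL_DUMP) != remote_size:
        print(f"Dump size mismatch; remote dump is {remote_size} bytes.")
        return True
    return False
```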
1 parent cb13ca8 commit 30f968b

File tree

72 files changed: +5473 lines, −213 lines


.gitignore

Lines changed: 4 additions & 0 deletions

@@ -217,3 +217,7 @@ __marimo__/

 # MacOS stuff
 .DS_Store
+
+# project-specific ignores
+# ignore dumpfiles that aren't explicitly checked in
+/db-exports/*.dump

backend/Dockerfile

Lines changed: 1 addition & 2 deletions

@@ -1,5 +1,4 @@
-# Use Python 3.10 slim image as base
-FROM python:3.10-slim
+FROM python:3.12-slim

 # Set working directory
 WORKDIR /app

backend/pyproject.toml

Lines changed: 6 additions & 0 deletions

@@ -10,8 +10,14 @@ dependencies = [
     "django-rest-framework>=0.1.0",
     "drf-nested-routers>=0.95.0",
     "gunicorn>=23.0.0",
+    "pandas>=2.3.3",
+    "tqdm>=4.67.1",
     "pgpq>=0.9.0",
     "psycopg[binary]>=3.2.10",
+    "django-filter>=25.2",
+    "python-memcached>=1.62",
+    "pymemcache>=4.0.0",
+    "duckdb>=1.4.2",
 ]

 # [build-system]
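The `python-memcached`/`pymemcache` additions correspond to the response caching mentioned in the commit message. As a rough sketch, a pymemcache-backed Django cache is typically configured like this; the `memcached:11211` location assumes a service named `memcached` in the compose stack and is not taken from this repo's settings:

```python
# Sketch only: assumes a "memcached" service reachable on the default port.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "memcached:11211",
    }
}
```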

backend/src/api/__init__.py

Whitespace-only changes.

backend/src/api/admin.py

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+from django.contrib import admin
+
+from api.models import Organism, Platform, Sample, Series, SeriesRelations
+
+admin.site.register(Organism)
+admin.site.register(Platform)
+admin.site.register(Sample)
+admin.site.register(Series)
+admin.site.register(SeriesRelations)

backend/src/api/apps.py

Lines changed: 6 additions & 0 deletions

@@ -0,0 +1,6 @@
+from django.apps import AppConfig
+
+
+class ApiConfig(AppConfig):
+    default_auto_field = "django.db.models.BigAutoField"
+    name = "api"

backend/src/api/management/__init__.py

Whitespace-only changes.

backend/src/api/management/commands/__init__.py

Whitespace-only changes.
Lines changed: 80 additions & 0 deletions

@@ -0,0 +1,80 @@
+import io
+from pathlib import Path
+
+from django.core.management.base import BaseCommand
+from django.db import connection, transaction
+
+import pyarrow.parquet as pq
+import pyarrow.csv as pacsv
+import pyarrow.compute as pc
+
+from api.models import SearchTerm
+
+class Command(BaseCommand):
+    help = "Import SearchTerm rows from a Parquet file using PostgreSQL COPY"
+
+    def add_arguments(self, parser):
+        parser.add_argument("parquet_path", help="Path to meta2onto_example_predictions.parquet")
+        parser.add_argument(
+            "--table",
+            default=SearchTerm._meta.db_table,
+            help="Target DB table name (default: SearchTerm._meta.db_table)",
+        )
+
+    def handle(self, *args, **options):
+        parquet_path = Path(options["parquet_path"])
+        table_name = options["table"]
+
+        # first, truncate SearchTerm
+        self.stdout.write(f"Truncating table {table_name} ...")
+        with connection.cursor() as cursor:
+            cursor.execute(f"TRUNCATE TABLE {table_name} RESTART IDENTITY CASCADE;")
+        self.stdout.write(self.style.SUCCESS(f"Table {table_name} truncated."))
+
+        # 1) Read Parquet
+        self.stdout.write(f"Reading Parquet file: {parquet_path}")
+        table = pq.read_table(parquet_path)
+
+        # 2) Select and rename columns to match DB schema
+        #    Parquet: term, ID, prob, log2(prob/prior), related_words
+        #    DB:      term, sample_id, prob, log2_prob_prior, related_words
+        table = table.select(["term", "ID", "prob", "log2(prob/prior)", "related_words"])
+        table = table.rename_columns(
+            ["term", "sample_id", "prob", "log2_prob_prior", "related_words"]
+        )
+
+        # # If ID might be missing / null, ensure it's numeric or null
+        # # (Arrow usually infers this correctly; adjust if needed)
+        # if table["sample_id"].type not in (pc.field("dummy", pc.int64()).type,):
+        #     # Try to cast to int64; errors='ignore' will produce nulls for bad values
+        #     table = table.set_column(
+        #         table.schema.get_field_index("sample_id"),
+        #         "sample_id",
+        #         pc.cast(table["sample_id"], pc.int64()),
+        #     )
+
+        # 3) Write to an in-memory CSV with header
+        self.stdout.write("Converting Arrow table to CSV in memory...")
+        buf = io.BytesIO()
+        pacsv.write_csv(table, buf)
+        buf.seek(0)
+
+        # 4) COPY into PostgreSQL using psycopg3's copy()
+        self.stdout.write(f"Copying into table {table_name} ...")
+        cols = ["term", "sample_id", "prob", "log2_prob_prior", "related_words"]
+        copy_sql = f"""
+            COPY {table_name} ({", ".join(cols)})
+            FROM STDIN WITH (FORMAT csv, HEADER true)
+        """
+
+        # buf is bytes; psycopg3 Copy.write() accepts bytes for text/binary COPY
+        with transaction.atomic():
+            with connection.cursor() as cursor:
+                with cursor.copy(copy_sql) as copy:  # <-- psycopg3 API
+                    while True:
+                        chunk = buf.read(1024 * 1024)  # 1 MB chunks
+                        if not chunk:
+                            break
+                        copy.write(chunk)
+
+        self.stdout.write(self.style.SUCCESS("Import completed successfully."))
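For reference, a management command like the one above can also be invoked programmatically via Django's `call_command`; the command name used here is an assumption, since the module's file name isn't shown in this listing:

```python
# Hypothetical invocation; "import_searchterms" is an assumed command name.
from django.core.management import call_command

call_command("import_searchterms", "data/meta2onto_example_predictions.parquet")
```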
