Commit afb81a3

Update formats (#16)
* add reading hd examiner peptide pool
* prepare for identify by path
* add protein filter field
* refactor loader to reader, rework formats
* remove combined schema
* fix peptide pool reader and add tests
* add kingfisher HD examiner example file
* refactor peptide pool reader
* refactor: update format handling and improve peptide loading functionality
* example of only reading files
* fix loading datasets example
* add protein field to docs
* add hd examiner peptides files
* cast exposure can also raise ValueError
* group by protein and state such that dataframes can be aggregated in batch
* fix hd examiner format identification
* aggregate replicate, cluster and charge count
* add summary function
* add uptake summary table
* add load method to formats
* fix format names
* add uptake summary converter
* renew test data with n_charges column
* expand docs on n_charges
* add function to find offset between structure and peptides
* update docstring and allow selecting multiple columns on join
* make publication title required
* support loading from .zip files
* allow loading from dir also when there is only 1 dataset present
* delete comment
* pass exception when format is not a string
* formatting
* remove unused code from zip example
* check for stringIO type, format file
* add loading .zip test
1 parent 046a3f0 commit afb81a3

26 files changed, +3522 -1662 lines changed

docs/fields.md

Lines changed: 9 additions & 0 deletions
@@ -13,6 +13,12 @@ residue number of the last amino acid in the peptide
 ### sequence (str)
 fasta sequence of the peptide
 
+### protein (str)
+protein name or identifier
+
+HDExaminer name: Protein
+DynamX name: Protein
+
 ### state (str)
 state label
 
@@ -93,6 +99,9 @@ These fields are derived from other fields defined in the above sections.
 added after data aggregation
 Total number of replicates that were aggregated together
 
+### n_charges
+Total number of different charged states that were aggregated together
+
 ### n_clusters
 added after data aggregation
 Total number of isotopic clusters that were aggregated together. When replicates include multiple isotopic clusters (different charged states), this value will be larger than n_replicates.
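
The relationship between `n_replicates`, `n_charges` and `n_clusters` described in docs/fields.md can be illustrated with a small sketch. This is not the library's aggregation code: the input column names `replicate` and `charge`, the grouping key, and the sample rows are assumptions made up for the illustration, and only the standard library is used.

```python
from collections import defaultdict

# Hypothetical un-aggregated input: one row per isotopic cluster.
# The "replicate" and "charge" keys are assumed names, not confirmed by the source.
rows = [
    {"protein": "SecB", "state": "apo", "start": 1, "end": 10, "exposure": 30.0,
     "replicate": "r1", "charge": 2},
    {"protein": "SecB", "state": "apo", "start": 1, "end": 10, "exposure": 30.0,
     "replicate": "r1", "charge": 3},
    {"protein": "SecB", "state": "apo", "start": 1, "end": 10, "exposure": 30.0,
     "replicate": "r2", "charge": 2},
]


def summarize(rows):
    """Count replicates, charge states and clusters per peptide and exposure."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["protein"], row["state"], row["start"], row["end"], row["exposure"])
        groups[key].append(row)

    summary = []
    for (protein, state, start, end, exposure), members in groups.items():
        summary.append({
            "protein": protein,
            "state": state,
            "start": start,
            "end": end,
            "exposure": exposure,
            # distinct replicates that were merged
            "n_replicates": len({m["replicate"] for m in members}),
            # distinct charge states that were merged
            "n_charges": len({m["charge"] for m in members}),
            # every input row is one isotopic cluster
            "n_clusters": len(members),
        })
    return summary


summary = summarize(rows)
print(summary[0]["n_replicates"], summary[0]["n_charges"], summary[0]["n_clusters"])
```

In this sample, two replicates contribute three isotopic clusters in total, so `n_clusters` exceeds `n_replicates`, which is the situation the `n_clusters` description refers to.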

docs/hd_examiner_files/HDX export file test.csv

Lines changed: 290 additions & 0 deletions
Large diffs are not rendered by default.

docs/hd_examiner_formats.md

Lines changed: 19 additions & 0 deletions
@@ -159,6 +159,25 @@ FD control: 'MAX' (older version)
 Comments:
 
 
+### Kingfisher HD examiner example
+
+File: HDX export file test.csv
+Source: https://github.com/juan2089/Kingfisher-HDX/blob/Kingfisher-v1.1/www/HDX%20export%20file%20test.csv
+
+Columns:
+The first line is a header with exposure times.
+
+The second line has the column names, starting with:
+'State,Protein,Start,End,Sequence,Search RT,Charge,Max D,'
+
+Followed by repeating blocks of:
+'Start RT,End RT,#D,%D,#D right,%D right,Score,Conf,'
+Format: (almost!) HD examiner summary file
+
+This is a HD examiner 'peptide pool' file
+
+
+
 ## HD Examiner manual on exporting data
 
 **Peptide Pool Results / Uptake Summary Table**
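
The two-line header layout described for the Kingfisher example (an exposure row, then eight fixed columns followed by repeating eight-column blocks) could be unpacked roughly as follows. This is a sketch under stated assumptions, not the reader added in this commit: it assumes each exposure label sits somewhere above its own block in the first line, the sample data is invented, and `parse_peptide_pool` and `make_sample` are hypothetical names.

```python
import csv
import io

# Column names quoted in docs/hd_examiner_formats.md
FIXED = ["State", "Protein", "Start", "End", "Sequence", "Search RT", "Charge", "Max D"]
BLOCK = ["Start RT", "End RT", "#D", "%D", "#D right", "%D right", "Score", "Conf"]


def make_sample():
    """Build a small made-up file with two exposure blocks (assumed layout)."""
    line1 = [""] * len(FIXED)
    line2 = list(FIXED)
    for label in ["10s", "100s"]:
        line1 += [label] + [""] * (len(BLOCK) - 1)
        line2 += BLOCK
    data = ["apo", "SecB", "1", "10", "MSEQNNTEMT", "5.2", "2", "8.0",
            "5.1", "5.3", "1.2", "15.0", "1.1", "14.0", "0.9", "High",
            "5.1", "5.3", "2.4", "30.0", "2.3", "29.0", "0.9", "High"]
    buf = io.StringIO()
    csv.writer(buf).writerows([line1, line2, data])
    return buf.getvalue()


def parse_peptide_pool(text):
    """Turn the wide layout into one record per peptide and exposure."""
    reader = csv.reader(io.StringIO(text))
    exposure_row = next(reader)
    names = next(reader)
    # a trailing comma in the header yields empty trailing cells; drop them
    while names and names[-1] == "":
        names.pop()

    n_fixed = len(FIXED)
    if names[:n_fixed] != FIXED:
        raise ValueError("unexpected fixed columns")
    n_blocks, remainder = divmod(len(names) - n_fixed, len(BLOCK))
    if remainder:
        raise ValueError("column count does not match whole blocks")

    records = []
    for row in reader:
        fixed = dict(zip(FIXED, row[:n_fixed]))
        for i in range(n_blocks):
            lo = n_fixed + i * len(BLOCK)
            hi = lo + len(BLOCK)
            # assumption: the exposure label is the first non-empty cell above the block
            label = next((cell for cell in exposure_row[lo:hi] if cell), None)
            records.append({**fixed, "Exposure": label, **dict(zip(BLOCK, row[lo:hi]))})
    return records


records = parse_peptide_pool(make_sample())
print(len(records), records[0]["Exposure"], records[1]["#D"])
```

All values stay as strings here; casting exposure labels to numbers is left out because the real label format in the export file is not shown in this diff.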

examples/from_hxms_file.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 from typing import Optional
 
 from hdxms_datasets.database import populate_known_ids, submit_dataset
-from hdxms_datasets.loader import (
+from hdxms_datasets.reader import (
     read_hxms,
 )
 from hdxms_datasets.models import (

examples/from_zip_file.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+from hdxms_datasets import load_dataset
+from pathlib import Path
+
+DATA_ID = "HDX_C1198C76" # SecA DynamX state data
+DATA_ID = "HDX_D9096080" # SecB DynamX state data
+
+fname = "HDX_3BAE2080.zip" # Example dataset in a zip file
+
+# %%
+test_pth = Path(__file__).parent.parent / "tests"
+database_dir = test_pth / "datasets"
+
+dataset = load_dataset(database_dir / fname) # Should load the dataset from the zip file
+
+print(dataset.states)
+
+# %%
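
For readers curious how data files can be pulled out of a `.zip` archive without extracting it to disk, here is a minimal standard-library sketch. It does not show how `load_dataset` is implemented. The archive layout, the member names and the helper `read_text_members` are hypothetical, and handing text to downstream readers as `io.StringIO` objects is only one plausible approach.

```python
import io
import zipfile


def read_text_members(zip_source, suffix=".csv"):
    """Return {member name: StringIO} for each archive member ending in `suffix`."""
    members = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                text = archive.read(name).decode("utf-8")
                members[name] = io.StringIO(text)
    return members


# Build a tiny in-memory archive so the example is self-contained.
# The layout below is invented for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("HDX_EXAMPLE/data/peptides.csv", "start,end\n1,10\n")
    archive.writestr("HDX_EXAMPLE/dataset.json", "{}")
buffer.seek(0)

members = read_text_members(buffer)
print(list(members))
```

`zipfile.ZipFile` accepts a path as well as a file-like object, so the same helper would work on a file on disk.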

examples/load_local_dynamx_cluster.py

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@
 plot_peptides(selected, domain=(0, 1), value="frac_max_uptake")
 
 # %%
-peptides = dataset.states[0].peptides[0]
+peptides = dataset.states[0].peptides[0].load()
 StructureView(dataset.structure).peptide_coverage(peptides)
 
 # %%

examples/load_local_dynamx_state.py

Lines changed: 5 additions & 5 deletions
@@ -44,9 +44,9 @@
 # load the partially deuterated peptides
 df = state.peptides[0].load(
     convert=True,
-    aggregate=True,
-    # sort_rows=True,
-    # sort_columns=True,
+    aggregate=None, # dynamx state data is already aggregated
+    sort_rows=True,
+    sort_columns=True,
 )
 print(df.columns)
 # > ['start', 'end', 'sequence', 'state', 'exposure', 'centroid_mz', 'rt', 'rt_sd', 'uptake', 'uptake_sd']
@@ -112,12 +112,12 @@
 # %%
 # show a single peptide
 start, end = processed["start", "end"].row(10)
-view = StructureView(dataset.structure).color_peptide(start, end, chain=["A"])
+view = StructureView(dataset.structure).color_peptide(start, end)
 view
 
 # %%
 # select a set of peptides for further viusualization
-peptides = dataset.states[0].peptides[0]
+peptides = dataset.states[0].peptides[0].load()
 
 # %%
 # show regions of the structure that are covered by peptides

examples/load_local_hdexaminer.py

Lines changed: 0 additions & 1 deletion
@@ -35,7 +35,6 @@
 selected = processed.filter(nw.col("exposure") == exposure_value)
 plot_peptides(selected.to_polars(), value="frac_max_uptake", domain=(0, 1))
 # %%
-# %%
 
 peptides = dataset.states[0].peptides[0]
 StructureView(dataset.structure).peptide_coverage(selected)

examples/read_files.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# %%
+
+from pathlib import Path
+
+from hdxms_datasets import identify_format
+
+# %%
+
+cwd = Path(__file__).parent
+
+# %%
+
+# read a hxms file
+f = cwd / "test_data" / "ecDHFR" / "ecDHFR_2025-09-23_APO.hxms"
+
+fmt_spec = identify_format(f)
+# read to dataframe
+df = fmt_spec.read(f)
+
+# convert to open-hdx format
+df_converted = fmt_spec.convert(df)
+df_converted.to_native()
+
+# %%
+# read an dynamx file
+f = cwd / "test_data" / "ecSecB" / "ecSecB_apo.csv"
+fmt_spec = identify_format(f)
+# read to dataframe
+df = fmt_spec.read(f)
+
+# convert to open-hdx format
+df_converted = fmt_spec.convert(df)
+df_converted.to_native()
+# %%

examples/test_data/ecDHFR/notes.md

Lines changed: 4 additions & 0 deletions
@@ -4,3 +4,7 @@ the HXMS file format manuscript.
 Correct source:
 https://www.biorxiv.org/content/10.1101/2025.10.14.682397v1.supplementary-material
 
+
+
+ecDHFR tutorial.csv
+Source: https://huggingface.co/spaces/glasgow-lab/PFLink
