Commit c2696ae

hdxms-datasets version 0.3.0 (#15)
* remove residue offset
* use cif structure in example
* assign filter result to variable
* add peterle et al test data
* delete test data
* store current known ids
* add hdx_id to spec
* add hdx_id to spec
* remove unused import of re from models.py
* only check for unique if state column is present
* update fields information
* pull out extra column conversion to function
* add converter for hxms format
* cleanup double check of dataset id
* add hxms format
* add loading of hxms format
* make license explicit requirement
* add hxms peptide format
* add mock intensity column if missing
* add frameslicer util
* generate residue dataframe from sequence
* bump narwhals requirement
* fix typo
* rename conversion example scripts
* add hxms file format example
* StructureView takes a mapping to relate peptides to structure
* custom serializers for filters
* add new StructureMapping model
* move to StructureMapping
* update to use StructureMapping model
* add cli docs
* cli docs tweaks
* update to reflect changes to argument vs prompt
* cleanup hxms file
* tweak template format
* refactor structure, publish to output dir
* add comment
* update folder name for better readability
* refactor publication and metadata handling in dataset definition
* validate for unique state names
* formatting
* Add MVP dataset builder
* add dataset builder docs
* update gitignore
* add clear data test button
* push test data file through backend
* add test data to frontend
* load test data in both front and backend
* fix enable/disable steps
* use computed getters
* limit uploaded structures to one
* remove 'confirmed format' references
* read and cache dataframes
* Enhance DataframeCache with LRU eviction and session tracking; add comprehensive tests for cache functionality
* add structure metadata section
* fix typing and unused import
* merge steps 2 and 3
* fixes in protein identifier names
* remove structure information section
* fix step name
* add structure mappings
* fix structure mapping chain input
* remove unused format selection, move other elements
* small tweaks
* add data filters depending on format
* v1 filters
* filters should be record
* fix filters in ui
* sort filter values
* add update peptide filter action to data store
* filter dynamx by protein, state, exposure
* add percentage deuterium to cli create
* mount static files for production
* mount health under /api
* script for starting in preview mode
* add session id in header in dev mode
* remove dataset builder
* add cli tests
* update secb dataset to 030 format
* remove intensity column from reference data
* rename vault to db
* allow dictionary mapping of residue numbers
* use mapping.map to map residue numbers from hdxms to structure
* return df and don't write it
* output to the same directory as input data dir
* add DHFR test data
* add n_replicates and n_clusters fields
* script for regenerating processed test data
* parameterize processing test
* we don't expect sequence after reading hdxs files
* move clear function to remote database
* add ecDHFR HXMS format data to readme
* try/except loading peptides
* import submit dataset from init
* feat: strict checking, including version number
* feat: update Python version matrix to include 3.14
* remove unused imports
* drop py3.10, add py3.14
* fix: don't use color in CLI test invocations
* update python versions, add uv lockfile
* update python versions
* merge lockfiles
* need only one lockfile per matrix
* update pinned requirements
* add uv lockfile to requirements
* fix: no color via env var
* strip ansi codes
1 parent 6399cd5 commit c2696ae


63 files changed: +76,929 −3,152 lines

.github/workflows/pin_requirements.yml (4 additions, 2 deletions)

@@ -10,7 +10,7 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest, windows-latest, macOS-latest]
-        python-version: ["3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.11", "3.12", "3.13", "3.14"]
     steps:
       - name: Checkout code
         uses: actions/checkout@v3
@@ -29,11 +29,13 @@ jobs:
         with:
           name: req-artifact-${{ matrix.os }}-${{ matrix.python-version }}
           path: requirements-${{ matrix.os }}-${{ matrix.python-version }}.txt
+
+
   merge:
     runs-on: ubuntu-latest
     needs: generate-requirements
     steps:
-      - name: Merge Artifacts
+      - name: Merge Requirements Artifacts
        uses: actions/upload-artifact/merge@v4
        with:
          name: all-requirements

.github/workflows/pytest.yml (1 addition, 1 deletion)

@@ -6,7 +6,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.11", "3.12", "3.13", "3.14"]
     runs-on: ubuntu-latest
     steps:
       - name: Check out repository

.gitignore (13 additions, 1 deletion)

@@ -126,4 +126,16 @@ __datasets/
 dev/
 
 # unpublished datasets
-datasets_private/
+datasets_private/
+
+# Node
+node_modules/
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+dist/
+dist-ssr/
+*.local
+
+
+.claude

docs/cli.md (154 additions, new file)

# Command Line Interface (CLI)

The `hdxms-datasets` package provides a command-line interface to help you create and manage HDX-MS datasets.

## Installation

First, install the package with the CLI dependencies:

```bash
pip install -e .
```

After installation, the `hdxms-datasets` command will be available in your terminal.

## Commands

### `hdxms-datasets create`

Create a new HDX-MS dataset with a unique ID and template script.

**Basic usage:**

```bash
hdxms-datasets create
```

This will:

1. Generate a unique HDX dataset ID (e.g., `HDX_A1B2C3D4`)
2. Create a new directory in the current directory: `<HDX_ID>/`
3. Generate a template `create_dataset.py` script with configuration
4. Create a `data/` subdirectory for your raw data files
5. Generate a `README.md` with quick start instructions

**Options:**

- `--num-states, -n INTEGER`: Number of protein states (default: 1)
- `--format, -f CHOICE`: Data format: OpenHDX, DynamX_v3_state, DynamX_v3_cluster, or HDExaminer (default: OpenHDX)
- `--ph FLOAT`: Experimental pH (default: 7.5)
- `--temperature, -t FLOAT`: Temperature in Kelvin (default: 293.15)
- `--database-dir, -d PATH`: Path to an existing database directory to check for ID conflicts
- `--help`: Show help message

**Examples:**

```bash
# Create with defaults (OpenHDX, 1 state, pH 7.5, 20°C)
hdxms-datasets create

# Create with custom parameters
hdxms-datasets create --num-states 2 --format DynamX_v3_state --ph 8.0 --temperature 298.15

# Using short flags
hdxms-datasets create -n 3 -f HDExaminer --ph 7.0 -t 293.15

# Check for ID conflicts with an existing database
hdxms-datasets create --database-dir ~/hdx-database/datasets
```

## Configuration via Arguments

All dataset configuration is specified via command-line arguments:

- **Number of states** (`--num-states`): How many different protein states you measured (default: 1)
- **Data format** (`--format`): Which software generated your data (default: OpenHDX)
  - `OpenHDX` - OpenHDX format
  - `DynamX_v3_state` - DynamX state files
  - `DynamX_v3_cluster` - DynamX cluster files
  - `HDExaminer` - HDExaminer files
- **pH** (`--ph`): Experimental pH value (default: 7.5)
- **Temperature** (`--temperature`): Temperature in Kelvin (default: 293.15 K = 20°C)

## Workflow Example

```bash
# Step 1: Create a new dataset with custom parameters
$ hdxms-datasets create --num-states 2 --format DynamX_v3_state --ph 8.0

✓ Generated new dataset ID: HDX_A1B2C3D4
============================================================
✓ Dataset template created successfully!
============================================================

Dataset ID:  HDX_A1B2C3D4
Location:    C:\Users\username\HDX_A1B2C3D4
Format:      DynamX_v3_state
States:      2
pH:          8.0
Temperature: 293.15 K (20.0°C)

Next steps:
1. cd HDX_A1B2C3D4
2. Place your data files in the data/ directory
3. Edit create_dataset.py with your specific information
4. python create_dataset.py

# Step 2: Navigate to the new directory
$ cd HDX_A1B2C3D4

# Step 3: Copy your data files
$ copy C:\path\to\my\data.csv data\

# Step 4: Edit the template script
$ notepad create_dataset.py
# Edit the file with your specific information:
# - Replace protein sequences
# - Update data file names
# - Add author information
# - Add publication details

# Step 5: Run the script to create your dataset
$ python create_dataset.py
✓ Dataset submitted successfully with ID: HDX_A1B2C3D4
  Dataset location: C:\Users\username\HDX_A1B2C3D4\dataset\HDX_A1B2C3D4
```

## Generated Template Structure

After running `hdxms-datasets create`, you'll have:

```
HDX_A1B2C3D4/
├── create_dataset.py   # Template script to edit
├── README.md           # Quick start guide
└── data/               # Directory for your raw data files
```

The `create_dataset.py` template includes:

- Clearly marked sections to edit
- Inline comments explaining each field
- A list-based structure for protein states and peptides (flexible and easy to extend)
- Pre-configured pH and temperature values from your command-line arguments
- Example values to guide you
- Automatic sequence verification
- Dataset submission code

Please note that this template is not exhaustive; other metadata fields may be used depending on your dataset's requirements. A rough sketch of such a template is shown below.
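The following sketch gives a sense of the generated script's shape. It is not the verbatim template: the model names (`Publication`, `Author`, `DatasetMetadata`, `ProteinIdentifiers`, `Structure`, `Peptides`, `PeptideFormat`, `HDXDataSet`) all appear in this commit's example-script diff, but the import paths, placeholder values, and overall layout here are assumptions.

```python
# Hypothetical sketch of a generated create_dataset.py; not the verbatim
# template. Model names come from this commit's example script; import
# paths and placeholder values are assumptions.
from pathlib import Path

from hdxms_datasets import (  # assumed top-level exports
    Author,
    DatasetMetadata,
    HDXDataSet,
    Peptides,
    PeptideFormat,
    ProteinIdentifiers,
    Publication,
    Structure,
)

data_dir = Path(__file__).parent / "data"

# --- EDIT: authorship, publication, and license ---
metadata = DatasetMetadata(
    authors=[Author(name="Your Name", affiliation="Your Institution")],
    publication=Publication(title="...", doi="...", url="..."),
    license="CC0",  # the preferred / default license
)

# --- EDIT: protein identifiers and structure ---
protein_info = ProteinIdentifiers(
    uniprot_accession_number="P68082",  # example value
    uniprot_entry_name="MYG_HORSE",     # example value
)
structure = Structure(
    data_file=data_dir / "structure.cif",
    format="cif",
    pdb_id="1AZI",  # example value
)

# --- EDIT: one Peptides entry per measurement ---
peptides = Peptides(
    data_file=data_dir / "peptides.csv",
    data_format=PeptideFormat.OpenHDX,        # pre-filled from --format
    deuteration_type="partially_deuterated",  # assumed literal
    pH=7.5,              # pre-filled from --ph
    temperature=293.15,  # pre-filled from --temperature
)

# --- EDIT: assemble states and submit ---
dataset = HDXDataSet(
    states=[...],  # one state object per protein state (see template comments)
    description="Short description of the dataset",
    metadata=metadata,
    protein_identifiers=protein_info,
    structure=structure,
)
```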
## Future Commands (Planned)

The CLI is designed to be extensible. Future commands may include:

- `hdxms-datasets validate`: Validate a dataset before submission
- `hdxms-datasets upload`: Upload a dataset to a remote database
- `hdxms-datasets export`: Export a dataset to different formats

## Getting Help

For more information about any command:

```bash
hdxms-datasets --help
hdxms-datasets create --help
```

docs/fields.md (16 additions, 1 deletion)

@@ -1,3 +1,9 @@
+# Fields
+
+This document describes the fields used in open-hdxms files. The fields are divided into required, optional, and calculated fields.
+
+Some fields can be either calculated from raw data (e.g. uptake) or provided directly.
+
 ### start (int)
 residue number of the first amino acid in the peptide
 
@@ -12,6 +18,7 @@ state label
 
 DynamX state/cluster name: State
 HDExaminer name: Protein State
+hxms name: PROTEIN_STATE
 
 ### replicate (str)
 Label for the replicate
@@ -64,7 +71,9 @@ DynamX name?? is this max or mean intensity?
 These fields can be present in open-hdxms files, but can also be calculated from the other fields.
 
 ### max_uptake (int)
-Theoretical maximum deuterium uptake for the peptide. Typically equal to the number of amide hydrogens, thus number of non-proline residues minus one.
+Theoretical maximum deuterium uptake for the peptide. Equal to the number of
+non-proline residues. Note that back-exchange is not considered here, including
+back-exchange of the N-terminal amide.
 
 
 ### uptake (float)
@@ -80,7 +89,13 @@ Standard deviation of the uptake value
 ## Calculated fields:
 These fields are derived from other fields defined in the above sections.
 
+### n_replicates
+Added after data aggregation.
+Total number of replicates that were aggregated together.
 
+### n_clusters
+Added after data aggregation.
+Total number of isotopic clusters that were aggregated together. When replicates include multiple isotopic clusters (different charge states), this value will be larger than n_replicates.
 
 ### frac_fd_control (float)
 Fractional deuterium uptake with respect to fully deuterated control sample
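The revised `max_uptake` definition above (number of non-proline residues, with no back-exchange correction) is straightforward to compute from a peptide sequence. A minimal, package-independent sketch in Python:

```python
def max_uptake(sequence: str) -> int:
    """Theoretical maximum deuterium uptake for a peptide, following the
    definition above: the number of non-proline residues. Back-exchange,
    including that of the N-terminal amide, is deliberately ignored."""
    return sum(1 for residue in sequence.upper() if residue != "P")


# A 9-residue peptide with no prolines vs. the same peptide with one proline:
assert max_uptake("HGVTVLTAL") == 9
assert max_uptake("HGVTVPTAL") == 8
```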

dynamx_state.pq

Binary file not shown (−23.2 KB).

Peterle et al. example script (file name not shown in this view): 18 additions, 24 deletions

@@ -138,23 +138,6 @@
 
 # %%
 
-pub = Publication(
-    title="Simple and Fast Maximally Deuterated Control (maxD) Preparation for Hydrogen-Deuterium Exchange Mass Spectrometry Experiments",
-    doi="10.1021/acs.analchem.2c01446",
-    url="https://pubs.acs.org/doi/10.1021/acs.analchem.2c01446",
-)
-
-# %%
-# Make sure to add the correct license for your dataset
-# If you are the author, you can choose any license you like
-# The preferred / default license is CC0
-metadata = DatasetMetadata(  # type: ignore[call-arg]
-    authors=[Author(name="Daniele Peterle", affiliation="Northeastern University")],
-    publication=pub,
-    license="CC BY-NC 4.0",
-    conversion_notes="Converted published Supplementary data",
-)
-
 protein_info = ProteinIdentifiers(
     uniprot_accession_number="P68082",
     uniprot_entry_name="MYG_HORSE",
@@ -173,10 +156,8 @@
 structure = Structure(
     data_file=data_dir / "1azi.cif",
     format="cif",
-    description="",
+    description="MYOGLOBIN (HORSE HEART) RECOMBINANT WILD-TYPE COMPLEXED WITH AZIDE",
     pdb_id="1AZI",
-    residue_offset=0,  # HDX data residue numbers match the PDB, no offset
-    auth_residue_numbers=False,  # HDX data residue numbers are RCSB numbering (not author or is the same)
 )
 
 # define the sequence in this protein state
@@ -217,20 +198,18 @@
     pH=7.1,
     temperature=20 + 273.15,
     d_percentage=90.0,
-    chain=["A"],
 )
 
 fd_peptides = Peptides(  # type: ignore[call-arg]
     data_file=data_dir / "1_Mb_fd_peptides.csv",
     data_format=PeptideFormat.OpenHDX,
     deuteration_type="fully_deuterated",
     d_percentage=90.0,
-    chain=["A"],
 )
 
 # %%
 # we can create a view of the structure and for example check peptide redundancy
-StructureView(structure).peptide_redundancy(pd_peptides)
+StructureView(structure).peptide_redundancy(pd_peptides.load())
 
 # %%
 # This dataset has only one state, which is WT
@@ -242,10 +221,25 @@
 
 # %%
 
+pub = Publication(
+    title="Simple and Fast Maximally Deuterated Control (maxD) Preparation for Hydrogen-Deuterium Exchange Mass Spectrometry Experiments",
+    doi="10.1021/acs.analchem.2c01446",
+    url="https://pubs.acs.org/doi/10.1021/acs.analchem.2c01446",
+)
+
+# Make sure to add the correct license for your dataset
+# If you are the author, you can choose any license you like
+# The preferred / default license is CC0
+
 dataset = HDXDataSet(  # type: ignore[call-arg]
     states=[state],
     description="1 Mb dataset from Peterle et al. 2022",
-    metadata=metadata,
+    metadata=DatasetMetadata(  # type: ignore[call-arg]
+        authors=[Author(name="Daniele Peterle", affiliation="Northeastern University")],
+        publication=pub,
+        license="CC BY-NC 4.0",
+        conversion_notes="Converted published Supplementary data",
+    ),
     protein_identifiers=protein_info,
     structure=structure,
 )
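The `residue_offset` and `auth_residue_numbers` fields removed above are superseded by the new `StructureMapping` model named in the commit message ("allow dictionary mapping of residue numbers"; "use mapping.map to map residue numbers from hdxms to structure"). The diff does not show `StructureMapping`'s actual signature, so the following is a concept-only sketch of dictionary-based residue mapping, not the library's API:

```python
# Concept sketch only: NOT the hdxms-datasets StructureMapping API.
# Illustrates mapping HDX-MS residue numbers onto structure numbering,
# either with a uniform offset or an explicit dictionary.
def map_residues(
    hdx_residues: list[int],
    mapping: dict[int, int] | None = None,
    offset: int = 0,
) -> list[int]:
    """Map HDX residue numbers to structure residue numbers."""
    if mapping is not None:
        # irregular numbering (e.g. gaps or renumbered regions)
        return [mapping[r] for r in hdx_residues]
    # common case: a constant shift between the two numbering schemes
    return [r + offset for r in hdx_residues]


print(map_residues([1, 2, 3], offset=23))                    # [24, 25, 26]
print(map_residues([1, 2, 3], mapping={1: 5, 2: 6, 3: 10}))  # [5, 6, 10]
```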
